Fighting Words

Based on Monroe et al.’s Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.

Implementation adapted from Jack Hessel’s implementation.

Example usage: finding the fighting words of r/atheism and r/Christianity.

class convokit.fighting_words.fightingWords.FightingWords(obj_type='utterance', text_func=None, cv=None, ngram_range=None, prior=0.1, class1_attribute_name='fighting_words_class1', class2_attribute_name='fighting_words_class2')

Based on Monroe et al.’s “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict”

Implementation adapted from Jack Hessel’s https://github.com/jmhessel/FightingWords

Identifies the fighting words of two groups of corpus components (e.g. two groups of utterances), referred to as ‘class1’ and ‘class2’.

Runs on the Corpus’s Speakers, Utterances, or Conversations (as specified by obj_type). By default, the text used for each object type is:

  • For utterances, the utterance text.

  • For conversations, the joined texts of all the utterances in the conversation.

  • For speakers, the joined texts of all the utterances by the speaker.

Custom text selection can be set using the text_func argument.
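As an illustration of the default text choices above, the following sketch builds the per-object text from a toy list of utterances. The dictionary fields (speaker, conversation, text), the helper name joined_texts, and the single-space separator are assumptions made for illustration; they are not ConvoKit's data model.

```python
from collections import defaultdict

# Toy utterances; the field names are hypothetical stand-ins for corpus data.
utterances = [
    {"id": "u1", "speaker": "alice", "conversation": "c1", "text": "I disagree."},
    {"id": "u2", "speaker": "bob", "conversation": "c1", "text": "Why is that?"},
    {"id": "u3", "speaker": "alice", "conversation": "c2", "text": "New topic."},
]


def joined_texts(utts, key):
    """Join utterance texts that share the same value of `key`, in input order."""
    grouped = defaultdict(list)
    for utt in utts:
        grouped[utt[key]].append(utt["text"])
    # Assumption: texts are joined with a single space.
    return {group: " ".join(texts) for group, texts in grouped.items()}


utterance_texts = {utt["id"]: utt["text"] for utt in utterances}
conversation_texts = joined_texts(utterances, "conversation")
speaker_texts = joined_texts(utterances, "speaker")
```

A custom text_func would play the role of joined_texts here: it receives one object and returns the string to analyze.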

Parameters
  • obj_type – ‘utterance’, ‘conversation’, or ‘speaker’

  • text_func – function for getting text from the Corpus component object. By default, this is configured based on the obj_type.

  • cv – optional CountVectorizer. Default: an sklearn CountVectorizer with min_df=10, max_df=0.5, ngram_range=(1,3), and at most 15000 features

  • ngram_range – range of ngrams to use when the default cv is used

  • prior – either a float describing a uniform prior, or a vector describing a prior over vocabulary items. If using a predefined vocabulary, make sure to specify that when you make your CountVectorizer object.

  • class1_attribute_name – metadata attribute name to store class1 ngrams under during the transform() step. Default is ‘fighting_words_class1’.

  • class2_attribute_name – metadata attribute name to store class2 ngrams under during the transform() step. Default is ‘fighting_words_class2’.

Variables

cv – modifiable CountVectorizer
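The scoring behind fighting words can be sketched in plain Python. The example below follows the weighted log-odds ratio with a uniform prior as described by Monroe et al. and used in Jack Hessel's implementation; it is a minimal sketch on pre-tokenized toy data, not the ConvoKit code itself. The function name and the toy word lists are assumptions, and the sketch assumes a vocabulary of at least two distinct words.

```python
import math
from collections import Counter


def fighting_words_zscores(class1_tokens, class2_tokens, prior=0.1):
    """Return {word: z-score}; positive scores lean toward class 1."""
    counts1, counts2 = Counter(class1_tokens), Counter(class2_tokens)
    vocab = sorted(set(counts1) | set(counts2))
    n1, n2 = sum(counts1.values()), sum(counts2.values())
    a0 = prior * len(vocab)  # total mass of the uniform prior

    zscores = {}
    for word in vocab:
        # Counts smoothed by the prior.
        y1, y2 = counts1[word] + prior, counts2[word] + prior
        # Log-odds of the word within each class.
        log_odds1 = math.log(y1 / (n1 + a0 - y1))
        log_odds2 = math.log(y2 / (n2 + a0 - y2))
        # Approximate variance of the log-odds difference.
        variance = 1.0 / y1 + 1.0 / y2
        zscores[word] = (log_odds1 - log_odds2) / math.sqrt(variance)
    return zscores


class1 = "god faith church faith prayer the the".split()
class2 = "evidence science the the evidence proof".split()
scores = fighting_words_zscores(class1, class2)
```

In this toy data, "faith" receives a positive score, "evidence" a negative one, and "the", which is equally frequent in both groups, stays close to zero.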

static clean_text(in_string)

Cleans the text using the Python clean-text package: fixes unicode, transliterates all characters to the closest ASCII, lowercases text, removes line breaks and punctuation, and replaces URLs, emails, phone numbers, numbers, and currency symbols with a corresponding <TOKEN>

Parameters

in_string – input string

Returns

cleaned string
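To give a feel for this kind of cleaning, here is a rough approximation using only the standard library. It is not the clean-text package and does not reproduce its behavior exactly: the regular expressions are simplified, phone numbers and currency are not handled, and the placeholder names (<URL>, <EMAIL>, <NUMBER>) and the function name rough_clean are assumptions.

```python
import re
import unicodedata

# Simplified patterns, for illustration only.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")


def rough_clean(text):
    """Approximate cleaning: ASCII, lowercase, placeholders, no punctuation."""
    # Transliterate by decomposing characters and dropping non-ASCII marks.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = text.lower()
    # Replace URLs first so that digits inside them are not treated as numbers.
    text = URL_RE.sub(" <URL> ", text)
    text = EMAIL_RE.sub(" <EMAIL> ", text)
    text = NUMBER_RE.sub(" <NUMBER> ", text)
    # Drop punctuation but keep the angle brackets of the placeholders.
    text = re.sub(r"[^\w\s<>]", " ", text)
    # Collapse whitespace, which also removes line breaks.
    return " ".join(text.split())


sample = "Café costs 5 dollars!\nSee https://example.com"
cleaned = rough_clean(sample)
```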

fit(corpus: convokit.model.corpus.Corpus, class1_func: Callable[[convokit.model.corpusComponent.CorpusComponent], bool], class2_func: Callable[[convokit.model.corpusComponent.CorpusComponent], bool], y=None, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function FightingWords.<lambda>>)
Learn the fighting words from a corpus, with an optional selector that selects corpus components prior to grouping them into class1 / class2.

A warning will be printed if there are components that appear in both class1 and class2, as FightingWords is typically used for disjoint sets of texts.

Parameters
  • corpus – target Corpus

  • class1_func – selector function for identifying corpus components that belong to class 1

  • class2_func – selector function for identifying corpus components that belong to class 2

  • selector – a (lambda) function that takes a CorpusComponent and returns True/False; this selects for Corpus components that should be considered in this fitting step

Returns

fitted FightingWords Transformer
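The grouping step described for fit() can be sketched as follows. The components here are plain dictionaries with hypothetical subreddit and text fields, and split_into_classes is an illustrative helper rather than part of the library; in real use the three functions would receive CorpusComponent objects.

```python
def split_into_classes(components, class1_func, class2_func, selector=lambda c: True):
    """Filter with `selector`, then group the remaining components into two classes."""
    selected = [c for c in components if selector(c)]
    class1 = [c for c in selected if class1_func(c)]
    class2 = [c for c in selected if class2_func(c)]
    # Components matched by both functions make the two classes overlap.
    overlap = [c for c in class1 if class2_func(c)]
    if overlap:
        print(f"Warning: {len(overlap)} component(s) appear in both classes.")
    return class1, class2


# Hypothetical components: dicts with a subreddit label and a text.
components = [
    {"subreddit": "atheism", "text": "show me the evidence"},
    {"subreddit": "Christianity", "text": "we prayed together"},
    {"subreddit": "news", "text": "unrelated post"},
    {"subreddit": "atheism", "text": ""},
]

class1, class2 = split_into_classes(
    components,
    class1_func=lambda c: c["subreddit"] == "atheism",
    class2_func=lambda c: c["subreddit"] == "Christianity",
    selector=lambda c: len(c["text"]) > 0,  # drop empty texts before grouping
)
```

Components that pass the selector but match neither class function (the "news" post above) are simply left out of the comparison.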

get_class(ngram)

Get the class that the ngram belongs to more strongly.

Parameters

ngram – ngram of interest

Returns

“class1” if the ngram has a non-negative z-score, “class2” if the ngram has a negative z-score, None if the ngram is not in the vocabulary

get_model()

Get the FightingWords CountVectorizer model

get_ngram_zscores(class1_name='class1', class2_name='class2')

Get a DataFrame of ngrams and their corresponding zscores and class labels.

Parameters
  • class1_name – readable name for objects in class1

  • class2_name – readable name for objects in class2

Returns

a DataFrame of ngrams with zscores and classes, indexed by the ngrams

get_ngrams_past_threshold(threshold: float = 1.0) → Tuple[List[str], List[str]]

Returns the (ordered) ngrams that have absolute z-scores that exceed a specified threshold, for both classes

Parameters

threshold – by default, threshold z-score = 1

Returns

two ordered lists of ngrams (with descending z-score): first list is for class 1, second list is for class 2

get_top_k_ngrams(top_k: int = 10) → Tuple[List[str], List[str]]

Returns the (ordered) top k ngrams for both classes.

Parameters

top_k – by default, k = 10

Returns

two ordered lists of ngrams (with descending z-score): first list is for class 1, second list is for class 2.
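The two selection modes, by threshold and by top k, can be sketched over a plain dictionary of z-scores. The scores below are made up for illustration, and both helper names are hypothetical. The sketch assumes that positive z-scores lean toward class 1 and that each list runs from most to least distinctive for its class; whether the library orders the class 2 list in exactly this way is an assumption.

```python
# Hypothetical z-scores; positive values lean toward class 1.
zscores = {"faith": 2.4, "church": 1.3, "the": 0.1, "proof": -1.2, "evidence": -3.0}


def ngrams_past_threshold(scores, threshold=1.0):
    """Ngrams whose absolute z-score exceeds `threshold`, split by sign."""
    class1 = sorted((n for n, z in scores.items() if z > threshold), key=lambda n: -scores[n])
    class2 = sorted((n for n, z in scores.items() if z < -threshold), key=lambda n: scores[n])
    return class1, class2


def top_k_ngrams(scores, top_k=10):
    """The k highest-scoring ngrams for class 1 and the k lowest-scoring for class 2."""
    ranked = sorted(scores, key=lambda n: scores[n])  # ascending z-score
    return ranked[::-1][:top_k], ranked[:top_k]


past_one = ngrams_past_threshold(zscores, threshold=1.0)
top_two = top_k_ngrams(zscores, top_k=2)
```

Note that with a small vocabulary and a large k, the two top-k lists in this sketch can share ngrams, whereas the threshold lists never overlap.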

get_zscore(ngram)

Get z-score of a given ngram.

Parameters

ngram – ngram of interest

Returns

z-score value, None if ngram not in vocabulary

plot_fighting_words(max_label_size=15, class1_name='class1', class2_name='class2', config=None)

Plots the distribution of fighting words.

Adapted from Xanda Schofield’s https://gist.github.com/xandaschofield/3c4070b2f232b185ce6a09e47b4e7473

Specifically, the weighted log-odds ratio is plotted against the frequency of the word within the topic.

Only the most significant ngrams will have text labels. The most significant ngrams are specified by the config parameter. By default, the top 10 fighting words of each class are labeled.

Parameters
  • max_label_size – For the text labels, set the largest possible size for any text label (the rest will be scaled accordingly)

  • class1_name – descriptive name for class1 corpus component objects

  • class2_name – descriptive name for class2 corpus component objects

  • config – a dictionary of configuration parameters for setting which fighting words are significant enough to annotate. The dictionary should hold the keys: annot_method (‘top_k’ or ‘threshold’), and either ‘threshold’ (a float for the min absolute z-score to be considered significant) or ‘top_k’ (an int to set the value of k). By default, config is {‘annot_method’: ‘top_k’, ‘top_k’: 10}.

Returns

None (plot is generated)
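The two shapes of the config dictionary described above look like this in practice. The helper resolve_config is hypothetical and only illustrates how the keys relate to each other; the library's own validation may differ.

```python
DEFAULT_CONFIG = {"annot_method": "top_k", "top_k": 10}


def resolve_config(config=None):
    """Hypothetical helper: fall back to the default and check the required keys."""
    config = DEFAULT_CONFIG if config is None else config
    method = config.get("annot_method")
    if method not in ("top_k", "threshold"):
        raise ValueError("annot_method must be 'top_k' or 'threshold'")
    if method not in config:
        raise ValueError(f"config must also contain the key '{method}'")
    return method, config[method]


# Label the 25 strongest ngrams of each class.
top_k_config = {"annot_method": "top_k", "top_k": 25}

# Label every ngram whose absolute z-score is at least 2.0.
threshold_config = {"annot_method": "threshold", "threshold": 2.0}
```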

set_model(cv)

Set the FightingWords CountVectorizer model

summarize(corpus: convokit.model.corpus.Corpus, plot: bool = False, class1_name='class1', class2_name='class2')

Returns a DataFrame of ngrams with zscores and classes, and optionally plots the fighting words distribution. FightingWords Transformer must be fitted prior to running this.

Parameters
  • corpus – corpus to learn fighting words from if not already fitted

  • plot – if True, generates a plot for the fighting words distribution

  • class1_name – descriptive name for class1 corpus component objects

  • class2_name – descriptive name for class2 corpus component objects

Returns

DataFrame of ngrams with zscores and classes, indexed by the ngrams (plot is optionally generated)

transform(corpus: convokit.model.corpus.Corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function FightingWords.<lambda>>, config=None) → convokit.model.corpus.Corpus

Annotates the corpus component objects with the lists of fighting words that the object contains.

The relevant fighting words to use are specified by the config parameter. By default, the annotation method is to annotate the corpus components with the top 10 fighting words of each class.

Lists are stored under the metadata attributes defined when initializing the FightingWords Transformer.

Parameters
  • corpus – corpus to annotate

  • selector – a (lambda) function that takes a CorpusComponent and returns True/False; this selects for corpus components that should be annotated with the fighting words

  • config – a dictionary of configuration parameters for setting which fighting words are significant enough to annotate. The dictionary should hold the keys: annot_method (‘top_k’ or ‘threshold’), and either ‘threshold’ (a float for the min absolute z-score to be considered significant) or ‘top_k’ (an int to set the value of k). By default, config is {‘annot_method’: ‘top_k’, ‘top_k’: 10}.

Returns

annotated corpus
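The annotation step can be pictured as checking, for each object, which of the selected fighting words occur in its text. The sketch below is an assumption about the general idea, not the library's matching logic: it matches ngrams as whitespace-separated token sequences, uses a plain dictionary in place of an object's metadata, and the helper name and word lists are made up.

```python
def contained_ngrams(text, ngrams):
    """Return the ngrams from `ngrams` that occur as token sequences in `text`."""
    tokens = text.lower().split()
    found = []
    for ngram in ngrams:
        parts = ngram.split()
        n = len(parts)
        # Slide a window of n tokens over the text and compare.
        if any(tokens[i:i + n] == parts for i in range(len(tokens) - n + 1)):
            found.append(ngram)
    return found


# Hypothetical fighting words selected for each class.
class1_words = ["faith", "holy spirit"]
class2_words = ["evidence", "burden of proof"]

# A plain dict stands in for the metadata of one corpus component.
meta = {}
text = "the burden of proof requires evidence not faith"
meta["fighting_words_class1"] = contained_ngrams(text, class1_words)
meta["fighting_words_class2"] = contained_ngrams(text, class2_words)
```

The metadata keys used here are the default attribute names; custom names passed at initialization would replace them.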