Fighting Words

Based on Monroe et al.’s Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.

Implementation adapted from Jack Hessel’s implementation.

Example usage: finding the fighting words of r/atheism and r/Christianity.

class convokit.fighting_words.fightingWords.FightingWords(obj_type='utterance', text_func=None, cv=None, ngram_range=None, prior=0.1, class1_attribute_name='fighting_words_class1', class2_attribute_name='fighting_words_class2')

Based on Monroe et al.’s “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict”

Implementation adapted from Jack Hessel’s https://github.com/jmhessel/FightingWords

Identifies the fighting words of two groups of corpus components (e.g. two groups of utterances), which we define as the groups: ‘class1’ and ‘class2’

Runs on the Corpus’s Speakers, Utterances, or Conversations (as specified by obj_type). By default, the text used for each object type is:

For utterances, the utterance text.
For conversations, the joined texts of all the utterances in the conversation.
For speakers, the joined texts of all the utterances by the speaker.

Custom text extraction can be configured using the text_func argument.
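As a rough illustration of the default text selection described above, the sketch below uses plain-Python stand-ins to show how the text for each object type could be assembled. The Utt dataclass and the default_text helper are hypothetical (they are not ConvoKit classes), and joining utterances with a single space is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utt:
    """Hypothetical stand-in for an utterance (not a ConvoKit class)."""
    text: str
    speaker: str
    conversation_id: str


def default_text(obj_type: str, key: str, utts: List[Utt]) -> str:
    """Assemble the text for one object, following the defaults described above.

    key is the utterance index (as a string) for 'utterance', the conversation
    id for 'conversation', or the speaker id for 'speaker'.
    """
    if obj_type == "utterance":
        return utts[int(key)].text
    if obj_type == "conversation":
        # Assumption: utterance texts are joined with a single space.
        return " ".join(u.text for u in utts if u.conversation_id == key)
    if obj_type == "speaker":
        return " ".join(u.text for u in utts if u.speaker == key)
    raise ValueError(f"unknown obj_type: {obj_type}")


utts = [
    Utt("hello there", "alice", "c1"),
    Utt("general kenobi", "bob", "c1"),
    Utt("goodbye", "alice", "c2"),
]
conversation_text = default_text("conversation", "c1", utts)
speaker_text = default_text("speaker", "alice", utts)
```

A custom text_func plays the same role as default_text here: it receives one corpus component and returns the string to be vectorized.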

Parameters

obj_type – ‘utterance’, ‘conversation’, or ‘speaker’
text_func – function for getting text from the Corpus component object. By default, this is configured based on the obj_type.
cv – optional CountVectorizer. Default: an sklearn CountVectorizer with min_df=10, max_df=.5, ngram_range=(1,3), and at most 15000 features
ngram_range – range of ngrams to use if using default cv
prior – either a float describing a uniform prior, or a vector describing a prior over vocabulary items. If using a predefined vocabulary, make sure to specify that when you make your CountVectorizer object.
class1_attribute_name – metadata attribute name to store class1 ngrams under during the transform() step. Default is ‘fighting_words_class1’.
class2_attribute_name – metadata attribute name to store class2 ngrams under during the transform() step. Default is ‘fighting_words_class2’.

Variables

cv – modifiable countvectorizer
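To make the role of the prior parameter more concrete, here is a minimal sketch of the weighted log-odds z-score from Monroe et al. with a uniform prior. It is not ConvoKit's code: whitespace tokenization and unigram-only counts stand in for the CountVectorizer, the function name log_odds_zscores is hypothetical, and the orientation (positive scores lean towards class1) is an assumption consistent with the get_class description.

```python
import math
from collections import Counter
from typing import Dict, List


def log_odds_zscores(class1_docs: List[str], class2_docs: List[str],
                     prior: float = 0.1) -> Dict[str, float]:
    """Weighted log-odds z-scores with a uniform prior (unigrams only)."""
    counts1 = Counter(tok for doc in class1_docs for tok in doc.split())
    counts2 = Counter(tok for doc in class2_docs for tok in doc.split())
    vocab = sorted(set(counts1) | set(counts2))

    n1 = sum(counts1.values())  # total tokens in class 1
    n2 = sum(counts2.values())  # total tokens in class 2
    a0 = prior * len(vocab)     # total prior mass over the vocabulary

    zscores = {}
    for w in vocab:
        y1 = counts1[w] + prior  # smoothed count in class 1
        y2 = counts2[w] + prior  # smoothed count in class 2
        # Difference of the smoothed log-odds between the two classes
        delta = math.log(y1 / (n1 + a0 - y1)) - math.log(y2 / (n2 + a0 - y2))
        # Approximate variance of that difference
        variance = 1.0 / y1 + 1.0 / y2
        zscores[w] = delta / math.sqrt(variance)
    return zscores


class1_docs = ["god faith church", "faith prayer"]
class2_docs = ["science evidence", "evidence logic faith"]
zscores = log_odds_zscores(class1_docs, class2_docs)
```

A larger prior pulls the scores of rare ngrams towards zero. Note that this sketch does not guard against degenerate inputs, such as a vocabulary of a single word, where the denominator can reach zero.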

static clean_text(in_string)

Cleans the text using the Python clean-text package: fixes unicode, transliterates all characters to the closest ASCII, lowercases text, removes line breaks and punctuation, and replaces URLs, emails, phone numbers, numbers, and currency symbols with a corresponding <TOKEN>

Parameters: in_string – input string
Returns: cleaned string
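For readers without the clean-text package at hand, the sketch below approximates a few of the steps listed above using only the standard library. It is a simplified stand-in, not the package's behavior: it handles only URLs and numbers, and the placeholder names <URL> and <NUMBER> as well as the function name simple_clean_text are assumptions for illustration.

```python
import re
import unicodedata


def simple_clean_text(in_string: str) -> str:
    """Simplified approximation of the cleaning steps (stdlib only)."""
    # Transliterate to the closest ASCII by dropping combining marks
    text = unicodedata.normalize("NFKD", in_string)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.lower()
    # Replace URLs and numbers with placeholder tokens (names are assumptions)
    text = re.sub(r"https?://\S+", "<URL>", text)
    text = re.sub(r"\b\d+(?:\.\d+)?\b", "<NUMBER>", text)
    # Remove punctuation, keeping the angle brackets of the placeholders
    text = re.sub(r"[^\w\s<>]", "", text)
    # Collapse line breaks and repeated whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()


cleaned = simple_clean_text("Café at https://example.com!\nCosts 25 dollars.")
```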

fit(corpus: convokit.model.corpus.Corpus, class1_func: Callable[[convokit.model.corpusComponent.CorpusComponent], bool], class2_func: Callable[[convokit.model.corpusComponent.CorpusComponent], bool], y=None, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function FightingWords.<lambda>>)

Learn the fighting words from a corpus, with an optional selector that filters corpus components before they are grouped into class1 / class2.
A warning will be printed if any components appear in both class1 and class2, as FightingWords is typically used for disjoint sets of texts.

Parameters

corpus – target Corpus
class1_func – selector function for identifying corpus components that belong to class 1
class2_func – selector function for identifying corpus components that belong to class 2
selector – a (lambda) function that takes a CorpusComponent and returns True/False; this selects for Corpus components that should be considered in this fitting step

Returns

fitted FightingWords Transformer
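The grouping step described above can be sketched in plain Python as follows. The split_classes helper and the dictionary-based posts are hypothetical stand-ins for corpus components, and the default selector that accepts everything is an assumption about the library's default lambda.

```python
import warnings
from typing import Callable, Iterable, List, Tuple


def split_classes(objs: Iterable[dict],
                  class1_func: Callable[[dict], bool],
                  class2_func: Callable[[dict], bool],
                  selector: Callable[[dict], bool] = lambda obj: True
                  ) -> Tuple[List[dict], List[dict]]:
    """Filter objects with selector, then group them into class1 / class2."""
    selected = [o for o in objs if selector(o)]
    class1 = [o for o in selected if class1_func(o)]
    class2 = [o for o in selected if class2_func(o)]
    # Warn when the two groups are not disjoint
    overlap = [o for o in class1 if class2_func(o)]
    if overlap:
        warnings.warn(f"{len(overlap)} object(s) appear in both class1 and class2")
    return class1, class2


posts = [
    {"subreddit": "atheism", "text": "evidence matters", "score": 5},
    {"subreddit": "Christianity", "text": "faith matters", "score": 3},
    {"subreddit": "atheism", "text": "low score", "score": 0},
    {"subreddit": "askreddit", "text": "unrelated", "score": 9},
]
class1, class2 = split_classes(
    posts,
    class1_func=lambda p: p["subreddit"] == "atheism",
    class2_func=lambda p: p["subreddit"] == "Christianity",
    selector=lambda p: p["score"] > 0,
)
```

Objects that pass the selector but match neither class function (the askreddit post here) are simply left out of both groups.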

get_class(ngram)

Get the class that the ngram is more strongly associated with.

Parameters: ngram – ngram of interest
Returns: “class1” if the ngram has a non-negative z-score, “class2” if the ngram has a negative z-score, None if the ngram is not in the vocabulary
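The sign convention can be sketched as a small lookup. The ngram_class helper and the example scores are hypothetical, and assigning a z-score of exactly zero to class1 follows the "non-negative" wording above rather than a check of the library's behavior.

```python
from typing import Dict, Optional


def ngram_class(zscores: Dict[str, float], ngram: str) -> Optional[str]:
    """Map an ngram to 'class1' / 'class2' by the sign of its z-score."""
    z = zscores.get(ngram)
    if z is None:
        return None  # ngram not in vocabulary
    # Assumption: a z-score of exactly zero counts as class1
    return "class1" if z >= 0 else "class2"


example_zscores = {"god": 2.5, "the": 0.0, "evidence": -3.0}
labels = {ng: ngram_class(example_zscores, ng) for ng in example_zscores}
```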

get_model(): Get the FightingWords CountVectorizer model

get_ngram_zscores(class1_name='class1', class2_name='class2')

Get a DataFrame of ngrams and their corresponding zscores and class labels.

Parameters

class1_name – readable name for objects in class1
class2_name – readable name for objects in class2

Returns

a DataFrame of ngrams with zscores and classes, indexed by the ngrams

get_ngrams_past_threshold(threshold: float = 1.0) → Tuple[List[str], List[str]]

Returns the (ordered) ngrams that have absolute z-scores that exceed a specified threshold, for both classes

Parameters: threshold – by default, threshold z-score = 1
Returns: two ordered lists of ngrams (with descending z-score): first list is for class 1, second list is for class 2
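A minimal sketch of threshold-based selection, given a mapping from ngrams to z-scores. The helper name and example scores are hypothetical. Each list is ordered strongest first, which for class 2 means most negative first; that ordering is an assumption of this sketch.

```python
from typing import Dict, List, Tuple


def ngrams_past_threshold(zscores: Dict[str, float],
                          threshold: float = 1.0
                          ) -> Tuple[List[str], List[str]]:
    """Return ngrams whose absolute z-score exceeds threshold, per class."""
    ascending = sorted(zscores, key=zscores.get)
    # Class 1: positive scores above the threshold, largest first
    class1 = [ng for ng in reversed(ascending) if zscores[ng] > threshold]
    # Class 2: negative scores below -threshold, most negative first
    class2 = [ng for ng in ascending if zscores[ng] < -threshold]
    return class1, class2


example_zscores = {"god": 2.5, "faith": 1.2, "the": 0.1,
                   "logic": -1.5, "evidence": -3.0}
class1_ngrams, class2_ngrams = ngrams_past_threshold(example_zscores, 1.0)
```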

get_top_k_ngrams(top_k: int = 10) → Tuple[List[str], List[str]]

Returns the (ordered) top k ngrams for both classes.

Parameters: top_k – by default, k = 10
Returns: two ordered lists of ngrams (with descending z-score): first list is for class 1, second list is for class 2.
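Top-k selection can be sketched in the same way, by taking the two ends of the ranked z-scores. The helper name and example scores are hypothetical, and ordering each list strongest first is again an assumption.

```python
from typing import Dict, List, Tuple


def top_k_ngrams(zscores: Dict[str, float],
                 top_k: int = 10) -> Tuple[List[str], List[str]]:
    """Return the k highest-scoring and k lowest-scoring ngrams."""
    ascending = sorted(zscores, key=zscores.get)
    class1 = ascending[::-1][:top_k]  # highest z-scores first
    class2 = ascending[:top_k]        # most negative z-scores first
    return class1, class2


example_zscores = {"god": 2.5, "faith": 1.2, "the": 0.1,
                   "logic": -1.5, "evidence": -3.0}
top_class1, top_class2 = top_k_ngrams(example_zscores, top_k=2)
```

Unlike threshold selection, this takes k ngrams from each end regardless of how large their scores are, so with a small vocabulary the two lists can overlap or include weakly associated ngrams.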

get_zscore(ngram)

Get z-score of a given ngram.

Parameters: ngram – ngram of interest
Returns: z-score value, None if ngram not in vocabulary

plot_fighting_words(max_label_size=15, class1_name='class1', class2_name='class2', config=None)

Plots the distribution of fighting words.

Adapted from Xanda Schofield’s https://gist.github.com/xandaschofield/3c4070b2f232b185ce6a09e47b4e7473

Specifically, the weighted log-odds ratio is plotted against frequency of word within topic.

Only the most significant ngrams will have text labels. The most significant ngrams are specified by the config parameter. By default, the top 10 fighting words of each class are labeled.

Parameters

max_label_size – For the text labels, set the largest possible size for any text label (the rest will be scaled accordingly)
class1_name – descriptive name for class1 corpus component objects
class2_name – descriptive name for class2 corpus component objects
config – a dictionary of configuration parameters for setting which fighting words are significant enough to annotate. The dictionary should hold the keys: annot_method (‘top_k’ or ‘threshold’), and either ‘threshold’ (a float for the min absolute z-score to be considered significant) or ‘top_k’ (an int to set the value of k). By default, config is {‘annot_method’: ‘top_k’, ‘top_k’: 10}.

Returns

None (plot is generated)

set_model(cv): Set the FightingWords CountVectorizer model

summarize(corpus: convokit.model.corpus.Corpus, plot: bool = False, class1_name='class1', class2_name='class2')

Returns a DataFrame of ngrams with z-scores and classes, and optionally plots the fighting words distribution. The FightingWords Transformer must be fitted prior to running this.

Parameters

corpus – corpus to learn fighting words from if not already fitted
plot – if True, generates a plot for the fighting words distribution
class1_name – descriptive name for class1 corpus component objects
class2_name – descriptive name for class2 corpus component objects

Returns

DataFrame of ngrams with zscores and classes, indexed by the ngrams (plot is optionally generated)

transform(corpus: convokit.model.corpus.Corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function FightingWords.<lambda>>, config=None) → convokit.model.corpus.Corpus

Annotates the corpus component objects with the lists of fighting words that the object contains.

The relevant fighting words to use are specified by the config parameter. By default, the annotation method is to annotate the corpus components with the top 10 fighting words of each class.

Lists are stored under the metadata attributes defined when initializing the FightingWords Transformer.

Parameters

corpus – corpus to annotate
selector – a (lambda) function that takes a CorpusComponent and returns True/False; this selects for corpus components that should be annotated with the fighting words
config – a dictionary of configuration parameters for setting which fighting words are significant enough to annotate. The dictionary should hold the keys: annot_method (‘top_k’ or ‘threshold’), and either ‘threshold’ (a float for the min absolute z-score to be considered significant) or ‘top_k’ (an int to set the value of k). By default, config is {‘annot_method’: ‘top_k’, ‘top_k’: 10}.

Returns

annotated corpus
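The annotation step can be sketched as two small functions: one that resolves the config dictionary into two word lists, and one that records which of those words occur in a given text. Both helpers are hypothetical, and matching whole space-delimited ngrams is an assumption that may differ from how the library checks for containment.

```python
from typing import Dict, List, Optional, Tuple


def select_fighting_words(zscores: Dict[str, float],
                          config: Optional[dict] = None
                          ) -> Tuple[List[str], List[str]]:
    """Resolve a config dict into (class1 words, class2 words)."""
    config = config or {"annot_method": "top_k", "top_k": 10}
    ascending = sorted(zscores, key=zscores.get)
    if config["annot_method"] == "top_k":
        k = config["top_k"]
        return ascending[::-1][:k], ascending[:k]
    if config["annot_method"] == "threshold":
        t = config["threshold"]
        return ([ng for ng in reversed(ascending) if zscores[ng] > t],
                [ng for ng in ascending if zscores[ng] < -t])
    raise ValueError(f"unknown annot_method: {config['annot_method']}")


def annotate(text: str, class1_words: List[str],
             class2_words: List[str]) -> Dict[str, List[str]]:
    """Record which fighting words of each class occur in the text."""
    padded = f" {text} "  # pad so that only whole ngrams match
    return {
        "fighting_words_class1": [w for w in class1_words if f" {w} " in padded],
        "fighting_words_class2": [w for w in class2_words if f" {w} " in padded],
    }


example_zscores = {"god": 2.5, "faith": 1.2, "the": 0.1,
                   "logic": -1.5, "evidence": -3.0}
words1, words2 = select_fighting_words(
    example_zscores, {"annot_method": "threshold", "threshold": 1.0})
meta = annotate("faith and evidence both matter", words1, words2)
```

The keys of the returned dictionary mirror the default attribute names, class1_attribute_name and class2_attribute_name, under which the lists are stored.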