Fighting Words¶
Based on Monroe et al.’s Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.
Implementation adapted from Jack Hessel’s implementation.
Example usage: finding the fighting words of r/atheism and r/Christianity.
-
class
convokit.fighting_words.fightingWords.
FightingWords
(obj_type='utterance', text_func=None, cv=None, ngram_range=None, prior=0.1, class1_attribute_name='fighting_words_class1', class2_attribute_name='fighting_words_class2')¶ Based on Monroe et al.’s “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict”
Implementation adapted from Jack Hessel’s https://github.com/jmhessel/FightingWords
Identifies the fighting words of two groups of corpus components (e.g. two groups of utterances), referred to as ‘class1’ and ‘class2’.
Runs on the Corpus’s Speakers, Utterances, or Conversations (as specified by obj_type). By default, the text used depends on the object type:
For utterances, this is the utterance text.
For conversations, this is the joined texts of all the utterances in the conversation.
For speakers, this is the joined texts of all the utterances by the speaker.
Custom text configurations can be set using the text_func argument.
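The default text choices described above can be illustrated with a small sketch. It uses plain dictionaries as hypothetical stand-ins for ConvoKit utterances, and the function name default_texts, the field names, and the single-space join are all assumptions of the sketch rather than ConvoKit behavior.

```python
# Hypothetical stand-in data: each utterance is a plain dict, not a ConvoKit object.
utterances = [
    {"id": "u1", "speaker": "alice", "conversation_id": "c1", "text": "I disagree."},
    {"id": "u2", "speaker": "bob", "conversation_id": "c1", "text": "Why is that?"},
    {"id": "u3", "speaker": "alice", "conversation_id": "c2", "text": "New thread here."},
]


def default_texts(utterances, obj_type="utterance"):
    """Return {object id: text} for the requested object type."""
    if obj_type == "utterance":
        return {u["id"]: u["text"] for u in utterances}
    # Conversations and speakers are both built by grouping utterances on a key.
    key = {"conversation": "conversation_id", "speaker": "speaker"}[obj_type]
    grouped = {}
    for u in utterances:
        grouped.setdefault(u[key], []).append(u["text"])
    # Joining with a single space is an assumption of this sketch.
    return {obj_id: " ".join(texts) for obj_id, texts in grouped.items()}


speaker_texts = default_texts(utterances, "speaker")
conversation_texts = default_texts(utterances, "conversation")
```

A custom text_func would play the role of the per-object lookup here: it receives one corpus component and returns the string to analyze.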
- Parameters
obj_type – ‘utterance’, ‘conversation’, or ‘speaker’
text_func – function for getting text from the Corpus component object. By default, this is configured based on the obj_type.
cv – optional CountVectorizer. Default: an sklearn CountVectorizer with min_df=10, max_df=.5, ngram_range=(1,3), and at most 15000 features
ngram_range – range of ngrams to use if using default cv
prior – either a float describing a uniform prior, or a vector describing a prior over vocabulary items. If using a predefined vocabulary, make sure to specify that when you make your CountVectorizer object.
class1_attribute_name – metadata attribute name to store class1 ngrams under during the transform() step. Default is ‘fighting_words_class1’.
class2_attribute_name – metadata attribute name to store class2 ngrams under during the transform() step. Default is ‘fighting_words_class2’.
- Variables
cv – modifiable countvectorizer
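As background for the prior parameter, the z-score described by Monroe et al. is a log-odds ratio smoothed by a Dirichlet prior and divided by its estimated standard deviation. The following is a minimal sketch of that calculation for a uniform (float) prior over plain token lists. The function name log_odds_zscores and the toy tokens are hypothetical, and the sketch is not ConvoKit's code.

```python
import math
from collections import Counter


def log_odds_zscores(tokens1, tokens2, prior=0.1):
    """Z-scored log-odds ratio with a uniform Dirichlet prior (sketch)."""
    counts1, counts2 = Counter(tokens1), Counter(tokens2)
    vocab = sorted(set(counts1) | set(counts2))
    a0 = prior * len(vocab)  # total prior mass over the vocabulary
    n1, n2 = sum(counts1.values()), sum(counts2.values())
    zscores = {}
    for word in vocab:
        y1, y2 = counts1[word], counts2[word]
        # Smoothed log-odds of the word within each class.
        log_odds1 = math.log((y1 + prior) / (n1 + a0 - y1 - prior))
        log_odds2 = math.log((y2 + prior) / (n2 + a0 - y2 - prior))
        delta = log_odds1 - log_odds2
        # Approximate variance of the log-odds difference.
        variance = 1.0 / (y1 + prior) + 1.0 / (y2 + prior)
        zscores[word] = delta / math.sqrt(variance)
    return zscores


class1_tokens = "god faith faith church".split()
class2_tokens = "science science evidence god".split()
toy_zscores = log_odds_zscores(class1_tokens, class2_tokens, prior=0.1)
```

In this toy input, words seen only in the first list get positive scores, words seen only in the second get negative scores, and a word used equally often in both scores zero. A larger prior pulls all scores toward zero, which matters most for rare ngrams.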
-
static
clean_text
(in_string)¶ Cleans the text using Python clean-text package: fixes unicode, transliterates all characters to closest ASCII, lowercases text, removes line breaks and punctuation, replaces (urls, emails, phone numbers, numbers, currency) with corresponding <TOKEN>
- Parameters
in_string – input string
- Returns
cleaned string
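To give a rough sense of the kind of normalization described above, here is a stdlib-only approximation. It is not the clean-text package and covers far fewer cases; the function name rough_clean, the regular expressions, and the placeholder strings are assumptions of this sketch.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")
PUNCT_RE = re.compile(r"[^\w\s]")
NUMBER_RE = re.compile(r"\d+")


def rough_clean(in_string):
    """Approximate cleaning: ASCII-fold, lowercase, replace URLs and numbers."""
    # Decompose accented characters and drop whatever has no ASCII form.
    ascii_text = (
        unicodedata.normalize("NFKD", in_string)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    cleaned = []
    # split() with no arguments also discards line breaks and repeated spaces.
    for token in ascii_text.lower().split():
        if URL_RE.fullmatch(token):
            cleaned.append("<URL>")
            continue
        token = PUNCT_RE.sub("", token)  # strip punctuation inside the token
        if NUMBER_RE.fullmatch(token):
            cleaned.append("<NUMBER>")
        elif token:
            cleaned.append(token)
    return " ".join(cleaned)


example = rough_clean("Café costs 5 dollars!\nSee https://example.com")
```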
-
fit
(corpus: convokit.model.corpus.Corpus, class1_func: Callable[[convokit.model.corpusComponent.CorpusComponent], bool], class2_func: Callable[[convokit.model.corpusComponent.CorpusComponent], bool], y=None, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function FightingWords.<lambda>>)¶ Learn the fighting words from a corpus, with an optional selector that filters corpus components before they are grouped into class1 / class2.
A warning will be printed if any components appear in both class1 and class2, as FightingWords is typically used for disjoint sets of texts.
- Parameters
corpus – target Corpus
class1_func – selector function for identifying corpus components that belong to class 1
class2_func – selector function for identifying corpus components that belong to class 2
selector – a (lambda) function that takes a CorpusComponent and returns True/False; this selects for Corpus components that should be considered in this fitting step
- Returns
fitted FightingWords Transformer
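The interplay of selector, class1_func, and class2_func can be sketched with plain Python. The dictionaries below are hypothetical stand-ins for corpus components, the helper name split_classes is invented for illustration, and the use of warnings.warn is an assumption about how an overlap might be reported.

```python
import warnings

# Hypothetical stand-in components: plain dicts instead of ConvoKit objects.
posts = [
    {"id": "p1", "subreddit": "atheism", "text": "Show me the evidence for that claim."},
    {"id": "p2", "subreddit": "Christianity", "text": "My faith and my church matter to me."},
    {"id": "p3", "subreddit": "atheism", "text": "ok"},
]


def split_classes(components, class1_func, class2_func, selector=lambda c: True):
    """Filter with selector first, then group the survivors into two classes."""
    selected = [c for c in components if selector(c)]
    class1 = [c for c in selected if class1_func(c)]
    class2 = [c for c in selected if class2_func(c)]
    # The two classes are expected to be disjoint, so flag any overlap.
    overlap = [c for c in class1 if class2_func(c)]
    if overlap:
        warnings.warn(f"{len(overlap)} component(s) appear in both classes")
    return class1, class2


def is_atheism(post):
    return post["subreddit"] == "atheism"


def is_christianity(post):
    return post["subreddit"] == "Christianity"


def long_enough(post):
    return len(post["text"].split()) >= 3


class1, class2 = split_classes(posts, is_atheism, is_christianity, selector=long_enough)
```

Because the selector runs first, the short post p3 never reaches either class even though it satisfies the class 1 predicate.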
-
get_class
(ngram)¶ Get the class that the ngram is more strongly associated with.
- Parameters
ngram – ngram of interest
- Returns
“class1” if the ngram has a non-negative z-score, “class2” if the ngram has a negative z-score, None if the ngram is not in the vocabulary
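The sign-based class assignment can be written as a small lookup. This sketch assumes z-scores are available as a plain dictionary and that a score of exactly zero falls into class 1; the function name class_of and the example scores are hypothetical.

```python
def class_of(ngram, zscores):
    """Map an ngram to a class label by the sign of its z-score (sketch)."""
    z = zscores.get(ngram)
    if z is None:
        return None  # ngram is not in the vocabulary
    # Assumption of this sketch: zero counts as class 1.
    return "class1" if z >= 0 else "class2"


example_zscores = {"faith": 2.5, "god": 0.0, "science": -3.0}
labels = {ngram: class_of(ngram, example_zscores) for ngram in example_zscores}
```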
-
get_model
()¶ Get the FightingWords CountVectorizer model
-
get_ngram_zscores
(class1_name='class1', class2_name='class2')¶ Get a DataFrame of ngrams and their corresponding zscores and class labels.
- Parameters
class1_name – readable name for objects in class1
class2_name – readable name for objects in class2
- Returns
a DataFrame of ngrams with zscores and classes, indexed by the ngrams
-
get_ngrams_past_threshold
(threshold: float = 1.0) → Tuple[List[str], List[str]]¶ Returns the (ordered) ngrams that have absolute z-scores that exceed a specified threshold, for both classes
- Parameters
threshold – by default, threshold z-score = 1
- Returns
two ordered lists of ngrams (with descending z-score): first list is for class 1, second list is for class 2
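A thresholded selection of this kind might look like the following sketch over a plain dictionary of z-scores. The function name and example scores are hypothetical, and listing class 2 from the most negative score upward (strongest association first) is an assumption of the sketch.

```python
def ngrams_past_threshold(zscores, threshold=1.0):
    """Return (class1_ngrams, class2_ngrams) whose |z-score| exceeds threshold."""
    class1 = sorted(
        (n for n, z in zscores.items() if z > threshold),
        key=zscores.get,
        reverse=True,  # largest positive z-score first
    )
    class2 = sorted(
        (n for n, z in zscores.items() if z < -threshold),
        key=zscores.get,  # most negative z-score first (assumed ordering)
    )
    return class1, class2


threshold_scores = {
    "faith": 2.5,
    "church": 1.2,
    "god": 0.0,
    "evidence": -1.5,
    "science": -3.0,
}
past_one = ngrams_past_threshold(threshold_scores)
past_two = ngrams_past_threshold(threshold_scores, threshold=2.0)
```

Raising the threshold shrinks both lists, so the number of returned ngrams depends on the data rather than being fixed in advance.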
-
get_top_k_ngrams
(top_k: int = 10) → Tuple[List[str], List[str]]¶ Returns the (ordered) top k ngrams for both classes.
- Parameters
top_k – by default, k = 10
- Returns
two ordered lists of ngrams (with descending z-score): first list is for class 1, second list is for class 2.
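Selecting a fixed number of ngrams per class can be sketched with heapq from the standard library. As before, the dictionary of z-scores and the function name are hypothetical, and the class 2 list is ordered from the most negative score upward as an assumption.

```python
import heapq


def top_k_ngrams(zscores, top_k=10):
    """Return the k highest-scoring and k lowest-scoring ngrams (sketch)."""
    class1 = heapq.nlargest(top_k, zscores, key=zscores.get)
    class2 = heapq.nsmallest(top_k, zscores, key=zscores.get)
    return class1, class2


topk_scores = {
    "faith": 2.5,
    "church": 1.2,
    "god": 0.0,
    "evidence": -1.5,
    "science": -3.0,
}
top_two = top_k_ngrams(topk_scores, top_k=2)
```

Note that with a vocabulary smaller than 2k the two lists in this sketch would overlap, since it does not check the sign of the scores.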
-
get_zscore
(ngram)¶ Get z-score of a given ngram.
- Parameters
ngram – ngram of interest
- Returns
z-score value, None if ngram not in vocabulary
-
plot_fighting_words
(max_label_size=15, class1_name='class1', class2_name='class2', config=None)¶ Plots the distribution of fighting words.
Adapted from Xanda Schofield’s https://gist.github.com/xandaschofield/3c4070b2f232b185ce6a09e47b4e7473
Specifically, the weighted log-odds ratio is plotted against frequency of word within topic.
Only the most significant ngrams will have text labels. The most significant ngrams are specified by the config parameter. By default, the top 10 fighting words of each class are labeled.
- Parameters
max_label_size – For the text labels, set the largest possible size for any text label (the rest will be scaled accordingly)
class1_name – descriptive name for class1 corpus component objects
class2_name – descriptive name for class2 corpus component objects
config – a dictionary of configuration parameters for setting which fighting words are significant enough to annotate. The dictionary should hold the keys: annot_method (‘top_k’ or ‘threshold’), and either ‘threshold’ (a float for the min absolute z-score to be considered significant) or ‘top_k’ (an int to set the value of k). By default, config is {‘annot_method’: ‘top_k’, ‘top_k’: 10}.
- Returns
None (plot is generated)
-
set_model
(cv)¶ Set the FightingWords CountVectorizer model
-
summarize
(corpus: convokit.model.corpus.Corpus, plot: bool = False, class1_name='class1', class2_name='class2')¶ Returns a DataFrame of ngrams with z-scores and classes, and optionally plots the fighting words distribution. FightingWords Transformer must be fitted prior to running this.
- Parameters
corpus – corpus to learn fighting words from if not already fitted
plot – if True, generates a plot for the fighting words distribution
class1_name – descriptive name for class1 corpus component objects
class2_name – descriptive name for class2 corpus component objects
- Returns
DataFrame of ngrams with zscores and classes, indexed by the ngrams (plot is optionally generated)
-
transform
(corpus: convokit.model.corpus.Corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function FightingWords.<lambda>>, config=None) → convokit.model.corpus.Corpus¶ Annotates the corpus component objects with the lists of fighting words that the object contains.
The relevant fighting words to use are specified by the config parameter. By default, the annotation method is to annotate the corpus components with the top 10 fighting words of each class.
Lists are stored under the metadata attributes defined when initializing the FightingWords Transformer.
- Parameters
corpus – corpus to annotate
selector – a (lambda) function that takes a CorpusComponent and returns True/False; this selects for corpus components that should be annotated with the fighting words
config – a dictionary of configuration parameters for setting which fighting words are significant enough to annotate. The dictionary should hold the keys: annot_method (‘top_k’ or ‘threshold’), and either ‘threshold’ (a float for the min absolute z-score to be considered significant) or ‘top_k’ (an int to set the value of k). By default, config is {‘annot_method’: ‘top_k’, ‘top_k’: 10}.
- Returns
annotated corpus
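The annotation step can be pictured as checking, for each object's text, which of the selected fighting words it contains. The sketch below works on a plain dictionary of texts; the helper names, the whitespace tokenization, and the exact-ngram matching rule are assumptions, and only the two default attribute names come from the parameters documented above.

```python
def ngrams_in(text, max_n=3):
    """Collect all word ngrams up to length max_n from a text."""
    tokens = text.lower().split()
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def annotate(
    texts,
    class1_ngrams,
    class2_ngrams,
    class1_attribute_name="fighting_words_class1",
    class2_attribute_name="fighting_words_class2",
):
    """Return {object id: metadata dict} listing the fighting words each text contains."""
    annotated = {}
    for obj_id, text in texts.items():
        present = ngrams_in(text)
        annotated[obj_id] = {
            class1_attribute_name: [g for g in class1_ngrams if g in present],
            class2_attribute_name: [g for g in class2_ngrams if g in present],
        }
    return annotated


texts = {
    "u1": "My faith in the church is strong",
    "u2": "Show me the evidence",
}
annotations = annotate(
    texts,
    class1_ngrams=["faith", "the church"],
    class2_ngrams=["evidence", "science"],
)
```

The two ngram lists passed in here would come from whichever annotation method the config selects, either the top k ngrams per class or those past a z-score threshold.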