Speaker Convo Diversity

Implements linguistic diversity measures as described in this paper.

Example usage: speaker conversation attributes

class convokit.speakerConvoDiversity.speakerConvoDiversity.SpeakerConvoDiversity(output_field, cmp_select_fn=<function SpeakerConvoDiversity.<lambda>>, ref_select_fn=<function SpeakerConvoDiversity.<lambda>>, select_fn=<function SpeakerConvoDiversity.<lambda>>, divergence_fn=<function compute_divergences>, speaker_convo_cols=[], speaker_cols=[], convo_cols=[], groupby=[], aux_input={}, recompute_tokens=False, verbosity=0)

Implements a methodology for computing the linguistic divergence between a speaker’s activity in each conversation in a corpus (i.e., the language of their utterances) and a reference language model trained over a different set of conversations or speakers. See SpeakerConvoDiversityWrapper for a more specific implementation that compares the language used by individuals within fixed lifestages; the implementation of that wrapper also gives examples of calls to this transformer.

The transformer assumes that a corpus has already been tokenized (via a call to TextParser).

In general, this is appropriate for cases where the reference language model you wish to compare against varies across different speakers or conversations; in contrast, if you wish to compare many conversations to a single language model (e.g., one trained on past conversations), then this will be inefficient.

This will produce attributes per speaker-conversation (i.e., the behavior of a speaker in a conversation); hence it takes as parameters functions which subset the data at a speaker-conversation level. These functions operate on a table with the following columns:
  • speaker: speaker ID

  • convo_id: conversation ID

  • convo_idx: n where this conversation is the nth that the speaker participated in

  • tokens: all utterances the speaker contributed to the conversation, concatenated together as a single list of words

  • any other speaker-conversation, speaker, or conversation-level metadata required to filter input and select reference language models per speaker-conversation (passed in via the speaker_convo_cols, speaker_cols and convo_cols parameters)

The table is the output of calling Corpus.get_full_attribute_table; see documentation of that function for further reference.
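To make the table layout concrete, here is a minimal sketch that builds rows of the same shape from a few made-up utterance records, using plain Python rather than ConvoKit. The record fields, the `build_rows` helper, the whitespace tokenization, and the 0-based convo_idx are all assumptions made for illustration; in practice the table comes from Corpus.get_full_attribute_table.

```python
from collections import defaultdict

# Hypothetical utterance records: (speaker, convo_id, timestamp, text).
utterances = [
    ("alice", "c1", 10, "hello there"),
    ("bob", "c1", 11, "hi alice"),
    ("alice", "c1", 12, "how are you"),
    ("alice", "c2", 20, "good morning"),
]

def build_rows(utts):
    """Group utterances into one row per (speaker, conversation)."""
    tokens = defaultdict(list)   # (speaker, convo_id) -> concatenated tokens
    first_seen = {}              # (speaker, convo_id) -> earliest timestamp
    for speaker, convo_id, ts, text in sorted(utts, key=lambda u: u[2]):
        key = (speaker, convo_id)
        tokens[key].extend(text.split())  # naive whitespace tokenization
        first_seen.setdefault(key, ts)
    # convo_idx: position of the conversation in the speaker's own history
    # (0-based here, which is an assumption of this sketch).
    counters = defaultdict(int)
    rows = []
    for key in sorted(first_seen, key=first_seen.get):
        speaker, convo_id = key
        rows.append({"speaker": speaker, "convo_id": convo_id,
                     "convo_idx": counters[speaker], "tokens": tokens[key]})
        counters[speaker] += 1
    return rows

rows = build_rows(utterances)
```

Each resulting row carries the speaker, the conversation, the speaker's position in their own conversation history, and a single flat token list for that speaker-conversation.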

The transformer supports two broad types of comparisons:
  • if groupby=[], then each text will be compared against a single reference text (specified by select_fn)

  • if groupby=[key] then each text will be compared against a set of reference texts, where each reference text represents a different chunk of the data, aggregated by key (e.g., each text could be compared against the utterances contributed by different speakers, such that in each iteration of a divergence computation, the text is compared against just the utterances of a single speaker.)
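The difference between the two modes can be sketched in plain Python. This is an illustration only, not the transformer's code: the `build_ref_token_list` helper and the toy rows are hypothetical, and the sketch assumes that with an empty groupby all selected reference rows are pooled into one text.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical reference rows, one per speaker-conversation.
ref_rows = [
    {"speaker": "alice", "tokens": ["a", "b"]},
    {"speaker": "bob", "tokens": ["c"]},
    {"speaker": "alice", "tokens": ["d"]},
]

def build_ref_token_list(rows, groupby):
    """Return the list of reference texts implied by `groupby`."""
    if not groupby:
        # No grouping: pool everything into a single reference text.
        return [list(chain.from_iterable(r["tokens"] for r in rows))]
    grouped = defaultdict(list)
    for r in rows:
        key = tuple(r[k] for k in groupby)
        grouped[key].extend(r["tokens"])
    # One reference text per distinct key value.
    return list(grouped.values())

single = build_ref_token_list(ref_rows, [])
per_speaker = build_ref_token_list(ref_rows, ["speaker"])
```

With groupby=["speaker"], a divergence computation would see one reference text per speaker rather than one pooled text.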

Parameters
  • cmp_select_fn – the subset of speaker-conversation entries to compute divergences for. Function of the form fn(df, aux), where df is a data frame indexed by speaker-conversation and aux is any auxiliary parameters required; returns a boolean mask over the data frame.

  • ref_select_fn – the subset of speaker-conversation entries to compute reference language models over. Function of the form fn(df, aux), where df is a data frame indexed by speaker-conversation and aux is any auxiliary parameters required; returns a boolean mask over the data frame.

  • select_fn – selects, for a given speaker-conversation entry, the reference entries it is compared against. Function of the form fn(df, row, aux), where df is a data frame indexed by speaker-conversation, row is a single row of such a data frame, and aux is any auxiliary parameters required; returns a boolean mask over the data frame.

  • divergence_fn – function to compute divergence between a speaker-conversation and reference texts. By default, the transformer will compute unigram perplexity scores, as implemented by the compute_divergences function. However, you can also specify your own divergence function (e.g., some sort of bigram divergence) using the same function signature.

  • speaker_convo_cols – additional speaker-convo attributes used as input to the selector functions

  • speaker_cols – additional speaker-level attributes

  • convo_cols – additional conversation-level attributes

  • groupby – whether to aggregate the reference texts according to the specified keys (leave empty to avoid aggregation).

  • aux_input – a dictionary of auxiliary input to the selector functions and the divergence computation

  • recompute_tokens – whether to reprocess tokens by aggregating all tokens across the different utterances made by a speaker in a conversation. Defaults to False, in which case existing cached output is reused.

  • verbosity – frequency of status messages.
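As an illustration of the selector signatures, the sketch below defines three toy selectors over a small pandas data frame. The table contents, the `max_idx` key in aux, and the particular selection rules are hypothetical; how the transformer combines the masks internally is not shown here.

```python
import pandas as pd

# Toy speaker-conversation table with the documented columns.
df = pd.DataFrame({
    "speaker": ["alice", "alice", "bob", "bob"],
    "convo_id": ["c1", "c2", "c1", "c3"],
    "convo_idx": [0, 1, 0, 1],
    "tokens": [["hi"], ["hello", "again"], ["hey"], ["bye"]],
})

def cmp_select_fn(df, aux):
    # Score only entries from a speaker's early conversations.
    return df.convo_idx < aux["max_idx"]

def ref_select_fn(df, aux):
    # Allow every entry to serve as reference material.
    return pd.Series(True, index=df.index)

def select_fn(df, row, aux):
    # For a given row, pick other speakers' entries in the same conversation.
    return (df.convo_id == row.convo_id) & (df.speaker != row.speaker)

aux = {"max_idx": 1}
cmp_mask = cmp_select_fn(df, aux)
ref_mask = ref_select_fn(df, aux)
row_mask = select_fn(df, df.iloc[0], aux)
```

Each function returns a boolean Series aligned with df, which is what the parameter descriptions call a boolean mask.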

transform(corpus)

Modify the provided corpus. This is an abstract method that must be implemented by any Transformer subclass

Parameters

corpus – the Corpus to transform

Returns

modified version of the input Corpus. Note that unlike the scikit-learn equivalent, transform() operates inplace on the Corpus (though for convenience and compatibility with scikit-learn, it also returns the modified Corpus).

class convokit.speakerConvoDiversity.speakerConvoDiversity.SpeakerConvoDiversityWrapper(output_field='div', lifestage_size=20, max_exp=120, sample_size=200, min_n_utterances=1, n_iters=50, cohort_delta=5184000, verbosity=100)

Implements methodology for calculating linguistic diversity per life-stage. A wrapper around SpeakerConvoDiversity.

Outputs the following (speaker, conversation) attributes:
  • div__self (within-diversity)

  • div__other (across-diversity)

  • div__adj (relative diversity)

Note that np.nan is returned for (speaker, conversation) pairs without enough text.

Parameters
  • output_field – prefix of attributes to output, defaults to ‘div’

  • lifestage_size – number of conversations per lifestage

  • max_exp – highest experience level (i.e., number of conversations taken part in) to compute diversity scores for.

  • sample_size – number of words to sample per convo

  • min_n_utterances – minimum number of utterances a speaker contributes per convo for that (speaker, convo) to get scored

  • n_iters – number of samples to take for perplexity scoring

  • cohort_delta – maximum timespan between the times at which speakers start for them to be counted as part of the same cohort. Defaults to 2 months (5184000, i.e., 60 days expressed in seconds).

  • verbosity – amount of output to print
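The arithmetic behind the defaults can be checked with a short sketch. It assumes that cohort_delta is measured in seconds and that life-stages are consecutive blocks of lifestage_size conversations counted from a 0-based conversation index; the `lifestage_of` helper is hypothetical and not part of ConvoKit.

```python
SECONDS_PER_DAY = 24 * 60 * 60

lifestage_size = 20      # default: conversations per life-stage
max_exp = 120            # default: highest experience level scored
cohort_delta = 5184000   # default cohort window

# Read as seconds, the default cohort window is 60 days (about 2 months).
cohort_days = cohort_delta / SECONDS_PER_DAY

def lifestage_of(convo_idx, size=lifestage_size):
    """Assumed mapping: 0-based conversation index -> life-stage number."""
    return convo_idx // size

# Number of complete life-stages covered by the default settings.
n_lifestages = max_exp // lifestage_size
```

Under these assumptions, the defaults cover six life-stages of 20 conversations each.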

transform(corpus)

Modify the provided corpus. This is an abstract method that must be implemented by any Transformer subclass

Parameters

corpus – the Corpus to transform

Returns

modified version of the input Corpus. Note that unlike the scikit-learn equivalent, transform() operates inplace on the Corpus (though for convenience and compatibility with scikit-learn, it also returns the modified Corpus).

convokit.speakerConvoDiversity.speakerConvoDiversity.compute_divergences(cmp_tokens, ref_token_list, aux_input={'cmp_sample_size': 200, 'n_iters': 50, 'ref_sample_size': 1000})

Computes the linguistic divergence between a text cmp_tokens and a set of reference texts ref_token_list. In particular, implements a sampling-based unigram perplexity score (the sampling ensures that scores are not affected by text length).

This function takes several parameters through the aux_input argument:
  • cmp_sample_size: the number of tokens to sample from the analyzed text cmp_tokens. The function returns np.nan if cmp_tokens does not have that many tokens.

  • ref_sample_size: the number of tokens to sample from each reference text. Setting this to be larger than cmp_sample_size typically makes sense, especially in the (typical) use case where language models are trained on longer texts. If none of the texts in ref_token_list pass this length threshold, the function returns np.nan.

  • n_iters: the number of times to compute divergence.

Parameters
  • cmp_tokens – the text to compute the divergence of (relative to the texts in ref_token_list), given as a list of tokens.

  • ref_token_list – the texts on which to train the reference language models against which cmp_tokens is compared. Each entry in the list is a list of tokens.

  • aux_input – additional parameters (see above)

Returns

a perplexity score if the texts are of sufficient length; otherwise np.nan.
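The general idea of a sampling-based unigram score can be sketched in plain Python. This is not ConvoKit's implementation: the function name, the add-one smoothing, the random choice of one reference text per iteration, and the use of average cross-entropy in nats (whose exponential is a perplexity) are all assumptions of the sketch. Only the aux_input keys and the NaN behavior follow the description above.

```python
import math
import random
from collections import Counter

def sampled_unigram_divergence(cmp_tokens, ref_token_list, aux_input=None, seed=0):
    """Average cross-entropy of sampled cmp tokens under unigram models fit
    to sampled reference tokens; NaN if the texts are too short."""
    aux = {"cmp_sample_size": 200, "ref_sample_size": 1000, "n_iters": 50}
    aux.update(aux_input or {})
    rng = random.Random(seed)
    # Keep only reference texts long enough to sample from.
    refs = [r for r in ref_token_list if len(r) >= aux["ref_sample_size"]]
    if len(cmp_tokens) < aux["cmp_sample_size"] or not refs:
        return float("nan")
    scores = []
    for _ in range(aux["n_iters"]):
        ref_sample = rng.sample(rng.choice(refs), aux["ref_sample_size"])
        cmp_sample = rng.sample(cmp_tokens, aux["cmp_sample_size"])
        counts = Counter(ref_sample)
        vocab = len(counts) + 1            # one extra slot for unseen tokens
        total = len(ref_sample) + vocab
        # Add-one smoothing so unseen tokens get nonzero probability.
        nll = -sum(math.log((counts[t] + 1) / total) for t in cmp_sample)
        scores.append(nll / len(cmp_sample))
    return sum(scores) / len(scores)

small = {"cmp_sample_size": 10, "ref_sample_size": 50, "n_iters": 5}
reference = [["a", "b"] * 50]
same = sampled_unigram_divergence(["a", "b"] * 10, reference, small)
different = sampled_unigram_divergence(["x", "y"] * 10, reference, small)
too_short = sampled_unigram_divergence(["a"], reference, small)
```

Because every iteration samples a fixed number of tokens from both sides, a long text and a short text are scored on the same footing, which is the point of the sampling step.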

convokit.speakerConvoDiversity.speakerConvoDiversity.compute_speaker_convo_divergence(input_table, cmp_select_fn=<function <lambda>>, ref_select_fn=<function <lambda>>, select_fn=<function <lambda>>, divergence_fn=<function compute_divergences>, groupby=[], aux_input={}, verbosity=0)

Given a table of speaker-conversation entries, computes linguistic divergences between each speaker-conversation entry and reference text. See SpeakerConvoDiversity for further explanation of the arguments.

The function operates on a table which has as columns:
  • speaker: speaker ID

  • convo_id: conversation ID

  • convo_idx: n where this conversation is the nth that the speaker participated in

  • tokens: all utterances the speaker contributed to the conversation, concatenated together as a single list of words

  • any other speaker-conversation, speaker, or conversation-level metadata required to filter input and select reference language models per speaker-conversation.

Parameters
  • cmp_select_fn – the subset of speaker-conversation entries to compute divergences for. Function of the form fn(df, aux), where df is a data frame indexed by speaker-conversation and aux is any auxiliary parameters required; returns a boolean mask over the data frame.

  • ref_select_fn – the subset of speaker-conversation entries to compute reference language models over. Function of the form fn(df, aux), where df is a data frame indexed by speaker-conversation and aux is any auxiliary parameters required; returns a boolean mask over the data frame.

  • select_fn – selects, for a given speaker-conversation entry, the reference entries it is compared against. Function of the form fn(df, row, aux), where df is a data frame indexed by speaker-conversation, row is a single row of such a data frame, and aux is any auxiliary parameters required; returns a boolean mask over the data frame.

  • divergence_fn – function to compute divergence between a speaker-conversation and reference texts. By default, the transformer will compute unigram perplexity scores, as implemented by the compute_divergences function. However, you can also specify your own divergence function (e.g., some sort of bigram divergence) using the same function signature.

  • groupby – whether to aggregate the reference texts according to the specified keys (leave empty to avoid aggregation).

  • aux_input – a dictionary of auxiliary input to the selector functions and the divergence computation

  • verbosity – frequency of status messages.
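A custom divergence function only needs to accept the same three arguments as compute_divergences. The sketch below is a hypothetical bigram-based example, not something shipped with ConvoKit: it scores a text by one minus the Jaccard overlap of its bigram set with each reference text, and returns NaN when there is nothing to compare, mirroring the np.nan convention of the default.

```python
def bigram_jaccard_divergence(cmp_tokens, ref_token_list, aux_input=None):
    """Hypothetical divergence: 1 - Jaccard overlap of bigram sets, averaged
    over the reference texts. Returns NaN when nothing can be compared."""
    def bigrams(tokens):
        return set(zip(tokens, tokens[1:]))

    cmp_bigrams = bigrams(cmp_tokens)
    scores = []
    for ref_tokens in ref_token_list:
        ref_bigrams = bigrams(ref_tokens)
        union = cmp_bigrams | ref_bigrams
        if union:  # skip pairs where neither text has any bigram
            scores.append(1 - len(cmp_bigrams & ref_bigrams) / len(union))
    return sum(scores) / len(scores) if scores else float("nan")

identical = bigram_jaccard_divergence(["a", "b", "c"], [["a", "b", "c"]])
disjoint = bigram_jaccard_divergence(["a", "b"], [["c", "d"]])
empty = bigram_jaccard_divergence(["a"], [["b"]])
```

Such a function would be passed as divergence_fn; unlike the default, this particular sketch does no sampling, so its scores are sensitive to text length.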