User Convo Diversity

Implements linguistic diversity measures as described in this paper.

Example usage: user conversation attributes

class convokit.userConvoDiversity.userConvoDiversity.UserConvoDiversity(output_field, cmp_select_fn=<function UserConvoDiversity.<lambda>>, ref_select_fn=<function UserConvoDiversity.<lambda>>, select_fn=<function UserConvoDiversity.<lambda>>, divergence_fn=<function compute_divergences>, user_convo_cols=[], user_cols=[], convo_cols=[], groupby=[], aux_input={}, recompute_tokens=False, verbosity=0)

implements a methodology to compute the linguistic divergence between a user’s activity in each conversation in a corpus (i.e., the language of their utterances) and a reference language model trained over a different set of conversations/users. See UserConvoDiversityWrapper for a more specific implementation, which compares language used by individuals within fixed lifestages; see the implementation of that wrapper for examples of calls to this transformer.

The transformer assumes that a corpus has already been tokenized (via a call to TextParser).

In general, this is appropriate for cases when the reference language model you wish to compare against varies across different users/conversations; in contrast, if you wish to compare many conversations to a _single_ language model (e.g., one trained on past conversations), then this transformer will be inefficient.

This will produce attributes per user-conversation (i.e., the behavior of a user in a conversation); hence it takes as parameters functions that subset the data at the user-conversation level. These functions operate on a table which has as columns:
  • user: user ID

  • convo_id: conversation ID

  • convo_idx: n where this conversation is the nth that the user participated in

  • tokens: all utterances the user contributed to the conversation, concatenated together as a single list of words

  • any other user-conversation, user, or conversation-level metadata required to filter input and select reference language models per user-conversation (passed in via the user_convo_cols, user_cols and convo_cols parameters)

The table is the output of calling Corpus.get_full_attribute_table; see documentation of that function for further reference.
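
For concreteness, here is a minimal sketch of the expected table shape, built by hand with hypothetical values; in practice this table is produced by Corpus.get_full_attribute_table rather than constructed manually:

```python
import pandas as pd

# Hypothetical illustration of the per user-conversation table described above;
# real tables come from Corpus.get_full_attribute_table and are indexed by
# user-conversation entries.
table = pd.DataFrame([
    {"user": "alice", "convo_id": "c1", "convo_idx": 0,
     "tokens": ["hello", "there", "how", "is", "everyone"]},
    {"user": "alice", "convo_id": "c7", "convo_idx": 1,
     "tokens": ["back", "again", "with", "more", "questions"]},
    {"user": "bob", "convo_id": "c1", "convo_idx": 0,
     "tokens": ["hi", "alice", "doing", "fine"]},
])
```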

The transformer supports two broad types of comparisons:
  • if groupby=[], then each text will be compared against a single reference text (specified by select_fn)

  • if groupby=[key] then each text will be compared against a set of reference texts, where each reference text represents a different chunk of the data, aggregated by key (e.g., each text could be compared against the utterances contributed by different users, such that in each iteration of a divergence computation, the text is compared against just the utterances of a single user); see the sketch following this list.
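
A hedged sketch of what the selector side of these two modes might look like, assuming the table columns listed above (the function and column usage here are illustrative, not the library's own code):

```python
# Hypothetical select_fn: for a given user-conversation row, treat every entry
# contributed by *other* users as reference material.
def other_users(df, row, aux):
    return df["user"] != row["user"]

# groupby=[]       -> the selected entries are pooled into a single reference
#                     text per row.
# groupby=["user"] -> the selected entries are aggregated per user, so each row
#                     is compared against one reference text per other user.
```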

Parameters
  • cmp_select_fn – the subset of user-conversation entries to compute divergences for. function of the form fn(df, aux) where df is a data frame indexed by user-conversation, and aux is any auxiliary parameters required; returns a boolean mask over the dataframe.

  • ref_select_fn – the subset of user-conversation entries to compute reference language models over. function of the form fn(df, aux) where df is a data frame indexed by user-conversation, and aux is any auxiliary parameters required; returns a boolean mask over the dataframe.

  • select_fn – selects, for a given user-conversation entry, the reference entries against which it is compared. function of the form fn(df, row, aux) where df is a data frame indexed by user-conversation, row is a row of a dataframe indexed by user-conversation, and aux is any auxiliary parameters required; returns a boolean mask over the dataframe.

  • divergence_fn – function to compute divergence between a user-conversation and reference texts. By default, the transformer will compute unigram perplexity scores, as implemented by the compute_divergences function. However, you can also specify your own divergence function (e.g., some sort of bigram divergence) using the same function signature.

  • user_convo_cols – additional user-convo attributes used as input to the selector functions

  • user_cols – additional user-level attributes

  • convo_cols – additional conversation-level attributes

  • groupby – list of keys by which to aggregate the reference texts (leave empty to avoid aggregation).

  • aux_input – a dictionary of auxiliary input to the selector functions and the divergence computation

  • recompute_tokens – whether to reprocess tokens by aggregating all tokens across different utterances made by a user in a conversation; by default, existing output is cached and reused.

  • verbosity – frequency of status messages.
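
Putting these parameters together, here is a hedged end-to-end sketch. The corpus name, selector functions, and sample sizes are illustrative choices, not defaults prescribed by the library:

```python
import numpy as np
from convokit import Corpus, TextParser, download
from convokit.userConvoDiversity.userConvoDiversity import UserConvoDiversity

# any corpus with enough per-user activity will do; subreddit-Cornell is one example
corpus = Corpus(filename=download("subreddit-Cornell"))
corpus = TextParser().transform(corpus)  # the transformer assumes tokenized text

ucd = UserConvoDiversity(
    output_field="div",
    # score every user-conversation entry...
    cmp_select_fn=lambda df, aux: np.ones(len(df)).astype(bool),
    # ...allow every entry to contribute to reference language models...
    ref_select_fn=lambda df, aux: np.ones(len(df)).astype(bool),
    # ...but, per row, only compare against entries from *other* users.
    select_fn=lambda df, row, aux: df["user"] != row["user"],
    aux_input={"cmp_sample_size": 200, "ref_sample_size": 1000, "n_iters": 50},
    verbosity=100,
)
corpus = ucd.transform(corpus)
```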

transform(corpus)

Modifies the provided corpus, adding the computed per user-conversation attributes. (This implements the abstract transform() method required of any Transformer subclass.)

Parameters

corpus – the Corpus to transform

Returns

modified version of the input Corpus. Note that unlike the scikit-learn equivalent, transform() operates inplace on the Corpus (though for convenience and compatibility with scikit-learn, it also returns the modified Corpus).

class convokit.userConvoDiversity.userConvoDiversity.UserConvoDiversityWrapper(output_field='div', lifestage_size=20, max_exp=120, sample_size=200, min_n_utterances=1, n_iters=50, cohort_delta=5184000, verbosity=100)

implements a methodology for calculating linguistic diversity per lifestage. A wrapper around UserConvoDiversity.

Outputs the following (user, conversation) attributes:
  • div__self (within-diversity)

  • div__other (across-diversity)

  • div__adj (relative diversity)

Note that np.nan is returned for (user, conversation) pairs without enough text.

Parameters
  • output_field – prefix of attributes to output, defaults to ‘div’

  • lifestage_size – number of conversations per lifestage

  • max_exp – highest experience level (i.e., number of conversations participated in) to compute diversity scores for.

  • sample_size – number of words to sample per convo

  • min_n_utterances – minimum number of utterances a user contributes per convo for that (user, convo) to get scored

  • n_iters – number of samples to take for perplexity scoring

  • cohort_delta – timespan between when users start within which they are counted as part of the same cohort; defaults to 5184000 seconds (2 months)

  • verbosity – amount of output to print
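
A hedged sketch of the wrapper in use, mirroring the transformer example above; the corpus name is again illustrative, and the parameter values simply restate the defaults:

```python
from convokit import Corpus, TextParser, download
from convokit.userConvoDiversity.userConvoDiversity import UserConvoDiversityWrapper

corpus = Corpus(filename=download("subreddit-Cornell"))
corpus = TextParser().transform(corpus)

# lifestages of 20 conversations, scored up to 120 conversations of experience,
# sampling 200 words per conversation over 50 iterations
div = UserConvoDiversityWrapper(
    output_field="div",
    lifestage_size=20,
    max_exp=120,
    sample_size=200,
    min_n_utterances=1,
    n_iters=50,
    verbosity=1000,
)
corpus = div.transform(corpus)
# div__self, div__other and div__adj are now stored as (user, conversation)
# attributes; entries without enough text receive np.nan.
```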

transform(corpus)

Modifies the provided corpus, adding the diversity attributes listed above. (This implements the abstract transform() method required of any Transformer subclass.)

Parameters

corpus – the Corpus to transform

Returns

modified version of the input Corpus. Note that unlike the scikit-learn equivalent, transform() operates inplace on the Corpus (though for convenience and compatibility with scikit-learn, it also returns the modified Corpus).

convokit.userConvoDiversity.userConvoDiversity.compute_divergences(cmp_tokens, ref_token_list, aux_input={'cmp_sample_size': 200, 'n_iters': 50, 'ref_sample_size': 1000})

computes the linguistic divergence between a text cmp_tokens and a set of reference texts ref_token_list. in particular, implements a sampling-based unigram perplexity score (where the sampling is done to ensure that we do not incur length-based effects)

this function takes in several parameters, through the aux_input argument:
  • cmp_sample_size: the number of tokens to sample from the analyzed text cmp_tokens. the function returns np.nan if cmp_tokens doesn’t have that many tokens.

  • ref_sample_size: the number of tokens to sample from each reference text. typically setting this to be larger than cmp_sample_size makes sense, especially in the (typical) use case where language models are trained on longer texts. if none of the texts in ref_token_list pass this length threshold then the function returns np.nan.

  • n_iters: the number of times to compute divergence.

Parameters
  • cmp_tokens – the text to compute the divergence of (relative to texts in ref_token_list), given as a list of tokens.

  • ref_token_list – the texts on which to train reference language models against which cmp_tokens is compared. each entry in the list is a list of tokens.

  • aux_input – additional parameters (see above)

Returns

if texts are of sufficient length, returns a perplexity score, else returns np.nan
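
A hedged standalone sketch of compute_divergences on synthetic token lists; the sample sizes are shrunk via aux_input so that these short toy texts clear the length thresholds described above:

```python
import random
from convokit.userConvoDiversity.userConvoDiversity import compute_divergences

random.seed(0)
vocab = ["the", "a", "cat", "dog", "sat", "ran", "on", "mat", "fast", "slow"]

cmp_tokens = [random.choice(vocab) for _ in range(60)]       # text to score
ref_token_list = [
    [random.choice(vocab) for _ in range(300)],              # reference text 1
    [random.choice(vocab) for _ in range(300)],              # reference text 2
]

score = compute_divergences(
    cmp_tokens,
    ref_token_list,
    aux_input={"cmp_sample_size": 50, "ref_sample_size": 250, "n_iters": 10},
)
print(score)  # a unigram perplexity score, or np.nan if the texts were too short
```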

convokit.userConvoDiversity.userConvoDiversity.compute_user_convo_divergence(input_table, cmp_select_fn=<function <lambda>>, ref_select_fn=<function <lambda>>, select_fn=<function <lambda>>, divergence_fn=<function compute_divergences>, groupby=[], aux_input={}, verbosity=0)

given a table of user-conversation entries, computes the linguistic divergence between each user-conversation entry and its reference text(s). See UserConvoDiversity for further explanation of the arguments.

The function operates on a table which has as columns:
  • user: user ID

  • convo_id: conversation ID

  • convo_idx: n where this conversation is the nth that the user participated in

  • tokens: all utterances the user contributed to the conversation, concatenated together as a single list of words

  • any other user-conversation, user, or conversation-level metadata required to filter input and select reference language models per user-conversation.

Parameters
  • cmp_select_fn – the subset of user-conversation entries to compute divergences for. function of the form fn(df, aux) where df is a data frame indexed by user-conversation, and aux is any auxiliary parameters required; returns a boolean mask over the dataframe.

  • ref_select_fn – the subset of user-conversation entries to compute reference language models over. function of the form fn(df, aux) where df is a data frame indexed by user-conversation, and aux is any auxiliary parameters required; returns a boolean mask over the dataframe.

  • select_fn – selects, for a given user-conversation entry, the reference entries against which it is compared. function of the form fn(df, row, aux) where df is a data frame indexed by user-conversation, row is a row of a dataframe indexed by user-conversation, and aux is any auxiliary parameters required; returns a boolean mask over the dataframe.

  • divergence_fn – function to compute divergence between a user-conversation and reference texts. By default, the transformer will compute unigram perplexity scores, as implemented by the compute_divergences function. However, you can also specify your own divergence function (e.g., some sort of bigram divergence) using the same function signature.

  • groupby – list of keys by which to aggregate the reference texts (leave empty to avoid aggregation).

  • aux_input – a dictionary of auxiliary input to the selector functions and the divergence computation

  • verbosity – frequency of status messages.
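
As noted for the divergence_fn parameter, a custom divergence can be supplied as long as it follows the same signature as compute_divergences. A hedged sketch of such a drop-in (a simple vocabulary-overlap score, purely illustrative and not part of the library):

```python
import numpy as np

def jaccard_divergence(cmp_tokens, ref_token_list, aux_input={}):
    """Illustrative drop-in for divergence_fn: same signature as
    compute_divergences, scoring 1 minus the Jaccard overlap between the
    vocabulary of cmp_tokens and each reference text, averaged over references."""
    cmp_vocab = set(cmp_tokens)
    if not cmp_vocab:
        return np.nan
    scores = []
    for ref_tokens in ref_token_list:
        union = cmp_vocab | set(ref_tokens)
        if union:
            scores.append(1 - len(cmp_vocab & set(ref_tokens)) / len(union))
    return np.mean(scores) if scores else np.nan

# e.g., pass divergence_fn=jaccard_divergence (plus any keys it reads from
# aux_input) to UserConvoDiversity or compute_user_convo_divergence.
```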