User Convo Diversity¶
Implements linguistic diversity measures as described in this paper.
Example usage: user conversation attributes
-
class convokit.userConvoDiversity.userConvoDiversity.UserConvoDiversity(output_field, cmp_select_fn=<function UserConvoDiversity.<lambda>>, ref_select_fn=<function UserConvoDiversity.<lambda>>, select_fn=<function UserConvoDiversity.<lambda>>, divergence_fn=<function compute_divergences>, user_convo_cols=[], user_cols=[], convo_cols=[], groupby=[], aux_input={}, recompute_tokens=False, verbosity=0)¶

Implements a methodology to compute the linguistic divergence between a user's activity in each conversation in a corpus (i.e., the language of their utterances) and a reference language model trained over a different set of conversations/users. See UserConvoDiversityWrapper for a more specific implementation that compares language used by individuals within fixed life-stages, and see the implementation of that wrapper for examples of calls to this transformer.
The transformer assumes that a corpus has already been tokenized (via a call to TextParser).
In general, this is appropriate for cases where the reference language model you wish to compare against varies across different users/conversations; in contrast, if you wish to compare many conversations to a _single_ language model (e.g., one trained on past conversations), then this transformer will be inefficient.
- This transformer produces attributes per user-conversation (i.e., the behavior of a user in a conversation); hence it takes as parameters functions that subset the data at the user-conversation level. These functions operate on a table which has as columns:
user: user ID
convo_id: conversation ID
convo_idx: n where this conversation is the nth that the user participated in
tokens: all utterances the user contributed to the conversation, concatenated together as a single list of words
any other user-conversation, user, or conversation-level metadata required to filter input and select reference language models per user-conversation (passed in via the user_convo_cols, user_cols and convo_cols parameters)
The table is the output of calling Corpus.get_full_attribute_table; see documentation of that function for further reference.
- The transformer supports two broad types of comparisons:
if groupby=[], then each text will be compared against a single reference text (specified by select_fn)
if groupby=[key] then each text will be compared against a set of reference texts, where each reference text represents a different chunk of the data, aggregated by key (e.g., each text could be compared against the utterances contributed by different users, such that in each iteration of a divergence computation, the text is compared against just the utterances of a single user.)
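As a minimal pandas sketch of the groupby=[key] case (the table and its values are invented for illustration; column names follow the documented schema), each reference "text" is the concatenation of all token lists sharing one value of the key:

```python
import pandas as pd

# Hypothetical user-conversation attribute table, one row per (user, convo).
df = pd.DataFrame({
    "user": ["a", "a", "b", "b"],
    "convo_id": ["c1", "c2", "c1", "c3"],
    "tokens": [["hi", "there"], ["ok"], ["hi"], ["bye", "now"]],
})

# groupby=['user']: each reference text pools all tokens contributed by one
# user, so a compared text is scored against one reference model per user.
ref_token_list = [
    sum(group["tokens"].tolist(), [])  # flatten that user's token lists
    for _, group in df.groupby("user")
]
print(ref_token_list)  # [['hi', 'there', 'ok'], ['hi', 'bye', 'now']]
```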
- Parameters
cmp_select_fn – the subset of user-conversation entries to compute divergences for. Function of the form fn(df, aux) where df is a data frame indexed by user-conversation, and aux is any auxiliary parameters required; returns a boolean mask over the dataframe.
ref_select_fn – the subset of user-conversation entries to compute reference language models over. function of the form fn(df, aux) where df is a data frame indexed by user-conversation, and aux is any auxiliary parameters required; returns a boolean mask over the dataframe.
select_fn – selects, for each user-conversation entry being scored, the reference entries to compare it against. Function of the form fn(df, row, aux) where df is a data frame indexed by user-conversation, row is a row of a dataframe indexed by user-conversation, and aux is any auxiliary parameters required; returns a boolean mask over the dataframe.
divergence_fn – function to compute divergence between a user-conversation and reference texts. By default, the transformer will compute unigram perplexity scores, as implemented by the compute_divergences function. However, you can also specify your own divergence function (e.g., some sort of bigram divergence) using the same function signature.
user_convo_cols – additional user-convo attributes used as input to the selector functions
user_cols – additional user-level attributes
convo_cols – additional conversation-level attributes
groupby – list of keys by which to aggregate the reference texts (leave empty to avoid aggregation).
aux_input – a dictionary of auxiliary input to the selector functions and the divergence computation
recompute_tokens – whether to reprocess tokens by aggregating all tokens across different utterances made by a user in a conversation. By default, will cache existing output.
verbosity – frequency of status messages.
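The selector-function signatures above can be illustrated with a small pandas sketch (the table, index format, and aux keys are invented for illustration; only the fn(df, aux) and fn(df, row, aux) shapes come from the documentation):

```python
import pandas as pd

# Hypothetical user-conversation table with a convo_idx column, per the
# documented schema; values and index labels are invented.
df = pd.DataFrame(
    {"convo_idx": [0, 1, 2, 3], "tokens": [["a"], ["b"], ["c"], ["d"]]},
    index=["u1__c1", "u1__c2", "u2__c1", "u2__c2"],
)

# cmp_select_fn / ref_select_fn take (df, aux) and return a boolean mask.
cmp_select_fn = lambda df, aux: df.convo_idx >= aux["min_idx"]
ref_select_fn = lambda df, aux: df.convo_idx < aux["min_idx"]

# select_fn takes (df, row, aux): given one entry to score, pick its
# reference entries -- here, every entry other than the scored one.
select_fn = lambda df, row, aux: df.index != row.name

aux = {"min_idx": 2}
print(df[cmp_select_fn(df, aux)].index.tolist())  # ['u2__c1', 'u2__c2']

row = df.loc["u1__c2"]
print(df[select_fn(df, row, aux)].index.tolist())  # ['u1__c1', 'u2__c1', 'u2__c2']
```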
-
transform(corpus)¶

Modify the provided corpus. This is an abstract method that must be implemented by any Transformer subclass.
- Parameters
corpus – the Corpus to transform
- Returns
modified version of the input Corpus. Note that unlike the scikit-learn equivalent, transform() operates inplace on the Corpus (though for convenience and compatibility with scikit-learn, it also returns the modified Corpus).
-
class convokit.userConvoDiversity.userConvoDiversity.UserConvoDiversityWrapper(output_field='div', lifestage_size=20, max_exp=120, sample_size=200, min_n_utterances=1, n_iters=50, cohort_delta=5184000, verbosity=100)¶

Implements a methodology for calculating linguistic diversity per life-stage. A wrapper around UserConvoDiversity.
- Outputs the following (user, conversation) attributes:
div__self (within-diversity)
div__other (across-diversity)
div__adj (relative diversity)
Note that np.nan is returned for (user, conversation) pairs with not enough text.
- Parameters
output_field – prefix of attributes to output, defaults to ‘div’
lifestage_size – number of conversations per lifestage
max_exp – highest experience level (i.e., number of conversations a user has participated in) to compute diversity scores for.
sample_size – number of words to sample per convo
min_n_utterances – minimum number of utterances a user contributes per convo for that (user, convo) to get scored
n_iters – number of samples to take for perplexity scoring
cohort_delta – timespan between when users start for them to be counted as part of the same cohort. defaults to 2 months
verbosity – amount of output to print
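Note that cohort_delta is expressed in seconds; a quick check of the default against the stated "2 months":

```python
# Default cohort_delta in seconds, per the signature above.
cohort_delta = 5184000

# 60 days * 24 hours * 60 minutes * 60 seconds = 5,184,000 seconds,
# i.e., roughly two months.
assert cohort_delta == 60 * 24 * 60 * 60
```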
-
transform(corpus)¶

Modify the provided corpus. This is an abstract method that must be implemented by any Transformer subclass.
- Parameters
corpus – the Corpus to transform
- Returns
modified version of the input Corpus. Note that unlike the scikit-learn equivalent, transform() operates inplace on the Corpus (though for convenience and compatibility with scikit-learn, it also returns the modified Corpus).
-
convokit.userConvoDiversity.userConvoDiversity.compute_divergences(cmp_tokens, ref_token_list, aux_input={'cmp_sample_size': 200, 'n_iters': 50, 'ref_sample_size': 1000})¶

Computes the linguistic divergence between a text cmp_tokens and a set of reference texts ref_token_list. In particular, implements a sampling-based unigram perplexity score (where the sampling is done to ensure that we do not incur length-based effects).
- This function takes several parameters through the aux_input argument:
cmp_sample_size: the number of tokens to sample from the analyzed text cmp_tokens. The function returns np.nan if cmp_tokens doesn't have that many tokens.
ref_sample_size: the number of tokens to sample from each reference text. Typically it makes sense to set this to be larger than cmp_sample_size, especially in the (typical) use case where language models are trained on longer texts. If none of the texts in ref_token_list pass this length threshold, then the function returns np.nan.
n_iters: the number of times to compute divergence.
- Parameters
cmp_tokens – the text to compute divergence of (relative to texts in ref_token_list), as a list of tokens.
ref_token_list – the texts on which to train reference language models against which cmp_tokens is compared. Each entry in the list is a list of tokens.
aux_input – additional parameters (see above)
- Returns
if texts are of sufficient length, returns a perplexity score, else returns np.nan
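The sampling-based scoring described above can be sketched in pure Python. This is an illustrative sketch, not ConvoKit's implementation: the function name, the add-one smoothing, and the averaging over iterations are assumptions; the real compute_divergences may smooth and aggregate differently.

```python
import math
import random

def sampled_unigram_perplexity(cmp_tokens, ref_token_list, cmp_sample_size=5,
                               ref_sample_size=10, n_iters=25, seed=0):
    """Sketch of a sampling-based unigram perplexity score.

    Each iteration samples cmp_sample_size tokens from the compared text and
    ref_sample_size tokens from each long-enough reference text, fits a
    unigram model on the pooled reference sample, and scores the compared
    sample by its average negative log-probability. Returns the mean over
    iterations, or NaN when the length thresholds are not met.
    """
    rng = random.Random(seed)
    refs = [r for r in ref_token_list if len(r) >= ref_sample_size]
    if len(cmp_tokens) < cmp_sample_size or not refs:
        return float("nan")  # mirrors the documented np.nan behavior
    scores = []
    for _ in range(n_iters):
        cmp_sample = rng.sample(cmp_tokens, cmp_sample_size)
        ref_sample = [t for r in refs for t in rng.sample(r, ref_sample_size)]
        counts = {}
        for t in ref_sample:
            counts[t] = counts.get(t, 0) + 1
        n = len(ref_sample)
        # Add-one smoothing over the joint vocabulary (an assumption here)
        # so tokens unseen in the reference sample do not yield log(0).
        vocab = set(ref_sample) | set(cmp_sample)
        scores.append(-sum(math.log((counts.get(t, 0) + 1) / (n + len(vocab)))
                           for t in cmp_sample) / cmp_sample_size)
    return sum(scores) / len(scores)
```

Sampling a fixed number of tokens from both sides is what removes length-based effects: longer texts would otherwise get systematically different perplexities.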
-
convokit.userConvoDiversity.userConvoDiversity.compute_user_convo_divergence(input_table, cmp_select_fn=<function <lambda>>, ref_select_fn=<function <lambda>>, select_fn=<function <lambda>>, divergence_fn=<function compute_divergences>, groupby=[], aux_input={}, verbosity=0)¶

Given a table of user-conversation entries, computes linguistic divergences between each user-conversation entry and reference text. See UserConvoDiversity for further explanation of arguments.
- The function operates on a table which has as columns:
user: user ID
convo_id: conversation ID
convo_idx: n where this conversation is the nth that the user participated in
tokens: all utterances the user contributed to the conversation, concatenated together as a single list of words
any other user-conversation, user, or conversation-level metadata required to filter input and select reference language models per user-conversation.
- Parameters
cmp_select_fn – the subset of user-conversation entries to compute divergences for. Function of the form fn(df, aux) where df is a data frame indexed by user-conversation, and aux is any auxiliary parameters required; returns a boolean mask over the dataframe.
ref_select_fn – the subset of user-conversation entries to compute reference language models over. function of the form fn(df, aux) where df is a data frame indexed by user-conversation, and aux is any auxiliary parameters required; returns a boolean mask over the dataframe.
select_fn – selects, for each user-conversation entry being scored, the reference entries to compare it against. Function of the form fn(df, row, aux) where df is a data frame indexed by user-conversation, row is a row of a dataframe indexed by user-conversation, and aux is any auxiliary parameters required; returns a boolean mask over the dataframe.
divergence_fn – function to compute divergence between a user-conversation and reference texts. By default, the transformer will compute unigram perplexity scores, as implemented by the compute_divergences function. However, you can also specify your own divergence function (e.g., some sort of bigram divergence) using the same function signature.
groupby – list of keys by which to aggregate the reference texts (leave empty to avoid aggregation).
aux_input – a dictionary of auxiliary input to the selector functions and the divergence computation
verbosity – frequency of status messages.
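The control flow of such a function can be sketched as follows. Everything here is invented for illustration (the function name, the toy divergence, the table values; the groupby path is omitted); only the argument roles come from the documentation above.

```python
import pandas as pd

def compute_divergence_table(input_table, cmp_select_fn, ref_select_fn,
                             select_fn, divergence_fn, aux_input={}):
    """Sketch of the driver loop: for each entry chosen by cmp_select_fn,
    gather the reference entries chosen by ref_select_fn and select_fn,
    then score it with divergence_fn."""
    ref_table = input_table[ref_select_fn(input_table, aux_input)]
    results = {}
    for idx, row in input_table[cmp_select_fn(input_table, aux_input)].iterrows():
        refs = ref_table[select_fn(ref_table, row, aux_input)]
        results[idx] = divergence_fn(row["tokens"], refs["tokens"].tolist(), aux_input)
    return pd.Series(results)

# Toy divergence for the sketch: fraction of the compared vocabulary that
# also appears in the pooled reference tokens.
toy_div = lambda toks, refs, aux: len(set(toks) & {t for r in refs for t in r}) / len(set(toks))

df = pd.DataFrame({
    "convo_idx": [0, 1, 0, 1],
    "tokens": [["hi", "all"], ["hi", "again"], ["bye"], ["bye", "now"]],
}, index=["u1__c1", "u1__c2", "u2__c1", "u2__c2"])

scores = compute_divergence_table(
    df,
    cmp_select_fn=lambda d, aux: d.convo_idx == 1,      # score later convos
    ref_select_fn=lambda d, aux: d.convo_idx == 0,      # references: first convos
    select_fn=lambda d, row, aux: d.index != row.name,  # exclude the scored entry
    divergence_fn=toy_div,
)
print(scores)  # u1__c2 and u2__c2 each overlap on one of their two token types
```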