# Expected Context Framework

Implements the Expected Context Framework as described in this dissertation.

class convokit.expected_context_framework.expected_context_model.ExpectedContextModelTransformer(context_field, output_prefix, vect_field, context_vect_field=None, n_svd_dims=25, snip_first_dim=True, n_clusters=8, cluster_on='utts', model=None, random_state=None, cluster_random_state=None)

Transformer that derives representations of terms and utterances in terms of their conversational context, i.e., context-utterances that occur near an utterance, or utterances containing a term. Typically, the conversational context consists of immediate replies (“forwards context”) or predecessors (“backwards context”), though this can be specified by the user via the context_field argument.

The underlying model in the transformer, implemented as the ExpectedContextModel class, is fitted given input training data consisting of pairs of utterances and context-utterances, represented as feature vectors (e.g., tf-idf reweighted term-document matrices), specified via the vect_field and context_vect_field arguments. This model is stored as the ec_model attribute of the transformer, and can be accessed as such. In the fit step, the model, which is based off of latent semantic analysis (LSA), computes the following:

• representations of terms and utterances in the training data, with respect to the context, along with representations of the context (which are derived in the underlying LSA step). The dimensionality of these representations is specified via the n_svd_dims argument (see also the snip_first_dim and random_state arguments). These can be accessed via various get functions that the transformer provides.
• a term-level statistic, “range”, measuring the variation in context-utterances associated with a term. One interpretation of this statistic is that it quantifies the “strengths of our expectations” of what reply a term typically gets, or what predecessors it typically follows.
• a clustering of utterance, term and context representations. The resultant clusters can help interpret the representations the model derives, by highlighting salient groupings that emerge. The number of clusters is specified via the n_clusters argument; the print_clusters function can be called to inspect this output. (see also the cluster_on and cluster_random_state arguments)
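The LSA step underlying these representations can be sketched with plain numpy. This is an illustrative sketch only, not ConvoKit's actual implementation: the toy matrix, its dimensions, and the row normalization are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for a tf-idf reweighted term-document matrix
# (rows: 40 utterances, columns: 30 terms).
X = rng.random((40, 30))

n_svd_dims, snip_first_dim = 5, True

# Truncated SVD (the LSA step). ConvoKit uses a randomized SVD
# implementation; a full SVD is fine at this toy scale.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
utt_reprs = U[:, :n_svd_dims]       # utterance representations
term_reprs = Vt[:n_svd_dims, :].T   # term representations

if snip_first_dim:
    # The first dimension tends to reflect term frequency, so it is
    # dropped, leaving (n_svd_dims - 1)-dimensional representations.
    utt_reprs = utt_reprs[:, 1:]
    term_reprs = term_reprs[:, 1:]

# Normalize rows so that distances between representations behave
# like cosine distances.
utt_reprs /= np.linalg.norm(utt_reprs, axis=1, keepdims=True)
print(utt_reprs.shape, term_reprs.shape)  # (40, 4) (30, 4)
```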

An instance of the transformer can be initialized with an instance of another, fitted transformer, via the model argument. This ensures that both transformers derive representations that are comparable, i.e., can be interpreted as being part of the same vector space, with distances between representations that are well-defined. As an example of when this might be useful, we may wish to compare representations derived with respect to expected replies, with representations pertaining to expected predecessors.

The transformer contains various functions to access term-level characterizations. In the transform step, it outputs vector representations of utterances, stored as <output_prefix>_repr in the corpus. It also outputs various attributes of utterances (names prefixed with <output_prefix>_), stored as metadata fields in each transformed utterance:

• range: the range of the utterance
• clustering.cluster: the name of the cluster the utterance has been assigned to
• clustering.cluster_id_: the numerical ID of the cluster the utterance has been assigned to (from 0 to n_clusters-1)
• clustering.cluster_dist: the distance between the utterance representation and the centroid of its cluster
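The "range" statistic admits a simple geometric intuition: a term whose associated context-utterances cluster tightly reflects strong expectations, while one whose contexts scatter reflects weak ones. Below is a purely illustrative dispersion measure in that spirit; the function, data, and formula here are made up, and the precise definition is the one given in the dissertation.

```python
import numpy as np

rng = np.random.default_rng(1)

def dispersion(context_reprs: np.ndarray) -> float:
    """Mean distance of unit-normalized context representations
    from their average direction (illustrative only)."""
    unit = context_reprs / np.linalg.norm(context_reprs, axis=1, keepdims=True)
    centroid = unit.mean(axis=0)
    return float(np.mean(np.linalg.norm(unit - centroid, axis=1)))

# Tightly clustered contexts -> strong expectations -> small dispersion.
tight = rng.normal([5.0, 0.0, 0.0], 0.1, size=(50, 3))
# Scattered contexts -> weak expectations -> large dispersion.
scattered = rng.normal(0.0, 1.0, size=(50, 3))

assert dispersion(tight) < dispersion(scattered)
```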
Parameters:
• context_field – the name of an utterance-level attribute containing the ID of the corresponding context-utterance. In particular, to use immediate predecessors as context, set context_field to ‘reply_to’. As another example, to use immediate replies, provided that utterances contain an attribute next_id containing the ID of their reply, set context_field to ‘next_id’.
• output_prefix – the name of the attributes and vectors to write to in the transform step. The transformer outputs several fields, which will be prefixed with the given string.
• vect_field – the name of the vectors to use as the input vector representation for utterances, as stored in a corpus.
• context_vect_field – the name of the vectors to use as input vector representations for context-utterances, as stored in a corpus. By default, the transformer will use the same vector representations as for utterances, specified in vect_field. If you expect that utterances and context-utterances will differ in some way (e.g., they come from speakers in a conversation who play clearly delineated roles), it’s a good idea to use a different input representation.
• n_svd_dims – the dimensionality of the representations to derive (via LSA/SVD).
• snip_first_dim – whether or not to remove the first dimension of the derived representations. By default this is set to True, since we’ve found that the first dimension tends to reflect term frequency, making the output less informative. Note that if snip_first_dim=True then in practice, we output (n_svd_dims - 1)-dimensional representations.
• n_clusters – the number of clusters to infer.
• cluster_on – whether to cluster on utterance or term representations (corresponding to values ‘utts’ or ‘terms’). By default, we infer clusters based on representations of the utterances from the training data, and then assign term and context-utterance representations to the resultant clusters. In some cases (e.g., if utterances are highly unstructured and lengthy) it might be better to cluster term representations first.
• model – an existing, fitted ExpectedContextModelTransformer object to initialize with (optional).
• random_state – the random seed to use in the LSA step (which calls a randomized implementation of SVD).
• cluster_random_state – the random seed to use to infer clusters.
fit(corpus, y=None, selector=<function ExpectedContextModelTransformer.<lambda>>, context_selector=<function ExpectedContextModelTransformer.<lambda>>)

Fits an ExpectedContextModelTransformer transformer over training data: derives representations of terms, utterances and contexts, range statistics for terms, and a clustering of the resultant representations.

Parameters:
• corpus – Corpus containing training data
• selector – a boolean function of signature filter(utterance) that determines which utterances will be considered in the fit step. Defaults to using all utterances.
• context_selector – a boolean function of signature filter(utterance) that determines which context-utterances will be considered in the fit step. Defaults to using all utterances.

Returns: None
transform(corpus, selector=<function ExpectedContextModelTransformer.<lambda>>)

Computes vector representations, ranges, and cluster assignments for utterances in a corpus.

Parameters:
• corpus – Corpus
• selector – a boolean function of signature filter(utterance) that determines which utterances to transform. Defaults to all utterances.

Returns: the Corpus, with per-utterance representations, ranges and cluster assignments
transform_utterance(utt)

Computes vector representation, range, and cluster assignment for a single utterance. Note that the utterance must contain the input representation as a metadata field, specified by what was passed into the constructor as the vect_field argument. Will write all of these characterizations (including vectors) to the utterance’s metadata; attribute names are prefixed with the output_prefix constructor argument.

Parameters:
• utt – Utterance

Returns: the utterance, with per-utterance representation, range and cluster assignments
compute_utt_ranges(corpus, selector=<function ExpectedContextModelTransformer.<lambda>>)

Computes utterance ranges.

Parameters:
• corpus – Corpus
• selector – determines which utterances to compute ranges for

Returns: the Corpus, with per-utterance ranges
transform_context_utts(corpus, selector=<function ExpectedContextModelTransformer.<lambda>>)

Computes representations of context-utterances, along with cluster assignments.

Parameters:
• corpus – Corpus
• selector – determines which utterances to compute representations for

Returns: the Corpus, with per-utterance representations and cluster assignments
fit_clusters(n_clusters='default', random_state='default')

Infers a clustering of term or utterance representations (specified by the cluster_on argument used to initialize the transformer) on the training data originally used to fit the transformer. Can be called to infer a different number of clusters than what was initially specified.

Parameters:
• n_clusters – number of clusters to infer. Defaults to the number of clusters specified when initializing the transformer.
• random_state – random seed used to infer clusters. Defaults to the random seed used to initialize the transformer.

Returns: None
compute_clusters(corpus, selector=<function ExpectedContextModelTransformer.<lambda>>, is_context=False)

Assigns utterances in a corpus, for which expected context representations have already been computed, to inferred clusters.

Parameters:
• corpus – Corpus
• selector – determines which utterances to compute clusterings for
• is_context – whether to treat input data as utterances or as context-utterances

Returns: a DataFrame containing cluster assignment information for each utterance
set_cluster_names(cluster_names)

Assigns names to inferred clusters. May be called after inspecting the output of print_clusters.

Parameters:
• cluster_names – a list of names, where cluster_names[i] is the name of the cluster with cluster_id_ i

Returns: None
get_cluster_names()

Returns the names of the inferred clusters.

Returns: list of cluster names where cluster_names[i] is the name of the cluster with cluster_id_ i.
print_clusters(k=10, max_chars=1000, corpus=None)

Prints representative terms, utterances and context-utterances for each inferred type. Can be inspected to help interpret the transformer’s output. By default, will only print out terms and context terms; if the corpus containing the training data is passed in, will output utterances and context-utterances as well.

Parameters:
• k – number of examples to print out
• max_chars – maximum number of characters per utterance/context-utterance to print. Can be toggled to control the size of the output.
• corpus – optional, the corpus that the transformer was trained on. If set, will print example utterances and context-utterances as well as terms.

Returns: None
print_cluster_stats()

Returns a Pandas dataframe containing the % of terms, context terms, and training utterances/context-utterances that have been assigned to each cluster.

Returns: dataframe containing cluster statistics
summarize(k=10, max_chars=1000, corpus=None)

Wrapper function to print inferred clusters and statistics about their sizes.

Parameters:
• k – number of examples to print out
• max_chars – maximum number of characters per utterance/context-utterance to print. Can be toggled to control the size of the output.
• corpus – optional, the corpus that the transformer was trained on. If set, will print example utterances and context-utterances as well as terms.

Returns: None
get_terms()

Gets the names of the terms for which the transformer has computed representations.

Returns: list of terms
get_term_ranges()

Gets the range statistics of terms.

Returns: list of term ranges. order corresponds to the ordering of terms returned via get_terms().
get_term_reprs()

Gets the derived representations of terms.

Returns: numpy array containing term representations. order of rows corresponds to the ordering of terms returned via get_terms.
get_context_terms()

Gets the names of the context terms for which the transformer has computed (LSA) representations.

Returns: list of context terms
get_context_term_reprs()

Gets the derived (LSA) representations of context terms.

Returns: numpy array containing term representations. order of rows corresponds to the ordering of terms returned via get_context_terms.
get_clustering()

Returns a dictionary containing various objects pertaining to the inferred clustering, with fields as follows:

• km_obj: the fitted KMeans object
• utts: a Pandas dataframe of cluster assignments for utterances from the training data
• terms: a dataframe of cluster assignments for terms
• context_utts: dataframe of cluster assignments for context-utterances from the training data
• context_terms: dataframe of cluster assignments for context terms
Returns: dictionary containing clustering information
load(dirname)

Loads a model from disk.

Parameters:
• dirname – directory to read model from

Returns: None
dump(dirname)

Writes a model to disk.

Parameters:
• dirname – directory to write model to

Returns: None
class convokit.expected_context_framework.expected_context_model.ExpectedContextModel(n_svd_dims=25, snip_first_dim=True, n_clusters=8, context_U=None, context_V=None, context_s=None, model=None, context_terms=None, cluster_on='utts', random_state=None, cluster_random_state=None)

Model that derives representations of terms and utterances in terms of their conversational context, i.e., context-utterances that occur near an utterance, or utterances containing a term. Typically, the conversational context consists of immediate replies (“forwards context”) or predecessors (“backwards context”), though this can be specified by the user. Can be used in ConvoKit through the ExpectedContextModelTransformer transformer; see documentation of that transformer for further details.

class convokit.expected_context_framework.expected_context_model.ClusterWrapper(n_clusters, cluster_names=None, random_state=None)

Wrapper that performs K-Means clustering. Handles model loading and dumping, formats clustering output as dataframes for convenience, and keeps track of names that an end-user can assign to clusters.
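A minimal sketch of what such a wrapper might look like, using scikit-learn and pandas. The class name, method names, and column layout below are illustrative assumptions, not ConvoKit's actual ClusterWrapper.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

class SimpleClusterWrapper:
    """Illustrative stand-in: fits KMeans, reports assignments as a
    DataFrame, and lets the caller attach human-readable names."""

    def __init__(self, n_clusters, random_state=None):
        self.km = KMeans(n_clusters=n_clusters, n_init=10,
                         random_state=random_state)
        self.cluster_names = list(range(n_clusters))

    def fit(self, X):
        self.km.fit(X)
        return self

    def set_cluster_names(self, names):
        self.cluster_names = list(names)

    def transform(self, X, ids):
        cluster_id = self.km.predict(X)
        # Distance from each point to its assigned centroid.
        dists = np.linalg.norm(X - self.km.cluster_centers_[cluster_id], axis=1)
        return pd.DataFrame({
            "cluster_id_": cluster_id,
            "cluster": [self.cluster_names[c] for c in cluster_id],
            "cluster_dist": dists,
        }, index=ids)

# Two well-separated blobs -> two clean clusters.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
cw = SimpleClusterWrapper(2, random_state=0).fit(X)
cw.set_cluster_names(["low", "high"])
df = cw.transform(X, ids=[f"utt{i}" for i in range(20)])
```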

class convokit.expected_context_framework.dual_context_wrapper.DualContextWrapper(context_fields, output_prefixes, vect_field, context_vect_field=None, wrapper_output_prefix='', n_svd_dims=25, snip_first_dim=True, n_clusters=8, cluster_on='utts', random_state=None, cluster_random_state=None)

Transformer that derives and compares characterizations of terms and utterances with respect to two different choices of conversational context. Designed in particular to contrast replies and predecessors, though other choices of context are also possible.

This is a wrapper that encompasses two instances of ExpectedContextModelTransformer, stored at the ec_models attribute. It computes two particular comparative term-level statistics, orientation and shift, stored as the term_orientations and term_shifts attributes. It also computes these statistics at the utterance level in the transform step.
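To make the comparative statistics concrete, here is one plausible formalization: given unit-normalized term representations from two comparable models, a shift-like statistic is the distance between a term's two representations, and an orientation-like statistic contrasts the term's range under each context. Everything below is an illustrative assumption; the exact definitions of orientation and shift are the ones given in the dissertation.

```python
import numpy as np

rng = np.random.default_rng(3)
n_terms, dims = 6, 4

# Hypothetical term representations from two comparable models
# (e.g., forwards- and backwards-context), unit-normalized.
fwd = rng.normal(size=(n_terms, dims))
bwd = rng.normal(size=(n_terms, dims))
fwd /= np.linalg.norm(fwd, axis=1, keepdims=True)
bwd /= np.linalg.norm(bwd, axis=1, keepdims=True)

# Illustrative "shift": how far a term's characterization moves
# between the two contexts.
shift = np.linalg.norm(fwd - bwd, axis=1)

# Illustrative "orientation": contrast between the term's range
# statistics under each model (stand-in values used here).
fwd_range = rng.random(n_terms)
bwd_range = rng.random(n_terms)
orientation = fwd_range - bwd_range
```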

Parameters:
• context_fields – list containing the names of the utterance-level attributes containing the IDs of the context-utterances used by each of the ExpectedContextModelTransformer instances.
• output_prefixes – list containing the names of the attributes and vectors that each ExpectedContextModelTransformer instance will write to in the transform step.
• vect_field – the name of the vectors to use as the input vector representation for utterances, as stored in a corpus.
• context_vect_field – the name of the vectors to use as input vector representations for context-utterances, as stored in a corpus. By default, the transformer will use the same vector representations as for utterances, specified in vect_field. If you expect that utterances and context-utterances will differ in some way (e.g., they come from speakers in a conversation who play clearly delineated roles), it’s a good idea to use a different input representation.
• wrapper_output_prefix – the prefix of the metadata fields where the utterance-level orientation and shift statistics are stored. By default, these attributes are stored as orn and shift in the metadata; if wrapper_output_prefix is specified, they are stored as <wrapper_output_prefix>_orn (orientation) and <wrapper_output_prefix>_shift (shift).
• n_svd_dims – the dimensionality of the representations to derive (via LSA/SVD).
• snip_first_dim – whether or not to remove the first dimension of the derived representations. By default this is set to True, since we’ve found that the first dimension tends to reflect term frequency, making the output less informative. Note that if snip_first_dim=True then in practice, we output (n_svd_dims - 1)-dimensional representations.
• n_clusters – the number of clusters to infer.
• cluster_on – whether to cluster on utterance or term representations (corresponding to values ‘utts’ or ‘terms’). By default, we infer clusters based on representations of the utterances from the training data, and then assign term and context-utterance representations to the resultant clusters. In some cases (e.g., if utterances are highly unstructured and lengthy) it might be better to cluster term representations first.
• random_state – the random seed to use in the LSA step (which calls a randomized implementation of SVD).
• cluster_random_state – the random seed to use to infer clusters.
fit(corpus, y=None, selector=<function DualContextWrapper.<lambda>>, context_selector=<function DualContextWrapper.<lambda>>)

Fits a transformer over training data: fits the two ExpectedContextModelTransformer instances, and computes term-level orientation and shift.

Parameters:
• corpus – Corpus containing training data
• selector – a boolean function of signature filter(utterance) that determines which utterances will be considered in the fit step. Defaults to using all utterances.
• context_selector – a boolean function of signature filter(utterance) that determines which context-utterances will be considered in the fit step. Defaults to using all utterances.

Returns: None
transform(corpus, selector=<function DualContextWrapper.<lambda>>)

Computes vector representations, ranges, and cluster assignments for utterances in a corpus, using the two ExpectedContextModelTransformer instances. Also computes utterance-level orientation and shift.

Parameters:
• corpus – Corpus
• selector – a boolean function of signature filter(utterance) that determines which utterances to transform. Defaults to all utterances.

Returns: the Corpus, with per-utterance attributes
transform_utterance(utt)

Computes vector representations, ranges, and cluster assignments for an utterance, using the two ExpectedContextModelTransformer instances. Also computes utterance-level orientation and shift. Note that the utterance must contain the input representation as a metadata field, specified by what was passed into the constructor as the vect_field argument. Will write all of these characterizations (including vectors) to the utterance’s metadata.

Parameters:
• utt – Utterance

Returns: the utterance, with per-utterance attributes
get_terms()

Gets the names of the terms for which the transformer has computed representations.

Returns: list of terms
get_term_df()

Gets a Pandas dataframe containing term-level statistics computed by the transformer (shift, orientation) and its constituent ExpectedContextModelTransformer instances (ranges).

Returns: dataframe of term-level statistics
summarize(k=10, max_chars=1000, corpus=None)

For each constituent ExpectedContextModelTransformer, prints inferred clusters and statistics about their sizes.

Parameters:
• k – number of examples to print out
• max_chars – maximum number of characters per utterance/context-utterance to print. Can be toggled to control the size of the output.
• corpus – optional, the corpus that the transformer was trained on. If set, will print example utterances and context-utterances as well as terms.

Returns: None
load(dirname, model_dirs=None)

Loads models from disk.

Parameters:
• dirname – directory to read model from
• model_dirs – optional list containing the directories (relative to dirname) in which each ExpectedContextModelTransformer is stored. Defaults to the output_prefixes argument passed at initialization.

Returns: None
dump(dirname)

Writes a model to disk. Will store each ExpectedContextModelTransformer in a separate directory with names given by the output_prefixes argument passed at initialization.

Parameters:
• dirname – directory to write model to

Returns: None
class convokit.expected_context_framework.expected_context_model_pipeline.ExpectedContextModelPipeline(context_field, output_prefix, text_field, context_text_field=None, text_pipe=None, context_text_pipe=None, tfidf_params={}, context_tfidf_params=None, share_tfidf_models=True, min_terms=0, context_min_terms=None, n_svd_dims=25, snip_first_dim=True, n_clusters=8, cluster_on='utts', ec_model=None, random_state=None, cluster_random_state=None)

Wrapper class implementing a pipeline that derives characterizations of terms and utterances in terms of their conversational context. The pipeline handles the following steps:

• processing input text (via a pipeline supplied by the user in the text_pipe argument);
• transforming text to input representation (via ColNormedTfidfTransformer);
• deriving characterizations (via ExpectedContextModelTransformer)

The ColNormedTfidfTransformer components are stored as the tfidf_model and context_tfidf_model attributes of the class; the ExpectedContextModelTransformer is stored as the ec_model attribute.

For further details, see the ColNormedTfidfTransformer and ExpectedContextModelTransformer classes.
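The first two stages (text → tf-idf input representation → LSA-derived representation) can be approximated with a scikit-learn pipeline. Note that plain TfidfVectorizer stands in here for ConvoKit's column-normalized ColNormedTfidfTransformer, so the resulting numbers would differ; the texts and dimension choice are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline

texts = [
    "how do i install the package",
    "what does this error mean",
    "thanks that fixed it",
    "you can install it with pip",
    "the error means the file is missing",
]

# Stand-in for the pipeline's representation steps: tf-idf input
# representations, followed by LSA/SVD latent representations.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("lsa", TruncatedSVD(n_components=3, random_state=0)),
])
reprs = pipe.fit_transform(texts)
print(reprs.shape)  # (5, 3)
```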

Parameters:
• context_field – the name of an utterance-level attribute containing the ID of the corresponding context-utterance. In particular, to use immediate predecessors as context, set context_field to ‘reply_to’. As another example, to use immediate replies, provided that utterances contain an attribute next_id containing the ID of their reply, set context_field to ‘next_id’.
• output_prefix – the name of the attributes and vectors to write to in the transform step. The transformer outputs several fields, which will be prefixed with the given string.
• text_field – the name of the utterance-level attribute containing the text to use as input.
• context_text_field – the name of the utterance-level attribute containing the text to use as input for context-utterances. By default, equivalent to text_field.
• text_pipe – a convokitPipeline object used to compute the contents of text_field. Defaults to populating the text_field attribute of each utterance utt with utt.text.
• context_text_pipe – a convokitPipeline object used to compute the contents of context_text_field; by default, equivalent to text_pipe.
• tfidf_params – a dictionary specifying parameters to be passed to the ColNormedTfidfTransformer object used to compute input representations of utterances.
• context_tfidf_params – a dictionary specifying parameters to be passed to the ColNormedTfidfTransformer object used to compute input representations of context-utterances. Equivalent to tfidf_params by default.
• share_tfidf_models – whether or not to use the same ColNormedTfidfTransformer for both utterances and context-utterances. Defaults to True.
• min_terms – the minimum number of terms in the vocabulary, derived by ColNormedTfidfTransformer, that an utterance must contain for it to be considered in fitting and transforming the underlying ExpectedContextModelTransformer object. Defaults to 0, meaning the transformer will consider all utterances.
• context_min_terms – minimum number of terms in the vocabulary for a context-utterance to be considered in fitting and transforming the underlying ExpectedContextModelTransformer object. Equivalent to min_terms by default.
• n_svd_dims – the dimensionality of the representations to derive (via LSA/SVD).
• snip_first_dim – whether or not to remove the first dimension of the derived representations. By default this is set to True, since we’ve found that the first dimension tends to reflect term frequency, making the output less informative. Note that if snip_first_dim=True then in practice, we output (n_svd_dims - 1)-dimensional representations.
• n_clusters – the number of clusters to infer.
• cluster_on – whether to cluster on utterance or term representations (corresponding to values ‘utts’ or ‘terms’). By default, we infer clusters based on representations of the utterances from the training data, and then assign term and context-utterance representations to the resultant clusters. In some cases (e.g., if utterances are highly unstructured and lengthy) it might be better to cluster term representations first.
• ec_model – an existing, fitted ExpectedContextModelPipeline object to initialize with (optional).
• random_state – the random seed to use in the LSA step (which calls a randomized implementation of SVD).
• cluster_random_state – the random seed to use to infer clusters.
fit(corpus, y=None, selector=<function ExpectedContextModelPipeline.<lambda>>, context_selector=<function ExpectedContextModelPipeline.<lambda>>)

Fits an ExpectedContextModelPipeline over training data: derives input and latent representations of terms, utterances and contexts, range statistics for terms, and a clustering of the resultant representations.

Parameters:
• corpus – Corpus containing training data
• selector – a boolean function of signature filter(utterance) that determines which utterances will be considered in the fit step. Defaults to using all utterances, subject to the min_terms parameter passed at initialization.
• context_selector – a boolean function of signature filter(utterance) that determines which context-utterances will be considered in the fit step. Defaults to using all utterances, subject to the context_min_terms parameter passed at initialization.

Returns: None
transform(corpus, y=None, selector=<function ExpectedContextModelPipeline.<lambda>>)

Computes vector representations, ranges, and cluster assignments for utterances in a corpus.

Parameters:
• corpus – Corpus
• selector – a boolean function of signature filter(utterance) that determines which utterances to transform

Returns: the Corpus, with per-utterance representations, ranges and cluster assignments
transform_utterance(utt)

Computes vector representation, range, and cluster assignment for a single utterance, which can be a ConvoKit Utterance or a string. Will return an Utterance object and write all of these characterizations (including vectors) to the utterance’s metadata; attribute names are prefixed with the output_prefix constructor argument.

Parameters:
• utt – Utterance or string

Returns: the utterance, with per-utterance representation, range and cluster assignments
summarize(k=10, max_chars=1000, corpus=None)

Prints inferred clusters and statistics about their sizes.

Parameters:
• k – number of examples to print out
• max_chars – maximum number of characters per utterance/context-utterance to print. Can be toggled to control the size of the output.
• corpus – optional, the corpus that the transformer was trained on. If set, will print example utterances and context-utterances as well as terms.

Returns: None
set_cluster_names(names)

Assigns names to inferred clusters. May be called after inspecting the output of print_clusters.

Parameters:
• names – a list of names, where names[i] is the name of the cluster with cluster_id_ i

Returns: None
get_cluster_names()

Returns the names of the inferred clusters.

Returns: list of cluster names where cluster_names[i] is the name of the cluster with cluster_id_ i.
get_terms()

Gets the names of the terms for which the transformer has computed representations.

Returns: list of terms
load(dirname, model_dirs=None)

Loads a model from disk.

Parameters:
• dirname – directory to read model from
• model_dirs – optional list containing the directories (relative to dirname) in which each component is stored. The order of the list is as follows: [the ExpectedContextModelTransformer, the utterance ColNormedTfidfTransformer, the context-utterance ColNormedTfidfTransformer (if share_tfidf_models is set to False at initialization)]. Defaults to [‘ec_model’, ‘tfidf_model’, ‘context_tfidf_model’].

Returns: None
dump(dirname)

Writes a model to disk.

Parameters:
• dirname – directory to write model to

Returns: None
class convokit.expected_context_framework.expected_context_model_pipeline.DualContextPipeline(context_fields, output_prefixes, text_field, context_text_field=None, wrapper_output_prefix='', text_pipe=None, context_text_pipe=None, tfidf_params={}, context_tfidf_params=None, share_tfidf_models=True, min_terms=0, context_min_terms=None, n_svd_dims=25, snip_first_dim=True, n_clusters=8, cluster_on='utts', random_state=None, cluster_random_state=None)

Wrapper class implementing a pipeline that derives characterizations of terms and utterances in terms of two choices of conversational context. The pipeline handles the following steps:

• processing input text (via a pipeline supplied by the user in the text_pipe argument);
• transforming text to input representation (via ColNormedTfidfTransformer);
• deriving characterizations (via DualContextWrapper)

The ColNormedTfidfTransformer components are stored as the tfidf_model and context_tfidf_model attributes of the class; the DualContextWrapper is stored as the dualmodel attribute.

For further details, see the ColNormedTfidfTransformer and DualContextWrapper classes.

Parameters:
• context_fields – list containing the names of the utterance-level attributes containing the IDs of the context-utterances used by each of the underlying ExpectedContextModelTransformer instances.
• output_prefixes – list containing the names of the attributes and vectors that the DualContextWrapper component will write to in the transform step.
• text_field – the name of the utterance-level attribute containing the text to use as input.
• context_text_field – the name of the utterance-level attribute containing the text to use as input for context-utterances. By default, equivalent to text_field.
• wrapper_output_prefix – the prefix of the metadata fields where the utterance-level orientation and shift statistics are stored. By default, these attributes are stored as orn and shift in the metadata; if wrapper_output_prefix is specified, they are stored as <wrapper_output_prefix>_orn (orientation) and <wrapper_output_prefix>_shift (shift).
• text_pipe – a convokitPipeline object used to compute the contents of text_field. Defaults to populating the text_field attribute of each utterance utt with utt.text.
• context_text_pipe – a convokitPipeline object used to compute the contents of context_text_field; by default, equivalent to text_pipe.
• tfidf_params – a dictionary specifying parameters to be passed to the ColNormedTfidfTransformer object used to compute input representations of utterances.
• context_tfidf_params – a dictionary specifying parameters to be passed to the ColNormedTfidfTransformer object used to compute input representations of context-utterances. Equivalent to tfidf_params by default.
• share_tfidf_models – whether or not to use the same ColNormedTfidfTransformer for both utterances and context-utterances. Defaults to True.
• min_terms – the minimum number of terms in the vocabulary, derived by ColNormedTfidfTransformer, that an utterance must contain for it to be considered in fitting and transforming the underlying ExpectedContextModelTransformer object. Defaults to 0, meaning the transformer will consider all utterances.
• context_min_terms – minimum number of terms in the vocabulary for a context-utterance to be considered in fitting and transforming the underlying ExpectedContextModelTransformer object. Equivalent to min_terms by default.
• n_svd_dims – the dimensionality of the representations to derive (via LSA/SVD).
• snip_first_dim – whether or not to remove the first dimension of the derived representations. By default this is set to True, since we’ve found that the first dimension tends to reflect term frequency, making the output less informative. Note that if snip_first_dim=True then in practice, we output (n_svd_dims - 1)-dimensional representations.
• n_clusters – the number of clusters to infer.
• cluster_on – whether to cluster on utterance or term representations (corresponding to values ‘utts’ or ‘terms’). By default, we infer clusters based on representations of the utterances from the training data, and then assign term and context-utterance representations to the resultant clusters. In some cases (e.g., if utterances are highly unstructured and lengthy) it might be better to cluster term representations first.
• random_state – the random seed to use in the LSA step (which calls a randomized implementation of SVD).
• cluster_random_state – the random seed to use to infer clusters.
fit(corpus, y=None, selector=<function DualContextPipeline.<lambda>>, context_selector=<function DualContextPipeline.<lambda>>)

Fits the model over training data.

Parameters:
• corpus – Corpus containing training data
• selector – a boolean function of signature filter(utterance) that determines which utterances will be considered in the fit step. Defaults to using all utterances, subject to the min_terms parameter passed at initialization.
• context_selector – a boolean function of signature filter(utterance) that determines which context-utterances will be considered in the fit step. Defaults to using all utterances, subject to the context_min_terms parameter passed at initialization.

Returns: None
transform(corpus, y=None, selector=<function DualContextPipeline.<lambda>>)

Computes vector representations, and statistics for utterances in a corpus, using the DualContextWrapper component.

Parameters:
• corpus – Corpus
• selector – a boolean function of signature filter(utterance) that determines which utterances to transform. Defaults to all utterances.

Returns: the Corpus, with per-utterance attributes
transform_utterance(utt)

Computes representations and statistics for a single utterance, which can be a ConvoKit Utterance or a string. Will return an Utterance object and write all of these characterizations (including vectors) to the utterance’s metadata; attribute names are prefixed with the output_prefixes constructor argument.

Parameters:
• utt – Utterance or string

Returns: the utterance, with per-utterance representation, range and cluster assignments
summarize(k=10, max_chars=1000, corpus=None)

Prints inferred clusters and statistics about their sizes, for each component in the underlying DualContextWrapper.

Parameters:
• k – number of examples to print out
• max_chars – maximum number of characters per utterance/context-utterance to print. Can be toggled to control the size of the output.
• corpus – optional, the corpus that the transformer was trained on. If set, will print example utterances and context-utterances as well as terms.

Returns: None
get_terms()

Gets the names of the terms for which the transformer has computed representations.

Returns: list of terms
get_term_df()

Gets a Pandas dataframe containing term-level statistics computed by the transformer (shift, orientation, ranges)

Returns: dataframe of term-level statistics
load(dirname, model_dirs=None)

Loads models from disk.

Parameters:
• dirname – directory to read models from
• model_dirs – optional list containing the directories (relative to dirname) in which each component is stored

Returns: None

dump(dirname)

Writes a model to disk.

Parameters:
• dirname – directory to write model to

Returns: None