Expected Context Framework
Implements the Expected Context Framework as described in this dissertation.
Contains:
Wrapper DualContextWrapper that handles two choices of conversational context
Wrapper pipelines ExpectedContextModelPipeline and DualContextPipeline
Example usage:
deriving question types and other characterizations in British parliamentary question periods
exploration of Switchboard dialog acts corpus using ExpectedContextModelTransformer, and using DualContextWrapper
computing the orientation of justice utterances in the US Supreme Court
-
class
convokit.expected_context_framework.expected_context_model.
ExpectedContextModelTransformer
(context_field, output_prefix, vect_field, context_vect_field=None, n_svd_dims=25, snip_first_dim=True, n_clusters=8, cluster_on='utts', model=None, random_state=None, cluster_random_state=None)¶ Transformer that derives representations of terms and utterances in terms of their conversational context, i.e., context-utterances that occur near an utterance, or utterances containing a term. Typically, the conversational context consists of immediate replies (“forwards context”) or predecessors (“backwards context”), though this can be specified by the user via the context_field argument.
The underlying model in the transformer, implemented as the ExpectedContextModel class, is fitted given input training data consisting of pairs of utterances and context-utterances, represented as feature vectors (e.g., tf-idf reweighted term-document matrices), specified via the vect_field and context_vect_field arguments. This model is stored as the ec_model attribute of the transformer, and can be accessed as such. In the fit step, the model, which is based on latent semantic analysis (LSA), computes the following:
representations of terms and utterances in the training data, with respect to the context, along with representations of the context (which are derived in the underlying LSA step). The dimensionality of these representations is specified via the n_svd_dims argument (see also the snip_first_dim and random_state arguments). These can be accessed via the various get functions that the transformer provides.
a term-level statistic, “range”, measuring the variation in context-utterances associated with a term. One interpretation of this statistic is that it quantifies the “strengths of our expectations” of what reply a term typically gets, or what predecessors it typically follows.
a clustering of utterance, term and context representations. The resultant clusters can help interpret the representations the model derives, by highlighting salient groupings that emerge. The number of clusters is specified via the n_clusters argument (see also the cluster_on and cluster_random_state arguments); the print_clusters function can be called to inspect this output.
An instance of the transformer can be initialized with an instance of another, fitted transformer, via the model argument. This ensures that both transformers derive representations that are comparable, i.e., can be interpreted as being part of the same vector space, with distances between representations that are well-defined. As an example of when this might be useful, we may wish to compare representations derived with respect to expected replies, with representations pertaining to expected predecessors.
The transformer contains various functions to access term-level characterizations. In the transform step, it outputs vector representations of utterances, stored as <output_prefix>_repr in the corpus. It also outputs various attributes of utterances (names prefixed with <output_prefix>_), stored as metadata fields in each transformed utterance:
range: the range of the utterance
clustering.cluster: the name of the cluster the utterance has been assigned to
clustering.cluster_id_: the numerical ID of the cluster the utterance has been assigned to (from 0 to the number of clusters minus 1)
clustering.cluster_dist: the distance between the utterance representation and the centroid of its cluster
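As a rough illustration of the naming scheme above, the following sketch lists the fields one would expect to find for a given prefix. The helper expected_output_fields is hypothetical and not part of ConvoKit; it only concatenates the names documented above, and the prefix "fw" is an arbitrary example.

```python
def expected_output_fields(output_prefix):
    """Hypothetical helper (not part of ConvoKit): builds the field names
    described above for a given output_prefix."""
    attribute_suffixes = [
        "range",
        "clustering.cluster",
        "clustering.cluster_id_",
        "clustering.cluster_dist",
    ]
    return {
        # name under which the vector representations are stored
        "vectors": f"{output_prefix}_repr",
        # names of the per-utterance metadata fields
        "metadata": [f"{output_prefix}_{suffix}" for suffix in attribute_suffixes],
    }


fields = expected_output_fields("fw")
print(fields["vectors"])
for name in fields["metadata"]:
    print(name)
```

With the prefix "fw", the range of an utterance would therefore be looked up under the metadata field fw_range.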
- Parameters
context_field – the name of an utterance-level attribute containing the ID of the corresponding context-utterance. in particular, to use immediate predecessors as context, set context_field to ‘reply_to’. as another example, to use immediate replies, provided that utterances contain an attribute next_id containing the ID of their reply, set context_field to ‘next_id’.
output_prefix – the name of the attributes and vectors to write to in the transform step. the transformer outputs several fields, which will be prefixed with the given string.
vect_field – the name of the vectors to use as input vector representation for utterances, as stored in a corpus.
context_vect_field – the name of the vectors to use as input vector representations for context-utterances, as stored in a corpus. by default, the transformer will use the same vector representations as utterances, specified in vect_field. if you expect that utterances and context-utterances will differ in some way (e.g., they come from speakers in a conversation who play clearly delineated roles), then it’s a good idea to use a different input representation.
n_svd_dims – the dimensionality of the representations to derive (via LSA/SVD).
snip_first_dim – whether or not to remove the first dimension of the derived representations. by default this is set to True, since we’ve found that the first dimension tends to reflect term frequency, making the output less informative. Note that if snip_first_dim=True then in practice, we output n_svd_dims-1-dimensional representations.
n_clusters – the number of clusters to infer.
cluster_on – whether to cluster on utterance or term representations, (corresponding to values ‘utts’ or ‘terms’). By default, we infer clusters based on representations of the utterances from the training data, and then assign term and context-utterance representations to the resultant clusters. In some cases (e.g., if utterances are highly unstructured and lengthy) it might be better to cluster term representations first.
model – an existing, fitted ExpectedContextModelTransformer object to initialize with (optional)
random_state – the random seed to use in the LSA step (which calls a randomized implementation of SVD)
cluster_random_state – the random seed to use to infer clusters.
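The effect of snip_first_dim on the output dimensionality can be sketched in plain Python. This is only an illustration of the dimension count described above, not ConvoKit's implementation: the function snip and the toy vectors are made up, and any further processing the library may apply to the representations is omitted.

```python
def snip(vectors, snip_first_dim=True):
    """Illustrative only: drop the first coordinate of every row
    when snip_first_dim is True, otherwise return the rows unchanged."""
    if snip_first_dim:
        return [list(row[1:]) for row in vectors]
    return [list(row) for row in vectors]


n_svd_dims = 4  # small value chosen for readability; the documented default is 25
toy_reprs = [
    [0.9, 0.1, -0.3, 0.2],  # made-up 4-dimensional representations
    [0.8, -0.4, 0.1, 0.0],
]

snipped = snip(toy_reprs)
print(len(snipped[0]))                               # n_svd_dims - 1
print(len(snip(toy_reprs, snip_first_dim=False)[0]))  # n_svd_dims
```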
-
fit
(corpus, y=None, selector=<function ExpectedContextModelTransformer.<lambda>>, context_selector=<function ExpectedContextModelTransformer.<lambda>>)¶ Fits an ExpectedContextModelTransformer transformer over training data: derives representations of terms, utterances and contexts, range statistics for terms, and a clustering of the resultant representations.
- Parameters
corpus – Corpus containing training data
selector – a boolean function of signature filter(utterance) that determines which utterances will be considered in the fit step. defaults to using all utterances.
context_selector – a boolean function of signature filter(utterance) that determines which context-utterances will be considered in the fit step. defaults to using all utterances.
- Returns
None
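A selector is simply a function that takes an utterance and returns True or False. The sketch below uses a stand-in class, ToyUtterance, instead of a real ConvoKit Utterance, and assumes hypothetical metadata fields is_question and is_answer; it shows which utterances two such functions would let through. With a real corpus, the same functions could be passed as fit(corpus, selector=is_question, context_selector=is_answer).

```python
from dataclasses import dataclass, field


@dataclass
class ToyUtterance:
    """Stand-in for a ConvoKit Utterance (for illustration only)."""
    id: str
    text: str
    meta: dict = field(default_factory=dict)


def is_question(utt):
    # selector: keep only utterances flagged as questions (hypothetical field)
    return utt.meta.get("is_question", False)


def is_answer(utt):
    # context_selector: keep only context-utterances flagged as answers
    return utt.meta.get("is_answer", False)


utterances = [
    ToyUtterance("q1", "Will the minister explain the delay?", {"is_question": True}),
    ToyUtterance("a1", "The delay was unavoidable.", {"is_answer": True}),
    ToyUtterance("q2", "Why was it unavoidable?", {"is_question": True}),
    ToyUtterance("x1", "Order, order.", {}),
]

fit_utt_ids = [utt.id for utt in utterances if is_question(utt)]
context_utt_ids = [utt.id for utt in utterances if is_answer(utt)]
print(fit_utt_ids, context_utt_ids)
```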
-
transform
(corpus, selector=<function ExpectedContextModelTransformer.<lambda>>)¶ Computes vector representations, ranges, and cluster assignments for utterances in a corpus.
- Parameters
corpus – Corpus
selector – a boolean function of signature filter(utterance) that determines which utterances to transform. defaults to all utterances.
- Returns
the Corpus, with per-utterance representations, ranges and cluster assignments.
-
transform_utterance
(utt)¶ Computes vector representation, range, and cluster assignment for a single utterance. Note that the utterance must contain the input representation as a metadata field, specified by what was passed into the constructor as the vect_field argument. Will write all of these characterizations (including vectors) to the utterance’s metadata; attribute names are prefixed with the output_prefix constructor argument.
- Parameters
utt – Utterance
- Returns
the utterance, with per-utterance representation, range and cluster assignments.
-
compute_utt_ranges
(corpus, selector=<function ExpectedContextModelTransformer.<lambda>>)¶ Computes utterance ranges.
- Parameters
corpus – Corpus
selector – determines which utterances to compute ranges for.
- Returns
the Corpus, with per-utterance ranges.
-
transform_context_utts
(corpus, selector=<function ExpectedContextModelTransformer.<lambda>>)¶ Computes representations of context-utterances, along with cluster assignments.
- Parameters
corpus – Corpus
selector – determines which utterances to compute representations for.
- Returns
the Corpus, with per-utterance representations and cluster assignments.
-
fit_clusters
(n_clusters='default', random_state='default')¶ Infers a clustering of term or utterance representations (specified by the cluster_on argument used to initialize the transformer) on the training data originally used to fit the transformer. Can be called to infer a different number of clusters than what was initially specified.
- Parameters
n_clusters – number of clusters to infer. defaults to the number of clusters specified when initializing the transformer.
random_state – random seed used to infer clusters. defaults to the random seed used to initialize the transformer.
- Returns
None
-
compute_clusters
(corpus, selector=<function ExpectedContextModelTransformer.<lambda>>, is_context=False)¶ Assigns utterances in a corpus, for which expected context representations have already been computed, to inferred clusters.
- Parameters
corpus – Corpus
selector – determines which utterances to compute clusterings for
is_context – whether to treat input data as utterances, or context-utterances
- Returns
a DataFrame containing cluster assignment information for each utterance.
-
set_cluster_names
(cluster_names)¶ Assigns names to inferred clusters. May be called after inspecting the output of print_clusters.
- Parameters
cluster_names – a list of names, where cluster_names[i] is the name of the cluster with cluster_id_ i.
- Returns
None
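The correspondence between cluster_names and cluster_id_ is positional. A minimal sketch, with made-up names and cluster IDs rather than real model output:

```python
# Made-up names; in practice these would be chosen after inspecting print_clusters.
cluster_names = ["agreement", "clarification", "challenge"]


def name_for_cluster(cluster_id, names):
    """cluster_names[i] is the name of the cluster with cluster_id_ i."""
    return names[cluster_id]


assigned_ids = [2, 0, 0, 1]  # hypothetical cluster_id_ values for four utterances
assigned_names = [name_for_cluster(i, cluster_names) for i in assigned_ids]
print(assigned_names)
```

The list passed to set_cluster_names should therefore contain one name per cluster, in order of cluster ID.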
-
get_cluster_names
()¶ Returns the names of the inferred clusters.
- Returns
list of cluster names where cluster_names[i] is the name of the cluster with cluster_id_ i.
-
print_clusters
(k=10, max_chars=1000, corpus=None)¶ Prints representative terms, utterances and context-utterances for each inferred cluster. Can be inspected to help interpret the transformer’s output. By default, will only print out terms and context terms; if the corpus containing the training data is passed in, will output utterances and context-utterances as well.
- Parameters
k – number of examples to print out.
max_chars – maximum number of characters per utterance/context-utterance to print. Can be adjusted to control the size of the output.
corpus – optional, the corpus that the transformer was trained on. if set, will print example utterances and context-utterances as well as terms.
- Returns
None
-
print_cluster_stats
()¶ Returns a Pandas dataframe containing the % of terms, context terms, and training utterances/context-utterances that have been assigned to each cluster.
- Returns
dataframe containing cluster statistics
-
summarize
(k=10, max_chars=1000, corpus=None)¶ Wrapper function to print inferred clusters and statistics about their sizes.
- Parameters
k – number of examples to print out.
max_chars – maximum number of characters per utterance/context-utterance to print. Can be adjusted to control the size of the output.
corpus – optional, the corpus that the transformer was trained on. if set, will print example utterances and context-utterances as well as terms.
- Returns
None
-
get_terms
()¶ Gets the names of the terms for which the transformer has computed representations.
- Returns
list of terms
-
get_term_ranges
()¶ Gets the range statistics of terms.
- Returns
list of term ranges. order corresponds to the ordering of terms returned via get_terms().
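Because the two lists share an ordering, terms can be paired with their ranges by position. The sketch below uses made-up stand-ins for the outputs of get_terms() and get_term_ranges(); the values are not real model output.

```python
# Stand-ins for transformer.get_terms() and transformer.get_term_ranges()
terms = ["thank", "why", "explain", "agree"]
term_ranges = [0.42, 0.91, 0.77, 0.35]  # made-up values

# Pair each term with its range and sort from highest to lowest range.
ranked = sorted(zip(terms, term_ranges), key=lambda pair: pair[1], reverse=True)

for term, term_range in ranked:
    print(f"{term}\t{term_range:.2f}")
```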
-
get_term_reprs
()¶ Gets the derived representations of terms.
- Returns
numpy array containing term representations. order of rows corresponds to the ordering of terms returned via get_terms.
-
get_context_terms
()¶ Gets the names of the context terms for which the transformer has computed (LSA) representations.
- Returns
list of context terms
-
get_context_term_reprs
()¶ Gets the derived (LSA) representations of context terms.
- Returns
numpy array containing term representations. order of rows corresponds to the ordering of terms returned via get_context_terms.
-
get_clustering
()¶ Returns a dictionary containing various objects pertaining to the inferred clustering, with fields as follows:
km_obj: the fitted KMeans object
utts: a Pandas dataframe of cluster assignments for utterances from the training data
terms: a dataframe of cluster assignments for terms
context_utts: dataframe of cluster assignments for context-utterances from the training data
context_terms: dataframe of cluster assignments for context terms.
- Returns
dictionary containing clustering information
-
load
(dirname)¶ Loads a model from disk.
- Parameters
dirname – directory to read model from
- Returns
None
-
dump
(dirname)¶ Writes a model to disk.
- Parameters
dirname – directory to write model to.
- Returns
None
-
class
convokit.expected_context_framework.expected_context_model.
ExpectedContextModel
(n_svd_dims=25, snip_first_dim=True, n_clusters=8, context_U=None, context_V=None, context_s=None, model=None, context_terms=None, cluster_on='utts', random_state=None, cluster_random_state=None)¶ Model that derives representations of terms and utterances in terms of their conversational context, i.e., context-utterances that occur near an utterance, or utterances containing a term. Typically, the conversational context consists of immediate replies (“forwards context”) or predecessors (“backwards context”), though this can be specified by the user. Can be used in ConvoKit through the ExpectedContextModelTransformer transformer; see documentation of that transformer for further details.
-
class
convokit.expected_context_framework.expected_context_model.
ClusterWrapper
(n_clusters, cluster_names=None, random_state=None)¶ Wrapper that performs K-Means clustering. Handles model loading and dumping, formats clustering output as dataframes for convenience, and keeps track of names that an end-user can assign to clusters.
-
class
convokit.expected_context_framework.dual_context_wrapper.
DualContextWrapper
(context_fields, output_prefixes, vect_field, context_vect_field=None, wrapper_output_prefix='', n_svd_dims=25, snip_first_dim=True, n_clusters=8, cluster_on='utts', random_state=None, cluster_random_state=None)¶ Transformer that derives and compares characterizations of terms and utterances with respect to two different choices of conversational context. Designed in particular to contrast replies and predecessors, though other choices of context are also possible.
This is a wrapper that encompasses two instances of ExpectedContextModelTransformer, stored in the ec_models attribute. It computes two particular comparative term-level statistics, orientation and shift, stored as the term_orientations and term_shifts attributes. It also computes these statistics at the utterance level in the transform step.
- Parameters
context_fields – list containing the names of the utterance-level attributes containing the IDs of the context-utterances used by each of the ExpectedContextModelTransformer instances.
output_prefixes – list containing the names of the attributes and vectors that each ExpectedContextModelTransformer instance will write to in the transform step.
vect_field – the name of the vectors to use as input vector representation for utterances, as stored in a corpus.
context_vect_field – the name of the vectors to use as input vector representations for context-utterances, as stored in a corpus. by default, the transformer will use the same vector representations as utterances, specified in vect_field. if you expect that utterances and context-utterances will differ in some way (e.g., they come from speakers in a conversation who play clearly delineated roles), then it’s a good idea to use a different input representation.
wrapper_output_prefix – the metadata fields where the utterance-level orientation and shift statistics are stored. By default, these attributes are stored as orn and shift in the metadata; if wrapper_output_prefix is specified, then they are stored as <wrapper_output_prefix>_orn (orientation) and <wrapper_output_prefix>_shift (shift).
n_svd_dims – the dimensionality of the representations to derive (via LSA/SVD).
snip_first_dim – whether or not to remove the first dimension of the derived representations. by default this is set to True, since we’ve found that the first dimension tends to reflect term frequency, making the output less informative. Note that if snip_first_dim=True then in practice, we output n_svd_dims-1-dimensional representations.
n_clusters – the number of clusters to infer.
cluster_on – whether to cluster on utterance or term representations, (corresponding to values ‘utts’ or ‘terms’). By default, we infer clusters based on representations of the utterances from the training data, and then assign term and context-utterance representations to the resultant clusters. In some cases (e.g., if utterances are highly unstructured and lengthy) it might be better to cluster term representations first.
random_state – the random seed to use in the LSA step (which calls a randomized implementation of SVD)
cluster_random_state – the random seed to use to infer clusters.
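The naming rule for the utterance-level orientation and shift fields, as described for wrapper_output_prefix above, can be sketched as follows. The helper wrapper_field_names is hypothetical and not part of ConvoKit, and the prefix "fw_bk" is an arbitrary example.

```python
def wrapper_field_names(wrapper_output_prefix=""):
    """Hypothetical helper: returns the (orientation, shift) metadata field
    names implied by the description of wrapper_output_prefix."""
    if wrapper_output_prefix:
        return (
            f"{wrapper_output_prefix}_orn",
            f"{wrapper_output_prefix}_shift",
        )
    # default: no prefix, plain field names
    return ("orn", "shift")


print(wrapper_field_names())         # default field names
print(wrapper_field_names("fw_bk"))  # prefixed field names
```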
-
fit
(corpus, y=None, selector=<function DualContextWrapper.<lambda>>, context_selector=<function DualContextWrapper.<lambda>>)¶ Fits a transformer over training data: fits the two ExpectedContextModelTransformer instances, and computes term-level orientation and shift.
- Parameters
corpus – Corpus containing training data
selector – a boolean function of signature filter(utterance) that determines which utterances will be considered in the fit step. defaults to using all utterances.
context_selector – a boolean function of signature filter(utterance) that determines which context-utterances will be considered in the fit step. defaults to using all utterances.
- Returns
None
-
transform
(corpus, selector=<function DualContextWrapper.<lambda>>)¶ Computes vector representations, ranges, and cluster assignments for utterances in a corpus, using the two ExpectedContextModelTransformer instances. Also computes utterance-level orientation and shift.
- Parameters
corpus – Corpus
selector – a boolean function of signature filter(utterance) that determines which utterances to transform. defaults to all utterances.
- Returns
the Corpus, with per-utterance attributes.
-
transform_utterance
(utt)¶ Computes vector representations, ranges, and cluster assignments for an utterance, using the two ExpectedContextModelTransformer instances. Also computes utterance-level orientation and shift. Note that the utterance must contain the input representation as a metadata field, specified by what was passed into the constructor as the vect_field argument. Will write all of these characterizations (including vectors) to the utterance’s metadata.
- Parameters
utt – Utterance
- Returns
the utterance, with per-utterance attributes.
-
get_terms
()¶ Gets the names of the terms for which the transformer has computed representations.
- Returns
list of terms
-
get_term_df
()¶ Gets a Pandas dataframe containing term-level statistics computed by the transformer (shift, orientation) and its constituent ExpectedContextModelTransformer instances (ranges).
- Returns
dataframe of term-level statistics
-
summarize
(k=10, max_chars=1000, corpus=None)¶ For each constituent ExpectedContextModelTransformer, prints inferred clusters and statistics about their sizes.
- Parameters
k – number of examples to print out.
max_chars – maximum number of characters per utterance/context-utterance to print. Can be adjusted to control the size of the output.
corpus – optional, the corpus that the transformer was trained on. if set, will print example utterances and context-utterances as well as terms.
- Returns
None
-
load
(dirname, model_dirs=None)¶ Loads a model from disk.
- Parameters
dirname – directory to read model from
model_dirs – optional list containing the directories (relative to dirname) in which each ExpectedContextModelTransformer is stored. defaults to the output_prefixes argument passed at initialization.
- Returns
None
-
dump
(dirname)¶ Writes a model to disk. Will store each ExpectedContextModelTransformer in a separate directory with names given by the output_prefixes argument passed at initialization.
- Parameters
dirname – directory to write model to.
- Returns
None
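The directory layout implied by dump and load can be sketched with the standard library. This is an assumption-laden illustration, not ConvoKit code: the helper transformer_dirs is hypothetical, as are the directory name "my_model" and the prefixes "fw" and "bk".

```python
import os


def transformer_dirs(dirname, output_prefixes, model_dirs=None):
    """Hypothetical helper: where each constituent transformer would live.

    By default the subdirectories are named after output_prefixes;
    model_dirs (relative to dirname) overrides these names.
    """
    if model_dirs is None:
        model_dirs = output_prefixes
    return [os.path.join(dirname, model_dir) for model_dir in model_dirs]


default_dirs = transformer_dirs("my_model", ["fw", "bk"])
custom_dirs = transformer_dirs("my_model", ["fw", "bk"], model_dirs=["forwards", "backwards"])
print(default_dirs)
print(custom_dirs)
```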
-
class
convokit.expected_context_framework.expected_context_model_pipeline.
ExpectedContextModelPipeline
(context_field, output_prefix, text_field, context_text_field=None, text_pipe=None, context_text_pipe=None, tfidf_params={}, context_tfidf_params=None, share_tfidf_models=True, min_terms=0, context_min_terms=None, n_svd_dims=25, snip_first_dim=True, n_clusters=8, cluster_on='utts', ec_model=None, random_state=None, cluster_random_state=None)¶ Wrapper class implementing a pipeline that derives characterizations of terms and utterances in terms of their conversational context. The pipeline handles the following steps:
processing input text (via a pipeline supplied by the user in the text_pipe argument);
transforming text to input representation (via ColNormedTfidfTransformer);
deriving characterizations (via ExpectedContextModelTransformer)
The ColNormedTfidfTransformer components are stored as the tfidf_model and context_tfidf_model attributes of the class; the ExpectedContextModelTransformer is stored as the ec_model attribute.
For further details, see the ColNormedTfidfTransformer and ExpectedContextModelTransformer classes.
- Parameters
context_field – the name of an utterance-level attribute containing the ID of the corresponding context-utterance. in particular, to use immediate predecessors as context, set context_field to ‘reply_to’. as another example, to use immediate replies, provided that utterances contain an attribute next_id containing the ID of their reply, set context_field to ‘next_id’.
output_prefix – the name of the attributes and vectors to write to in the transform step. the transformer outputs several fields, which will be prefixed with the given string.
text_field – the name of the utterance-level attribute containing the text to use as input.
context_text_field – the name of the utterance-level attribute containing the text to use as input for context-utterances. by default, is equivalent to text_field.
text_pipe – a ConvoKit Pipeline object used to compute the contents of text_field. defaults to populating the text_field attribute of each utterance utt with utt.text.
context_text_pipe – a ConvoKit Pipeline object used to compute the contents of context_text_field; by default equivalent to text_pipe
tfidf_params – a dictionary specifying parameters to be passed to the ColNormedTfidfTransformer object to compute input representations of utterances.
context_tfidf_params – a dictionary specifying parameters to be passed to the ColNormedTfidfTransformer object to compute input representations of context-utterances. equivalent to tfidf_params by default.
share_tfidf_models – whether or not to use the same ColNormedTfidfTransformer for both utterances and context-utterances. defaults to True.
min_terms – the minimum number of terms in the vocabulary, derived by ColNormedTfidfTransformer, that an utterance must contain for it to be considered in fitting and transforming the underlying ExpectedContextModelTransformer object. defaults to 0, meaning the transformer will consider all utterances.
context_min_terms – minimum number of terms in the vocabulary for a context-utterance to be considered in fitting and transforming the underlying ExpectedContextModelTransformer object. equivalent to min_terms by default.
n_svd_dims – the dimensionality of the representations to derive (via LSA/SVD).
snip_first_dim – whether or not to remove the first dimension of the derived representations. by default this is set to True, since we’ve found that the first dimension tends to reflect term frequency, making the output less informative. Note that if snip_first_dim=True then in practice, we output n_svd_dims-1-dimensional representations.
n_clusters – the number of clusters to infer.
cluster_on – whether to cluster on utterance or term representations, (corresponding to values ‘utts’ or ‘terms’). By default, we infer clusters based on representations of the utterances from the training data, and then assign term and context-utterance representations to the resultant clusters. In some cases (e.g., if utterances are highly unstructured and lengthy) it might be better to cluster term representations first.
ec_model – an existing, fitted ExpectedContextModelPipeline object to initialize with (optional)
random_state – the random seed to use in the LSA step (which calls a randomized implementation of SVD)
cluster_random_state – the random seed to use to infer clusters.
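The min_terms and context_min_terms thresholds described above can be sketched as a simple filter. This is an illustration under stated assumptions rather than the library's code: it assumes that an utterance is counted by its number of distinct in-vocabulary terms and that the threshold is inclusive; the helper names, the toy vocabulary, and whitespace tokenization are all made up.

```python
def passes_min_terms(tokens, vocabulary, min_terms=0):
    """Assumed rule: keep an utterance if it contains at least
    min_terms distinct terms from the vocabulary."""
    return len(set(tokens) & set(vocabulary)) >= min_terms


def resolve_context_min_terms(min_terms=0, context_min_terms=None):
    """context_min_terms is equivalent to min_terms unless given explicitly."""
    return min_terms if context_min_terms is None else context_min_terms


vocabulary = {"minister", "explain", "delay", "budget"}  # toy vocabulary
short_utt = "order order".split()
long_utt = "will the minister explain the delay".split()

print(passes_min_terms(short_utt, vocabulary, min_terms=0))  # the default keeps everything
print(passes_min_terms(short_utt, vocabulary, min_terms=2))
print(passes_min_terms(long_utt, vocabulary, min_terms=2))
print(resolve_context_min_terms(min_terms=2))
```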
-
fit
(corpus, y=None, selector=<function ExpectedContextModelPipeline.<lambda>>, context_selector=<function ExpectedContextModelPipeline.<lambda>>)¶ Fits an ExpectedContextModelPipeline over training data: derives input and latent representations of terms, utterances and contexts, range statistics for terms, and a clustering of the resultant representations.
- Parameters
corpus – Corpus containing training data
selector – a boolean function of signature filter(utterance) that determines which utterances will be considered in the fit step. defaults to using all utterances, subject to min_terms parameter passed at initialization.
context_selector – a boolean function of signature filter(utterance) that determines which context-utterances will be considered in the fit step. defaults to using all utterances, subject to context_min_terms parameter passed at initialization.
- Returns
None
-
transform
(corpus, y=None, selector=<function ExpectedContextModelPipeline.<lambda>>)¶ Computes vector representations, ranges, and cluster assignments for utterances in a corpus.
- Parameters
corpus – Corpus
selector – a boolean function of signature filter(utterance) that determines which utterances to transform.
- Returns
the Corpus, with per-utterance representations, ranges and cluster assignments.
-
transform_utterance
(utt)¶ Computes vector representation, range, and cluster assignment for a single utterance, which can be a ConvoKit Utterance or a string. Will return an Utterance object and write all of these characterizations (including vectors) to the utterance’s metadata; attribute names are prefixed with the output_prefix constructor argument.
- Parameters
utt – Utterance or string
- Returns
the utterance, with per-utterance representation, range and cluster assignments.
-
summarize
(k=10, max_chars=1000, corpus=None)¶ Prints inferred clusters and statistics about their sizes.
- Parameters
k – number of examples to print out.
max_chars – maximum number of characters per utterance/context-utterance to print. Can be adjusted to control the size of the output.
corpus – optional, the corpus that the transformer was trained on. if set, will print example utterances and context-utterances as well as terms.
- Returns
None
-
set_cluster_names
(names)¶ Assigns names to inferred clusters. May be called after inspecting the output of print_clusters.
- Parameters
names – a list of names, where names[i] is the name of the cluster with cluster_id_ i.
- Returns
None
-
get_cluster_names
()¶ Returns the names of the inferred clusters.
- Returns
list of cluster names where cluster_names[i] is the name of the cluster with cluster_id_ i.
-
get_terms
()¶ Gets the names of the terms for which the transformer has computed representations.
- Returns
list of terms
-
load
(dirname, model_dirs=None)¶ Loads a model from disk.
- Parameters
dirname – directory to read model from
model_dirs – optional list containing the directories (relative to dirname) in which each component is stored. the order of the list is as follows: [the ExpectedContextModelTransformer, the utterance ColNormedTfidfTransformer, the context-utterance ColNormedTfidfTransformer (if share_tfidf_models is set to False at initialization)]. defaults to [‘ec_model’, ‘tfidf_model’, ‘context_tfidf_model’].
- Returns
None
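The ordering of model_dirs can be made concrete with a small sketch. The helper component_dirs is hypothetical and not part of ConvoKit; it maps each component (named here after the attributes ec_model, tfidf_model and context_tfidf_model) to a directory, following the order and defaults described above. The directory name "my_pipeline" is an arbitrary example.

```python
import os

DEFAULT_MODEL_DIRS = ["ec_model", "tfidf_model", "context_tfidf_model"]


def component_dirs(dirname, share_tfidf_models=True, model_dirs=None):
    """Hypothetical helper: maps each pipeline component to its directory."""
    if model_dirs is None:
        model_dirs = DEFAULT_MODEL_DIRS
    components = ["ec_model", "tfidf_model"]
    if not share_tfidf_models:
        # a separate context tf-idf model only exists when models are not shared
        components.append("context_tfidf_model")
    # zip stops at the shorter list, so an unused third directory is ignored
    return {
        component: os.path.join(dirname, model_dir)
        for component, model_dir in zip(components, model_dirs)
    }


shared = component_dirs("my_pipeline")
separate = component_dirs("my_pipeline", share_tfidf_models=False)
print(sorted(shared))
print(sorted(separate))
```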
-
dump
(dirname)¶ Writes a model to disk.
- Parameters
dirname – directory to write model to.
- Returns
None
-
class
convokit.expected_context_framework.expected_context_model_pipeline.
DualContextPipeline
(context_fields, output_prefixes, text_field, context_text_field=None, wrapper_output_prefix='', text_pipe=None, context_text_pipe=None, tfidf_params={}, context_tfidf_params=None, share_tfidf_models=True, min_terms=0, context_min_terms=None, n_svd_dims=25, snip_first_dim=True, n_clusters=8, cluster_on='utts', random_state=None, cluster_random_state=None)¶ Wrapper class implementing a pipeline that derives characterizations of terms and utterances in terms of two choices of conversational context. The pipeline handles the following steps:
processing input text (via a pipeline supplied by the user in the text_pipe argument);
transforming text to input representation (via ColNormedTfidfTransformer);
deriving characterizations (via DualContextWrapper)
The ColNormedTfidfTransformer components are stored as the tfidf_model and context_tfidf_model attributes of the class; the DualContextWrapper is stored as the dualmodel attribute.
For further details, see the ColNormedTfidfTransformer and DualContextWrapper classes.
- Parameters
context_fields – list containing the names of the utterance-level attributes containing the IDs of the corresponding context-utterances, one for each of the two choices of context. for example, to use immediate predecessors as one context, include ‘reply_to’; to use immediate replies as the other, provided that utterances contain an attribute next_id containing the ID of their reply, include ‘next_id’.
output_prefixes – list containing the names of the attributes and vectors that the DualContextWrapper component will write to in the transform step.
text_field – the name of the utterance-level attribute containing the text to use as input.
context_text_field – the name of the utterance-level attribute containing the text to use as input for context-utterances. by default, is equivalent to text_field.
wrapper_output_prefix – the metadata fields where the utterance-level orientation and shift statistics are stored. By default, these attributes are stored as orn and shift in the metadata; if wrapper_output_prefix is specified, then they are stored as <wrapper_output_prefix>_orn (orientation) and <wrapper_output_prefix>_shift (shift).
text_pipe – a ConvoKit Pipeline object used to compute the contents of text_field. defaults to populating the text_field attribute of each utterance utt with utt.text.
context_text_pipe – a ConvoKit Pipeline object used to compute the contents of context_text_field; by default equivalent to text_pipe
tfidf_params – a dictionary specifying parameters to be passed to the ColNormedTfidfTransformer object to compute input representations of utterances.
context_tfidf_params – a dictionary specifying parameters to be passed to the ColNormedTfidfTransformer object to compute input representations of context-utterances. equivalent to tfidf_params by default.
share_tfidf_models – whether or not to use the same ColNormedTfidfTransformer for both utterances and context-utterances. defaults to True.
min_terms – the minimum number of terms in the vocabulary, derived by ColNormedTfidfTransformer, that an utterance must contain for it to be considered in fitting and transforming the underlying ExpectedContextModelTransformer object. defaults to 0, meaning the transformer will consider all utterances.
context_min_terms – minimum number of terms in the vocabulary for a context-utterance to be considered in fitting and transforming the underlying ExpectedContextModelTransformer object. equivalent to min_terms by default.
n_svd_dims – the dimensionality of the representations to derive (via LSA/SVD).
snip_first_dim – whether or not to remove the first dimension of the derived representations. by default this is set to True, since we’ve found that the first dimension tends to reflect term frequency, making the output less informative. Note that if snip_first_dim=True then in practice, we output n_svd_dims-1-dimensional representations.
n_clusters – the number of clusters to infer.
cluster_on – whether to cluster on utterance or term representations (corresponding to values ‘utts’ or ‘terms’). By default, we infer clusters based on representations of the utterances from the training data, and then assign term and context-utterance representations to the resultant clusters. In some cases (e.g., if utterances are highly unstructured and lengthy) it might be better to cluster term representations first.
random_state – the random seed to use in the LSA step (which calls a randomized implementation of SVD).
cluster_random_state – the random seed to use to infer clusters.
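Several of the parameters above fall back to another parameter when left unset. The following is a minimal sketch of those documented fallbacks, written in plain Python. The helper resolve_defaults and the keys of the dictionary it returns are hypothetical names introduced for illustration; this is not ConvoKit code and it does not construct a pipeline.

```python
def resolve_defaults(text_field, context_text_field=None,
                     min_terms=0, context_min_terms=None,
                     n_svd_dims=25, snip_first_dim=True,
                     wrapper_output_prefix=""):
    """Hypothetical helper mirroring the fallbacks described in the parameter list."""
    resolved = {
        # context_text_field is equivalent to text_field by default
        "context_text_field": text_field if context_text_field is None else context_text_field,
        # context_min_terms is equivalent to min_terms by default
        "context_min_terms": min_terms if context_min_terms is None else context_min_terms,
        # with snip_first_dim=True the output has n_svd_dims - 1 dimensions
        "output_dims": n_svd_dims - 1 if snip_first_dim else n_svd_dims,
    }
    # orientation and shift are stored as orn and shift, or as
    # <wrapper_output_prefix>_orn and <wrapper_output_prefix>_shift
    if wrapper_output_prefix:
        resolved["orn_attr"] = wrapper_output_prefix + "_orn"
        resolved["shift_attr"] = wrapper_output_prefix + "_shift"
    else:
        resolved["orn_attr"] = "orn"
        resolved["shift_attr"] = "shift"
    return resolved


settings = resolve_defaults("text", wrapper_output_prefix="fw_bk")
print(settings["output_dims"], settings["orn_attr"], settings["shift_attr"])
```

The prefix fw_bk is an arbitrary example value, not a name used by the library.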
-
fit
(corpus, y=None, selector=<function DualContextPipeline.<lambda>>, context_selector=<function DualContextPipeline.<lambda>>)¶ Fits the model over training data.
- Parameters
corpus – Corpus containing training data
selector – a boolean function of signature filter(utterance) that determines which utterances will be considered in the fit step. defaults to using all utterances, subject to min_terms parameter passed at initialization.
context_selector – a boolean function of signature filter(utterance) that determines which context-utterances will be considered in the fit step. defaults to using all utterances, subject to context_min_terms parameter passed at initialization.
- Returns
None
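As a sketch of what selector and context_selector look like, the example below defines two boolean functions of a single utterance and applies them to a few stand-in objects. FakeUtterance is a hypothetical stand-in for a ConvoKit Utterance, and the metadata keys is_question and is_answer are assumptions made for illustration; a real corpus would use whatever attributes it actually carries.

```python
from dataclasses import dataclass, field


@dataclass
class FakeUtterance:
    """Stand-in for a ConvoKit Utterance; only the fields used below."""
    id: str
    text: str
    meta: dict = field(default_factory=dict)


def is_question(utt):
    # selector: keep only utterances flagged as questions (hypothetical key)
    return bool(utt.meta.get("is_question", False))


def is_answer(utt):
    # context_selector: keep only context-utterances flagged as answers (hypothetical key)
    return bool(utt.meta.get("is_answer", False))


utterances = [
    FakeUtterance("q1", "Will the minister confirm the figures?", {"is_question": True}),
    FakeUtterance("a1", "I can confirm them.", {"is_answer": True}),
    FakeUtterance("x1", "Order, order.", {}),
]

# Which utterances each filter would admit to the fit step.
selected = [u.id for u in utterances if is_question(u)]
selected_context = [u.id for u in utterances if is_answer(u)]
print(selected, selected_context)
```

With a real corpus, functions of this shape would be passed as the selector and context_selector arguments of fit; the min_terms and context_min_terms thresholds set at initialization still apply on top of them.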
-
transform
(corpus, y=None, selector=<function DualContextPipeline.<lambda>>)¶ Computes vector representations and statistics for utterances in a corpus, using the DualContextWrapper component.
- Parameters
corpus – Corpus
selector – a boolean function of signature filter(utterance) that determines which utterances to transform. defaults to all utterances.
- Returns
the Corpus, with per-utterance attributes.
-
transform_utterance
(utt)¶ Computes representations and statistics for a single utterance, which can be a ConvoKit Utterance or a string. Will return an Utterance object and write all of these characterizations (including vectors) to the utterance’s metadata; attribute names are prefixed with the output_prefix constructor argument.
- Parameters
utt – Utterance or string
- Returns
the utterance, with per-utterance representation, range and cluster assignments.
-
summarize
(k=10, max_chars=1000, corpus=None)¶ Prints inferred clusters and statistics about their sizes, for each component in the underlying DualContextWrapper.
- Parameters
k – number of examples to print out.
max_chars – maximum number of characters per utterance/context-utterance to print. Can be adjusted to control the size of the output.
corpus – optional, the corpus that the transformer was trained on. if set, will print example utterances and context-utterances as well as terms.
- Returns
None
-
get_terms
()¶ Gets the names of the terms for which the transformer has computed representations.
- Returns
list of terms
-
get_term_df
()¶ Gets a Pandas dataframe containing term-level statistics computed by the transformer (shift, orientation, ranges).
- Returns
dataframe of term-level statistics
-
load
(dirname, model_dirs=None)¶ Loads a model from disk.
- Parameters
dirname – directory to read model from
model_dirs – optional list containing the directories (relative to dirname) in which each component is stored. the order of the list is as follows: [the DualContextWrapper components, the utterance ColNormedTfidfTransformer, the context-utterance ColNormedTfidfTransformer (if share_tfidf_models is set to False at initialization)]. defaults to [output_prefixes[0], output_prefixes[1], ‘tfidf_model’, ‘context_tfidf_model’] where output_prefixes is passed at initialization.
- Returns
None
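To make the directory layout concrete, here is a minimal sketch of how the entries of model_dirs map onto paths under dirname, using the default list given above. The helper component_paths is hypothetical and only joins path strings; it does not read anything from disk or call ConvoKit, and the directory and prefix names in the usage line are example values.

```python
import os


def component_paths(dirname, output_prefixes, model_dirs=None):
    """Hypothetical helper: resolve each component directory relative to dirname."""
    if model_dirs is None:
        # default order documented for load
        model_dirs = [output_prefixes[0], output_prefixes[1],
                      "tfidf_model", "context_tfidf_model"]
    return [os.path.join(dirname, d) for d in model_dirs]


for path in component_paths("saved_model", ["fw", "bk"]):
    print(path)
```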
-
dump
(dirname)¶ Writes a model to disk.
- Parameters
dirname – directory to write model to.
- Returns
None