Prompt Types

Note: this is an older method that we keep in the ConvoKit library to reflect the content of past publications, and for backwards compatibility. For a modified and more general variant of the method, see the ExpectedContextModel functionality.

Implements the prompt type model described in this paper.

Example usage: an end-to-end pipeline to infer question types in the British Parliament; a more detailed exploration of additional options for using the module; understanding the use of conversational prompts in conversations gone awry on Wikipedia.

class convokit.prompt_types.promptTypes.PromptTypes(prompt_field, reference_field, output_field, n_types=8, prompt_transform_field=None, reference_transform_field=None, prompt__tfidf_min_df=100, prompt__tfidf_max_df=0.1, reference__tfidf_min_df=100, reference__tfidf_max_df=0.1, snip_first_dim=True, svd__n_components=25, max_dist=0.9, random_state=None, verbosity=0)

Model that infers a vector representation of utterances in terms of the responses that similar utterances tend to prompt, as well as types of rhetorical intentions encapsulated by utterances in a corpus, in terms of their anticipated responses (operationalized as k-means clusters of vectors).

Under the surface, the model takes as input pairs of prompts and responses during the fit step. In this stage the following subcomponents are involved:

  1. a prompt embedding model that will learn the vector representations;

  2. a prompt type model that learns a clustering of these representations.

The model can transform individual (unpaired) utterances in the transform step. While the focus is on representing properties of prompts, as a side-effect the model can also compute representations that encapsulate properties of responses and assign responses to prompt types (as “typical responses” to the prompts in that type).

Internally, the model contains the following elements:
  • prompt_embedding_model: stores models that compute the vector representations. Includes tf-idf models that convert the prompt and response input to term-document matrices, an SVD model that produces a low-dimensional representation of responses and prompts, and vector representations of prompt and response terms

  • type_models: stores kmeans models along with type assignments of prompt and response terms

  • train_results: stores the vector representations of the corpus used to train the model in the fit step

  • train_types: stores the type assignments of the corpus used in the fit step

The transformer will output several attributes of an utterance (names prefixed with <output_field>__). If the utterance is a prompt (in the default case, if it has a response), then the following attributes will be output.
  • prompt_repr: a vector representation of the utterance (stored as a corpus-wide matrix, or in the metadata of an individual utterance if transform_utterance is called)

  • prompt_dists.<number of types>: a vector storing the distance between the utterance vector and the centroid of each k-means cluster (stored as a corpus-wide matrix, or in the metadata of an individual utterance if transform_utterance is called)

  • prompt_type.<number of types>: the index of the type the utterance is assigned to

  • prompt_type_dist.<number of types>: the distance from the vector representation to the centroid of the assigned type

If the utterance is a response to a previous utterance, then the utterance will also be annotated with an analogous set of attributes denoting its response representation and type. For downstream tasks, a reasonable first step is to only look at the prompt-side representations.
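
The relationship between the prompt-side attributes listed above can be sketched in plain Python. This is an illustration only, not the library's code: the helper describe_prompt, the output field name "pt", the toy centroids, and the use of Euclidean distance are all assumptions made for the example.

```python
import math


def describe_prompt(vector, centroids, output_field="pt", max_dist=0.9):
    """Build the prompt-side attributes for one utterance vector (illustrative only)."""
    n_types = len(centroids)
    # Distance from the utterance vector to every cluster centroid
    # (Euclidean distance is an assumption of this sketch).
    dists = [math.dist(vector, c) for c in centroids]
    # Index and distance of the closest centroid.
    best = min(range(n_types), key=dists.__getitem__)
    best_dist = dists[best]
    # A vector farther than max_dist from every centroid gets the null type -1.
    type_id = best if best_dist <= max_dist else -1
    prefix = f"{output_field}__"
    return {
        f"{prefix}prompt_repr": list(vector),
        f"{prefix}prompt_dists.{n_types}": dists,
        f"{prefix}prompt_type.{n_types}": type_id,
        f"{prefix}prompt_type_dist.{n_types}": best_dist,
    }


# Two hypothetical cluster centroids in a 2-dimensional space.
centroids = [(1.0, 0.0), (0.0, 1.0)]
near = describe_prompt((0.9, 0.1), centroids)    # close to centroid 0
far = describe_prompt((-1.0, -1.0), centroids)   # far from both centroids
```

Here near is assigned to type 0, while far exceeds the max_dist cutoff for both centroids and is assigned the null type -1.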

For an end-to-end implementation that runs several default values of the parameters, see the PromptTypeWrapper module.

Parameters
  • prompt_field – the name of the attribute of prompts to use as input to fit.

  • reference_field – the name of the attribute of responses to use as input to fit. a reasonable choice is to set to the same value as prompt_field.

  • output_field – the name of the attribute to write to in the transform step. the transformer outputs several fields, as listed above.

  • n_types – the number of types to infer. defaults to 8.

  • prompt_transform_field – the name of the attribute of prompts to use as input to transform; defaults to the same attribute as in fit.

  • reference_transform_field – the name of the attribute of responses to use as input to transform; defaults to the same attribute as in fit.

  • prompt__tfidf_min_df – the minimum frequency of prompt terms to use. can be specified as a fraction or as an absolute count, defaults to 100.

  • prompt__tfidf_max_df – the maximum frequency of prompt terms to use. can be specified as a fraction or as an absolute count, defaults to 0.1. Setting higher is more permissive, but may result in many stopword-like terms adding noise to the model.

  • reference__tfidf_min_df – the minimum frequency of response terms to use. can be specified as a fraction or as an absolute count, defaults to 100.

  • reference__tfidf_max_df – the maximum frequency of response terms to use. can be specified as a fraction or as an absolute count, defaults to 0.1.

  • snip_first_dim – whether or not to remove the first SVD dimension (which may add noise to the model; typically this reflects frequency rather than any semantic interpretation). defaults to True.

  • svd__n_components – the number of SVD dimensions to use, defaults to 25. higher values result in richer vector representations, perhaps at the cost of the model learning overly-specific types.

  • max_dist – the maximum distance between a vector representation of an utterance and the cluster centroid; an utterance whose distance to all centroids is above this cutoff will get assigned to a null type, denoted by -1. Defaults to 0.9.

  • random_state – the random seed to use.

  • verbosity – frequency of status messages.
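
To make the min_df and max_df parameters concrete, the following sketch filters terms by document frequency. It assumes the usual scikit-learn convention, where a float threshold is read as a fraction of documents and an integer as an absolute count; the helper select_terms and the toy documents are hypothetical and are not part of ConvoKit.

```python
from collections import Counter


def select_terms(docs, min_df=100, max_df=0.1):
    """Keep terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term at least once.
    df = Counter(term for doc in docs for term in set(doc.split()))

    def as_count(threshold):
        # Floats are read as a fraction of documents, ints as absolute counts.
        return threshold * n_docs if isinstance(threshold, float) else threshold

    low, high = as_count(min_df), as_count(max_df)
    return sorted(term for term, count in df.items() if low <= count <= high)


docs = ["do you agree", "do you accept", "will you agree", "why not"]
# Keep terms found in at least 2 documents and in at most 50% of documents.
kept = select_terms(docs, min_df=2, max_df=0.5)
```

In this toy input, "you" appears in three of four documents and is dropped by the max_df cutoff, while terms that occur only once are dropped by min_df.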

display_type(type_id, corpus=None, type_key=None, k=10)

For a particular prompt type, displays the representative prompt and response terms. can also display representative prompt and response utterances.

Parameters
  • type_id – ID of the prompt type to display.

  • corpus – pass in the training corpus to also display representative utterances.

  • type_key – the name of the prompt type clustering model to use. defaults to n_types that the model was initialized with, but if refit_types is called with different number of types, can be modified to display this updated model as well.

  • k – the number of sample terms (or utterances) to display.

Returns

None

dump_model(model_dir, type_keys='default', dump_train_corpus=True)

Dumps the model to disk.

Parameters
  • model_dir – directory to write model to

  • type_keys – if ‘default’, will only write the type clustering model corresponding to the n_types the model was initialized with. if ‘all’, will write all clustering models that have been trained via calls to refit_types. can also take a list of clustering models.

  • dump_train_corpus – whether to also write the representations and type assignments of the training corpus. defaults to True.

Returns

None

fit(corpus, y=None, prompt_selector=<function PromptTypes.<lambda>>, reference_selector=<function PromptTypes.<lambda>>)

Fits a PromptTypes model for a corpus – that is, learns latent representations of prompt and response terms, as well as prompt types.

Parameters
  • corpus – Corpus

  • prompt_selector – a boolean function of signature filter(utterance) that determines which utterances will be considered as prompts in the fit step. defaults to using all utterances which have a response.

  • reference_selector – a boolean function of signature filter(utterance) that determines which utterances will be considered as responses in the fit step. defaults to using all utterances which are responses to a prompt.

Returns

None
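
As a rough illustration of what the selector arguments look like, the sketch below defines two boolean functions of a single utterance and applies them to a few stand-in objects. FakeUtterance is a hypothetical stand-in for a ConvoKit Utterance, and the metadata keys is_question and is_answer are assumptions that may not exist in a given corpus.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeUtterance:
    """Hypothetical stand-in for an Utterance; only the fields used below."""
    id: str
    reply_to: Optional[str] = None
    meta: dict = field(default_factory=dict)


def prompt_selector(utt):
    # Treat only utterances flagged as questions as prompts (assumed metadata key).
    return utt.meta.get("is_question", False)


def reference_selector(utt):
    # Treat only replies flagged as answers as responses (assumed metadata key).
    return utt.reply_to is not None and utt.meta.get("is_answer", False)


utts = [
    FakeUtterance("q1", meta={"is_question": True}),
    FakeUtterance("a1", reply_to="q1", meta={"is_answer": True}),
    FakeUtterance("c1", reply_to="a1"),
]
prompt_ids = [u.id for u in utts if prompt_selector(u)]
reference_ids = [u.id for u in utts if reference_selector(u)]
```

Functions of this shape could then be passed to fit through its prompt_selector and reference_selector arguments.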

get_model(type_keys='default')
Returns the model as a dictionary containing:
  • embedding_model: stores information pertaining to the vector representations.
    • prompt_tfidf_model: sklearn tf-idf model that converts prompt input to term-document matrix

    • reference_tfidf_model: tf-idf model that converts response input to term-document matrix

    • svd_model: sklearn TruncatedSVD model that produces a low-dimensional representation of responses and prompts

    • U_prompt: vector representations of prompt terms

    • U_reference: vector representations of response terms

  • type_models: a dictionary mapping each type clustering model to:
    • km_model: a sklearn KMeans model of the learned types

    • prompt_df: distances to cluster centroids, and type assignments, of prompt terms

    • reference_df: distances to cluster centroids, and type assignments, of reference terms

Parameters

type_keys – if ‘default’, will return the type clustering model corresponding to the n_types the model was initialized with. if ‘all’, returns all clustering models that have been trained via calls to refit_types. can also take a list of clustering models.

Returns

the prompt types model
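
The three accepted forms of type_keys can be summarized with a small helper. This is a sketch of the behaviour described above, not the library's own code; resolve_type_keys and its arguments are hypothetical names.

```python
def resolve_type_keys(type_keys, default_n_types, trained_keys):
    """Map the type_keys argument to a concrete list of clustering-model names."""
    if type_keys == "default":
        # Only the clustering model the instance was initialized with.
        return [default_n_types]
    if type_keys == "all":
        # Every clustering model trained so far, e.g. via refit_types.
        return list(trained_keys)
    # Otherwise, an explicit list of clustering-model names.
    return list(type_keys)


# Suppose the model was initialized with 8 types and later refit with 12.
trained = [8, 12]
default_only = resolve_type_keys("default", 8, trained)
everything = resolve_type_keys("all", 8, trained)
explicit = resolve_type_keys([12], 8, trained)
```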

load_model(model_dir, type_keys='default', load_train_corpus=True)

Loads the model from disk.

Parameters
  • model_dir – directory to read model from

  • type_keys – if ‘default’, will only read the type clustering model corresponding to the n_types the model was initialized with. if ‘all’, will read all clustering models that are available in directory. can also take a list of clustering models.

  • load_train_corpus – whether to also read the representations and type assignments of the training corpus. defaults to True.

Returns

None

refit_types(n_types, random_state=None, name=None)

Using the latent representations of prompt terms learned during the initial fit call, infers n_types prompt types. Permits retraining the clustering model that determines the number of types, on top of the initial model. Calling this and updating the default_n_types field of the model will result in future transform calls assigning utterances to one of n_types prompt types.

Parameters
  • n_types – number of types to learn

  • random_state – random seed

  • name – the name of the new type model. defaults to n_types.

Returns

None

summarize(corpus, type_ids=None, type_key=None, k=10)

Displays representative prompt and response terms and utterances for each type learned. A wrapper for display_type.

Parameters
  • corpus – corpus to display utterances for (must have transform() called on it)

  • type_ids – ID of the prompt type to display. if None, will display all types.

  • type_key – the name of the prompt type clustering model to use. defaults to n_types that the model was initialized with, but if refit_types is called with different number of types, can be modified to display this updated model as well.

  • k – the number of sample terms (or utterances) to display.

Returns

None

transform(corpus, use_fit_selectors=True, prompt_selector=<function PromptTypes.<lambda>>, reference_selector=<function PromptTypes.<lambda>>)

Computes vector representations and prompt type assignments for utterances in a corpus.

Parameters
  • corpus – Corpus

  • use_fit_selectors – defaults to True, will use the same filters as the fit step to determine which utterances will be considered as prompts and responses in the transform step.

  • prompt_selector – filter that determines which utterances will be considered as prompts in the transform step. defaults to prompt_selector, the same as is used in fit.

  • reference_selector – filter that determines which utterances will be considered as responses in the transform step. defaults to reference_selector, the same as is used in fit.

Returns

the corpus, with per-utterance representations and type assignments.
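
A minimal usage sketch of the fit and transform steps follows, using only the constructor arguments and methods documented in this section. It assumes that ConvoKit is installed and that every utterance already carries a text attribute holding space-separated terms; the attribute name "arcs", the output field name, and the wrapper function fit_prompt_types are assumptions of this example.

```python
def fit_prompt_types(corpus, text_field="arcs", n_types=8, random_state=0):
    """Fit PromptTypes on a corpus and annotate it (sketch, assumes ConvoKit)."""
    # Imported lazily so that this sketch can be loaded without ConvoKit installed.
    from convokit import PromptTypes

    pt = PromptTypes(
        prompt_field=text_field,       # attribute read for prompts (assumed name)
        reference_field=text_field,    # same attribute for responses, as suggested above
        output_field="prompt_types",   # prefix of the attributes written by transform
        n_types=n_types,
        random_state=random_state,
    )
    pt.fit(corpus)                 # learns the embeddings and the type clustering
    corpus = pt.transform(corpus)  # annotates utterances with representations and types
    pt.summarize(corpus, k=10)     # prints representative terms and utterances per type
    return pt, corpus
```

On small corpora, the default document-frequency thresholds may leave very few terms, so the prompt__tfidf_min_df and reference__tfidf_min_df arguments may need to be lowered.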

transform_utterance(utterance)

Computes vector representations and prompt type assignments for a single utterance.

Parameters

utterance – the utterance.

Returns

the utterance, annotated with representations and type assignments.

convokit.prompt_types.promptTypes.assign_prompt_types(model, ids, vects, max_dist=0.9)

Standalone function that returns type assignments of input vectors given a trained PromptTypes type model. See docstring of PromptTypes class for details.

Parameters
  • model – prompt type model

  • ids – ids of input vectors

  • vects – input vectors

Returns

a dataframe storing cluster centroid distances and the assigned type.

convokit.prompt_types.promptTypes.fit_prompt_embedding_model(prompt_input, reference_input, snip_first_dim=True, prompt__tfidf_min_df=100, prompt__tfidf_max_df=0.1, reference__tfidf_min_df=100, reference__tfidf_max_df=0.1, svd__n_components=25, random_state=None, verbosity=0)

Standalone function that fits an embedding model given paired prompt and response inputs. See docstring of the PromptTypes class for details.

Parameters
  • prompt_input – list of prompts (represented as space-separated strings of terms)

  • reference_input – list of responses (represented as space-separated strings of terms). note that each entry of reference_input should be a response to the corresponding entry in prompt_input.

Returns

prompt embedding model

convokit.prompt_types.promptTypes.fit_prompt_type_model(model, n_types, random_state=None, max_dist=0.9, verbosity=0)

Standalone function that fits a prompt type model given a fitted prompt embedding model. See docstring of the PromptTypes class for details.

Parameters
  • model – prompt embedding model (from fit_prompt_embedding_model())

  • n_types – number of prompt types to infer

Returns

prompt type model

convokit.prompt_types.promptTypes.transform_embeddings(model, ids, input, side='prompt', filter_empty=True)

Standalone function that returns vector representations of input text given a trained PromptTypes prompt_embedding_model. See docstring of PromptTypes class for details.

Parameters
  • model – prompt embedding model

  • ids – ids of input text

  • input – a list where each entry has corresponding id in the ids argument, and is a string of terms corresponding to an utterance.

  • side – whether to return prompt or response embeddings (“prompt” and “reference” respectively); defaults to “prompt”

  • filter_empty – if True, will not return embeddings for prompts with no terms.

Returns

the input IDs (ids) and the corresponding vector representations of the input (vect)

class convokit.prompt_types.promptTypeWrapper.PromptTypeWrapper(output_field='prompt_types', n_types=8, use_prompt_motifs=True, root_only=True, questions_only=True, enforce_caps=True, recompute_all=False, min_support=100, min_df=100, svd__n_components=25, max_df=0.1, max_dist=0.9, random_state=None, verbosity=10000)

This is a wrapper class implementing a pipeline that infers types of rhetorical intentions encapsulated by utterances in a corpus, in terms of their anticipated responses.

The pipeline involves:
  • parsing input text via TextParser

  • representing input text as dependency tree arcs, with nouns censored out, via CensorNouns, TextToArcs and QuestionSentences

  • extracting a set of “phrasings” from the corpus, using a PhrasingMotifs model

  • inferring prompt types and type assignments per-utterance, using a PromptTypes model.

While the pipeline computes many attributes of an utterance along the way, the overall goal is to assign each utterance to a prompt type. By default, the pipeline will focus on learning types of questions, in terms of how the questions are phrased. However, other options are possible (see parameters below). For further details, see the respective classes listed above.

Parameters
  • output_field – the name of the attribute to write to in the transform step. the transformer outputs several fields, corresponding to both vector representations and discrete type assignments.

  • n_types – the number of prompt types to infer.

  • use_prompt_motifs – whether to represent prompts in terms of how they are phrased. defaults to True. if False, will use individual dependency arcs as input (this might be better for noisier text)

  • root_only – whether to only use dependency arcs attached to the root of the parse. defaults to True. if False will also consider arcs beyond the root (may be better for noisier text)

  • questions_only – whether to only learn representations of questions (i.e., utterances containing sentences that end in question marks); defaults to True.

  • enforce_caps – whether to only fit and transform on sentences that start with capital letters. defaults to True, which is appropriate for formal settings like transcripts of institutional proceedings, where this is a check on how well-formed the input is. in less formal settings like social media, setting to False may be more appropriately permissive.

  • min_support – the minimum frequency of phrasings to extract.

  • min_df – the minimum frequency of prompt and response terms to consider when inferring types.

  • max_df – the maximum frequency of prompt and response terms to use. defaults to 0.1 (i.e., occurs in at most 10% of prompt-response pairs). Setting higher is more permissive, but may result in many stopword-like terms adding noise to the model.

  • svd__n_components – the number of SVD dimensions to use when inferring types, defaults to 25. higher values result in richer vector representations, perhaps at the cost of the model learning overly-specific types.

  • max_dist – the maximum distance between a vector representation of an utterance and the cluster centroid; an utterance whose distance to all centroids is above this cutoff will get assigned to a null type, denoted by -1. defaults to 0.9.

  • recompute_all – if False (the default), checks utterances to see if they already have an attribute computed, skipping over that utterance in the relevant step of the pipeline. if True, recomputes all attributes.

  • random_state – the random seed to use.

  • verbosity – frequency of status messages.
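
The wrapper can be exercised end to end roughly as follows. This is a sketch under stated assumptions: ConvoKit and its text-parsing dependencies are installed, the corpus is large enough for the default frequency thresholds, and the function name infer_question_types and its arguments are hypothetical. Only the constructor arguments and methods documented in this section are used.

```python
def infer_question_types(corpus, question_text, n_types=8):
    """Run the wrapper pipeline on a corpus and on one raw string (sketch)."""
    # Imported lazily so that this sketch can be loaded without ConvoKit installed.
    from convokit import PromptTypeWrapper

    ptw = PromptTypeWrapper(
        n_types=n_types,
        questions_only=True,  # the default: learn types of questions only
        random_state=0,
    )
    ptw.fit(corpus)                 # parses text, extracts phrasings, fits PromptTypes
    corpus = ptw.transform(corpus)  # annotates utterances with type assignments
    ptw.print_top_phrasings(10)     # prints the 10 most frequent phrasings
    # Raw strings are accepted as well as Utterance objects.
    utt = ptw.transform_utterance(question_text)
    return ptw, corpus, utt
```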

display_type(type_id, corpus=None, type_key=None, k=10)

For a particular prompt type, displays the representative prompt and response terms. Can also display representative prompt and response utterances.

Parameters
  • type_id – ID of the prompt type to display.

  • corpus – pass in the training corpus to also display representative utterances.

  • type_key – the name of the prompt type clustering model to use. defaults to n_types that the model was initialized with, but if refit_types is called with different number of types, can be modified to display this updated model as well.

  • k – the number of sample terms (or utterances) to display.

Returns

None

dump_model(model_dir, type_keys='default')

Writes the PhrasingMotifs (if applicable) and PromptTypes models to disk.

Parameters

model_dir – directory to write to.

Returns

None

fit(corpus, y=None)

Fits the model for a corpus – that is, computes all necessary utterance attributes, and fits the underlying PhrasingMotifs and PromptTypes models.

Parameters

corpus – Corpus

Returns

None

get_model(type_keys='default')
Returns the model:
  • pm_model: PhrasingMotifs model (if applicable, i.e., use_prompt_motifs=True)

  • pt_model: PromptTypes model

Parameters

type_keys – which numbers of prompt types to return corresponding PromptTypes model for

Returns

model

load_model(model_dir, type_keys='default')

Reads the PhrasingMotifs (if applicable) and PromptTypes models from disk.

Parameters

model_dir – directory to read from.

Returns

None

print_top_phrasings(k)

Prints the k most frequent phrasings from the PhrasingMotifs component of the pipeline, if phrasings are used.

Parameters

k – number of phrasings to print

Returns

None

refit_types(n_types, random_state=None, name=None)

Infers a different number of prompt types than the model was originally initialized with.

Parameters
  • n_types – number of types to learn

  • random_state – random seed

  • name – the name of the new type model. defaults to n_types.

Returns

None

summarize(corpus, type_ids=None, type_key=None, k=10)

Displays representative prompt and response terms and utterances for each type learned.

Parameters
  • corpus – corpus to display utterances for (must have transform() called on it)

  • type_ids – ID of the prompt type to display. if None, will display all types.

  • type_key – the name of the prompt type clustering model to use. defaults to n_types that the model was initialized with, but if refit_types is called with different number of types, can be modified to display this updated model as well.

  • k – the number of sample terms (or utterances) to display.

Returns

None

transform(corpus)

Computes prompt type assignments for utterances in a corpus.

Parameters

corpus – Corpus

Returns

the corpus, with per-utterance representations and type assignments.

transform_utterance(utterance)

Computes prompt type assignments for individual utterances. Can take as input ConvoKit Utterances or raw strings. Will return assignments for all string input, even if the input is not a question.

Parameters

utterance – the utterance, as an Utterance or string.

Returns

the utterance, annotated with type assignments.