Phrasing Motifs

Extracts phrasing motifs from text, as described in this paper.

Example usage: using phrasing motifs as features in a prompt types model.

class convokit.phrasing_motifs.phrasingMotifs.PhrasingMotifs(output_field, fit_field, min_support, fit_filter=None, transform_field=None, transform_filter=None, deduplication_threshold=0.9, max_naive_itemset_size=5, max_itemset_size=10, verbosity=0)

A model that extracts a set of “phrasings” from a Corpus in the fit step, and that identifies which phrasings in this set are present in an utterance in the transform step. Phrasings intuitively correspond to frequently-occurring structures in the dependency trees of utterances, and are operationalized as frequently-occurring sets of dependency-parse arcs (though any other token-like input could work).
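The idea of "frequently-occurring sets of arcs" can be sketched in plain Python. This is only an illustration of naive frequent-itemset counting, not ConvoKit's implementation; the function name count_frequent_itemsets, the max_size argument, and the arc strings are made up for the example.

```python
from collections import Counter
from itertools import combinations


def count_frequent_itemsets(sentences, min_support, max_size=3):
    """Count every arc subset up to max_size and keep those seen often enough.

    Hypothetical helper: each sentence is a string of space-separated arcs.
    """
    counts = Counter()
    for arcs in sentences:
        # Deduplicate and sort so each subset has one canonical tuple form.
        unique_arcs = sorted(set(arcs.split()))
        for size in range(1, max_size + 1):
            for subset in combinations(unique_arcs, size):
                counts[subset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Illustrative arcs only; real arcs come from a dependency parse.
sentences = [
    "agree_do agree_you",
    "agree_do agree_you agree_with",
    "agree_you agree_with",
]
frequent = count_frequent_itemsets(sentences, min_support=2)
```

Enumerating all subsets grows exponentially with sentence length, which is presumably why the class switches strategy above max_naive_itemset_size.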

As input, the model expects utterances with a field containing either a string of space-separated tokens or arcs, or a list of such strings.

As output in the transform step it produces:

a list, one per sentence, of space-separated phrasings contained in each sentence, where each phrasing is represented as a string where components (e.g., arcs) are separated by double underscores ‘__’.
a list of sink phrasings in each sentence, i.e., the most finely-specified phrasings encapsulated by the sentence (e.g., “do you agree…” is more finely-specified than “…agree…”).
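The output format described above can be unpacked with ordinary string operations. The snippet below is a small sketch; split_phrasings is a hypothetical helper and the example strings are invented, not actual model output.

```python
def split_phrasings(sentence_output):
    """Split one sentence's space-separated phrasings into component lists.

    Components within a phrasing are separated by double underscores.
    """
    return [phrasing.split("__") for phrasing in sentence_output.split()]


# Invented example: one string of phrasings per sentence of an utterance.
utterance_output = ["agree_do__agree_you agree_you", "is_what"]
parsed = [split_phrasings(sentence) for sentence in utterance_output]
```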

Internally the model contains the following elements:

itemset_counts: a dictionary of phrasings to frequencies in the training data
downlinks: the graph structure representing the relationship between phrasings, used later in the transform step to determine which phrasings a sentence contains.
itemset_to_ids: maps phrasings to their de-duplicated forms.
min_support: the minimum frequency of a subset that is to be considered a phrasing

Parameters

output_field – name of attribute to write phrasings to in the transform step. Sink phrasings will be written to the field <output_field>__sink.
fit_field – name of attribute to use as input in fit.
min_support – the minimum frequency of phrasings to return
fit_filter – a boolean function of signature fit_filter(utterance). During the fit step, phrasings will only be computed over utterances where fit_filter returns True. By default, it always returns True, meaning that all utterances will be used.
transform_field – name of attribute to use as input in transform; defaults to the same field used in fit.
transform_filter – a boolean function of signature transform_filter(utterance). During the transform step, phrasings will only be computed over utterances where transform_filter returns True. Defaults to the filter used in the fit step.
deduplication_threshold – merges two phrasings into a single phrasing if they co-occur above this frequency, i.e., if both Pr(phrasing 1 | phrasing 2) and Pr(phrasing 2 | phrasing 1) exceed the threshold.
max_naive_itemset_size – maximum size of subsets to enumerate exhaustively. Above this size, a variant of the Apriori algorithm is used instead of enumerating all possible subsets.
max_itemset_size – maximum size of subsets to consider as phrasings. Setting this lower runs faster but misses more complex phrasings.
verbosity – frequency of status messages.
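The deduplication_threshold criterion can be illustrated as a two-way conditional-frequency check. The following is a minimal sketch under assumptions: should_merge is a hypothetical helper, the occurrence sets are invented, and treating a value equal to the threshold as a merge is a guess rather than documented behavior.

```python
def should_merge(ids_a, ids_b, threshold=0.9):
    """Check whether two phrasings co-occur often enough in both directions.

    ids_a and ids_b are the sets of sentence IDs containing each phrasing.
    """
    if not ids_a or not ids_b:
        return False
    shared = len(ids_a & ids_b)
    # Pr(b | a) and Pr(a | b), estimated from the occurrence sets.
    return shared / len(ids_a) >= threshold and shared / len(ids_b) >= threshold


# Invented occurrence data: sentence IDs in which each phrasing appears.
occurrences = {
    "agree_do": set(range(10)),
    "agree_do__agree_you": set(range(10)),
    "agree_you": set(range(20)),
}
merge_exact = should_merge(occurrences["agree_do"], occurrences["agree_do__agree_you"])
merge_partial = should_merge(occurrences["agree_you"], occurrences["agree_do__agree_you"])
```

Requiring the condition in both directions keeps a very common phrasing from absorbing a rarer one that merely happens to appear alongside it.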

dump_model(model_dir)

Writes the model to disk.

Outputs one JSON file per model component.

Parameters: model_dir – directory to write to.
Returns: None
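The "one JSON file per component" layout can be sketched as follows. This shows the general pattern only: dump_components and load_components are hypothetical helpers, and the file names and component contents are assumptions, not the names ConvoKit actually writes.

```python
import json
import os
import tempfile


def dump_components(components, model_dir):
    """Write each component to its own <name>.json file (names are assumed)."""
    os.makedirs(model_dir, exist_ok=True)
    for name, value in components.items():
        with open(os.path.join(model_dir, name + ".json"), "w") as f:
            json.dump(value, f)


def load_components(model_dir):
    """Read every .json file in model_dir back into a dict keyed by file stem."""
    components = {}
    for filename in os.listdir(model_dir):
        if filename.endswith(".json"):
            with open(os.path.join(model_dir, filename)) as f:
                components[filename[: -len(".json")]] = json.load(f)
    return components


# Invented component values, using string keys so they survive a JSON round trip.
components = {
    "itemset_counts": {"agree_do__agree_you": 2, "agree_you": 3},
    "min_support": 2,
}
with tempfile.TemporaryDirectory() as model_dir:
    dump_components(components, model_dir)
    written_files = sorted(os.listdir(model_dir))
    restored = load_components(model_dir)
```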

fit(corpus, y=None)

Fits a PhrasingMotifs model for a corpus – that is, extracts all phrasings from the corpus.

Parameters: corpus – Corpus
Returns: None

get_model()

Returns the PhrasingMotifs model. See class docstring for description of fields.

Returns: PhrasingMotifs model

load_model(model_dir)

Loads a saved PhrasingMotifs model from disk.

Parameters: model_dir – directory to read model from.
Returns: None

print_top_phrasings(k)

Prints the k most frequent phrasings.

Parameters: k – number of phrasings to print
Returns: None

convokit.phrasing_motifs.phrasingMotifs.extract_phrasing_motifs(set_dict, min_support, deduplication_threshold=0.9, max_naive_itemset_size=5, max_itemset_size=10, verbosity=0)

Standalone function to extract phrasings, i.e., frequently-occurring collections of dependency-parse arcs (or other token-like objects).

Parameters

set_dict – dictionary mapping an ID (e.g., an utterance-sentence ID) to the collection of arcs in the corresponding object.
min_support – minimum frequency of phrasings to return
deduplication_threshold – merges phrasings into a single phrasing if phrasings co-occur above this frequency
max_naive_itemset_size – maximum size of subsets to compute.
max_itemset_size – maximum size of subsets to consider as phrasings.
verbosity – frequency of status messages.

Returns

phrasing motifs model, as a dict containing each component.
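To make the shape of set_dict concrete, here is a rough sketch that builds such a dictionary from per-sentence arc strings. The helper build_set_dict, the (utterance ID, sentence index) key format, and the use of Python sets for the arc collections are all assumptions for illustration; the function may expect a different ID or collection type.

```python
def build_set_dict(arcs_by_utterance):
    """Map a per-sentence ID to the collection of arcs in that sentence.

    arcs_by_utterance maps an utterance ID to a list of strings, one per
    sentence, each holding space-separated arcs.
    """
    set_dict = {}
    for utterance_id, sentences in arcs_by_utterance.items():
        for index, arcs in enumerate(sentences):
            # Hypothetical ID scheme: (utterance ID, sentence index).
            set_dict[(utterance_id, index)] = set(arcs.split())
    return set_dict


# Invented utterance IDs and arcs.
arcs_by_utterance = {
    "utt_1": ["agree_do agree_you", "is_what"],
    "utt_2": ["agree_you agree_with"],
}
set_dict = build_set_dict(arcs_by_utterance)
```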

convokit.phrasing_motifs.phrasingMotifs.get_phrasing_motifs(arcs_per_sent, phrasing_motif_info)

Standalone function that returns phrasings and sink phrasings, given an utterance (consisting of a list of space-separated arcs per sentence) and a phrasing motif model.

Parameters

arcs_per_sent – input arcs per sentence
phrasing_motif_info – phrasing motif model

Returns

phrasings and sink phrasings
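One way to think about sink phrasings is as the phrasings in a sentence that are not contained in any other phrasing found in that sentence. The sketch below encodes that reading; interpreting "most finely-specified" as maximal under subset inclusion is an assumption, and sink_phrasings is a hypothetical helper rather than the library's logic (which relies on the model's downlinks graph).

```python
def sink_phrasings(phrasings):
    """Keep phrasings whose components are not a strict subset of another's.

    Each phrasing is a string of components joined by double underscores.
    """
    as_sets = {p: frozenset(p.split("__")) for p in phrasings}
    return [
        phrasing
        for phrasing, components in as_sets.items()
        if not any(components < other for other in as_sets.values())
    ]


# Invented phrasings found in a single sentence.
present = ["agree_you", "agree_do__agree_you", "agree_with"]
sinks = sink_phrasings(present)
```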

class convokit.phrasing_motifs.censorNouns.CensorNouns(output_field, input_field='parsed', input_filter=None, verbosity=0)

Transformer that, given a parse (formatted as the output of a TextParser transformer), returns a parse where nouns and pronouns are replaced with “NN~”. This is a rough heuristic for removing “content-related” tokens. The transformer also collapses constructions with Wh-determiners, e.g., “What time [is it]” into “What [is it]”.

Parameters

output_field – name of attribute to output to.
input_field – name of field to use as input. defaults to ‘parsed’, which stores dependency parses as returned by the TextParser transformer; otherwise expects similarly-formatted input.
input_filter – a boolean function of signature input_filter(utterance, aux_input). Censored parses will only be computed for utterances where input_filter returns True. By default, it always returns True, meaning that censored parses will be computed for all utterances.
verbosity – frequency of status messages.
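The noun-censoring heuristic can be approximated in a few lines. This sketch makes several assumptions: that a parse is a list of sentence dicts whose "toks" entries carry "tok" and "tag" keys, and that tags starting with NN or PRP identify nouns and pronouns. The helper censor_noun_tokens is hypothetical and does not reproduce the Wh-determiner collapsing described above.

```python
import copy

# Assumed tag prefixes for nouns and pronouns (Penn Treebank style).
NOUN_TAG_PREFIXES = ("NN", "PRP")


def censor_noun_tokens(parse):
    """Return a copy of the parse with noun and pronoun tokens set to 'NN~'."""
    censored = copy.deepcopy(parse)
    for sentence in censored:
        for token in sentence["toks"]:
            if token["tag"].startswith(NOUN_TAG_PREFIXES):
                token["tok"] = "NN~"
    return censored


# Invented parse in the assumed layout, for "Do you like pizza?".
parse = [
    {
        "toks": [
            {"tok": "Do", "tag": "VBP"},
            {"tok": "you", "tag": "PRP"},
            {"tok": "like", "tag": "VB"},
            {"tok": "pizza", "tag": "NN"},
            {"tok": "?", "tag": "."},
        ]
    }
]
censored = censor_noun_tokens(parse)
censored_tokens = [token["tok"] for token in censored[0]["toks"]]
```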

convokit.phrasing_motifs.censorNouns.censor_nouns(text_entry)

Stand-alone function that removes nouns from parsed text.

Parameters: text_entry – parsed text
Returns: parse with nouns censored out.

class convokit.phrasing_motifs.questionSentences.QuestionSentences(output_field, input_field, use_caps=True, filter_field='parsed', input_filter=None, verbosity=0)

Transformer that, given a list of sentences, returns a list containing only sentences which are questions (determined, as a rough heuristic, by whether they end in question marks). Returns an empty list if there are no questions.

Parameters

output_field – name of attribute to output to.
input_field – name of field to use as input. expects a list where each sentence corresponds to a sentence in filter_field.
use_caps – whether to only use sentences which start with a capital letter. Defaults to True.
filter_field – name of field to check for question marks in. Defaults to the output of the TextParser transformer. The entries of input_field and filter_field should exactly correspond.
input_filter – a boolean function of signature input_filter(utterance, aux_input). Question sentences will only be extracted for utterances where input_filter returns True. By default, it always returns True, meaning that question sentences will be extracted for all utterances.
verbosity – frequency of status messages.
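The question-filtering heuristic, including the correspondence between input_field and filter_field entries, can be sketched as below. This is a simplified illustration: select_question_entries is a hypothetical helper, and it treats the filter entries as plain sentence strings rather than parses.

```python
def select_question_entries(input_entries, filter_sentences, use_caps=True):
    """Keep input entries whose matching filter sentence looks like a question.

    A sentence counts as a question if it ends in '?', and, when use_caps is
    True, only if it also starts with a capital letter.
    """
    if len(input_entries) != len(filter_sentences):
        raise ValueError("input and filter entries must correspond one-to-one")
    kept = []
    for entry, sentence in zip(input_entries, filter_sentences):
        text = sentence.strip()
        is_question = text.endswith("?")
        starts_capitalized = text[:1].isupper()
        if is_question and (starts_capitalized or not use_caps):
            kept.append(entry)
    return kept


# Invented data: arcs per sentence, aligned with the raw sentence text.
arcs = ["agree_do agree_you", "think_i think_so", "really>*"]
texts = ["Do you agree?", "I think so.", "really?"]
with_caps = select_question_entries(arcs, texts)
without_caps = select_question_entries(arcs, texts, use_caps=False)
```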