Phrasing Motifs

Extracts phrasing motifs from text, as described in this paper.

Example usage: using phrasing motifs as features in a prompt types model.

class convokit.phrasing_motifs.phrasingMotifs.PhrasingMotifs(output_field, fit_field, min_support, fit_filter=None, transform_field=None, transform_filter=None, deduplication_threshold=0.9, max_naive_itemset_size=5, max_itemset_size=10, verbosity=0)

A model that extracts a set of “phrasings” from a Corpus in the fit step, and that identifies which phrasings in this set are present in an utterance in the transform step. Phrasings intuitively correspond to frequently-occurring structures in the dependency trees of utterances, and are operationalized as frequently-occurring sets of dependency-parse arcs (though any other token-like input could work).

The model expects as input utterances with a field consisting of either a string with space-separated tokens or arcs, or a list of such strings.

As output in the transform step it produces:
  • a list with one entry per sentence; each entry is a string of space-separated phrasings contained in that sentence, and each phrasing is itself a string whose components (e.g., arcs) are joined by double underscores ‘__’.

  • a list of sink phrasings in each sentence, i.e., the most finely-specified phrasings encapsulated by the sentence (e.g., “do you agree…” is more finely-specified than “…agree…”).
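The relationship between phrasings and sink phrasings can be illustrated with a small sketch. This is not the library's implementation (which, per the description below, relies on a precomputed graph of downlinks); it is a brute-force subset check, and the arc strings and helper names (contained_phrasings, sink_phrasings, to_string) are hypothetical.

```python
def contained_phrasings(sentence_arcs, phrasings):
    """Return the phrasings (frozensets of arcs) fully contained in a sentence.

    sentence_arcs is a string of space-separated arcs, as in the input format.
    """
    arcs = set(sentence_arcs.split())
    return [p for p in phrasings if p <= arcs]


def sink_phrasings(found):
    """Keep only phrasings that are not a strict subset of another found phrasing."""
    return [p for p in found if not any(p < q for q in found)]


def to_string(phrasing):
    """Render a phrasing as its components joined by double underscores."""
    return "__".join(sorted(phrasing))


# Hypothetical phrasings, as if produced by a fit step.
phrasings = [
    frozenset({"agree_*"}),
    frozenset({"agree_*", "agree_do"}),
    frozenset({"agree_*", "agree_do", "agree_you"}),
    frozenset({"think_*"}),
]

sentence = "agree_* agree_do agree_you do>you"
found = contained_phrasings(sentence, phrasings)
found_strs = [to_string(p) for p in found]
sink_strs = [to_string(p) for p in sink_phrasings(found)]
print(found_strs)
print(sink_strs)
```

Here the three “agree” phrasings are all contained in the sentence, but only the largest one is a sink, since the other two are subsets of it.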

Internally the model contains the following elements:
  • itemset_counts: a dictionary of phrasings to frequencies in the training data

  • downlinks: the graph structure representing the relationships between phrasings, used in the transform step to determine which phrasings are contained in a sentence.

  • itemset_to_ids: maps phrasings to their de-duplicated forms.

  • min_support: the minimum frequency a subset must have to be considered a phrasing.

Parameters
  • output_field – name of attribute to write phrasings to in the transform step. Sink phrasings will be written to the field <output_field>__sink.

  • fit_field – name of attribute to use as input in fit.

  • min_support – the minimum frequency of phrasings to return

  • fit_filter – a boolean function of signature fit_filter(utterance). During the fit step, phrasings will only be computed over utterances where fit_filter returns True. By default, it always returns True, meaning that all utterances will be used.

  • transform_field – name of attribute to use as input in transform; defaults to the same field used in fit.

  • transform_filter – a boolean function of signature transform_filter(utterance). During the transform step, phrasings will only be computed over utterances where transform_filter returns True. Defaults to the filter used in the fit step.

  • deduplication_threshold – merges two phrasings into a single phrasing if they co-occur above this frequency, i.e., if both pr(phrasing 1 | phrasing 2) and pr(phrasing 2 | phrasing 1) are above the threshold.

  • max_naive_itemset_size – maximum size of subsets to enumerate exhaustively. Above this size, a variant of the a-priori algorithm will be used in lieu of enumerating all possible subsets.

  • max_itemset_size – maximum size of subsets to consider as phrasings. Setting this lower will run faster but will miss more complex phrasings.

  • verbosity – frequency of status messages.
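To make the roles of min_support and the naive enumeration concrete, the following is a minimal sketch of exhaustive subset counting. It is an illustration only and should not be read as ConvoKit's code: the function name count_itemsets, its max_size argument, and the example arcs are assumptions, and the a-priori variant used for larger subsets is not shown.

```python
from collections import Counter
from itertools import combinations


def count_itemsets(set_dict, min_support, max_size=3):
    """Count every subset of up to max_size items and keep the frequent ones.

    set_dict maps an ID (e.g., a sentence ID) to a collection of arcs.
    Each subset is counted at most once per ID.
    """
    counts = Counter()
    for items in set_dict.values():
        unique = sorted(set(items))
        for size in range(1, min(max_size, len(unique)) + 1):
            for subset in combinations(unique, size):
                counts[subset] += 1
    # Only subsets occurring at least min_support times count as phrasings.
    return {s: c for s, c in counts.items() if c >= min_support}


# Hypothetical input: three sentences and their arcs.
set_dict = {
    "s1": ["agree_*", "agree_do", "agree_you"],
    "s2": ["agree_*", "agree_do"],
    "s3": ["agree_*", "think_you"],
}
frequent = count_itemsets(set_dict, min_support=2)
print(frequent)
```

Exhaustive enumeration grows combinatorially with subset size, which is presumably why the model caps it at max_naive_itemset_size and switches strategy beyond that.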

dump_model(model_dir)

Writes the model to disk.

Will output one json file per model component.

Parameters

model_dir – directory to write to.

Returns

None

fit(corpus, y=None)

Fits a PhrasingMotifs model for a corpus – that is, extracts all phrasings from the corpus.

Parameters

corpus – Corpus

Returns

None

get_model()

Returns the PhrasingMotifs model. See class docstring for description of fields.

Returns

PhrasingMotifs model

load_model(model_dir)

Loads a saved PhrasingMotifs model from disk.

Parameters

model_dir – directory to read model from.

Returns

None
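The idea of writing one JSON file per model component, and reading the components back, can be sketched as follows. The file naming scheme (<component>.json), the helper names, and the toy model are assumptions made for illustration; the files actually written by dump_model may be named and structured differently.

```python
import json
import os
import tempfile


def dump_components(model, model_dir):
    """Write each model component to its own JSON file (hypothetical layout)."""
    os.makedirs(model_dir, exist_ok=True)
    for name, component in model.items():
        with open(os.path.join(model_dir, name + ".json"), "w") as f:
            json.dump(component, f)


def load_components(model_dir):
    """Read every JSON file in model_dir back into a dict of components."""
    model = {}
    for filename in sorted(os.listdir(model_dir)):
        if filename.endswith(".json"):
            with open(os.path.join(model_dir, filename)) as f:
                model[filename[: -len(".json")]] = json.load(f)
    return model


# Toy model with string keys, since JSON object keys must be strings.
model = {
    "itemset_counts": {"agree_*": 3, "agree_*__agree_do": 2},
    "min_support": 2,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    dump_components(model, tmp_dir)
    restored = load_components(tmp_dir)

print(restored == model)
```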

print_top_phrasings(k)

Prints the k most frequent phrasings.

Parameters

k – number of phrasings to print

Returns

None

convokit.phrasing_motifs.phrasingMotifs.extract_phrasing_motifs(set_dict, min_support, deduplication_threshold=0.9, max_naive_itemset_size=5, max_itemset_size=10, verbosity=0)

Standalone function that extracts phrasings, i.e., frequently-occurring collections of dependency-parse arcs (or other token-like objects).

Parameters
  • set_dict – dictionary mapping an ID (e.g., an utterance-sentence ID) to the collection of arcs in the corresponding object.

  • min_support – minimum frequency of phrasings to return

  • deduplication_threshold – merges phrasings into a single phrasing if phrasings co-occur above this frequency

  • max_naive_itemset_size – maximum size of subsets to compute.

  • max_itemset_size – maximum size of subsets to consider as phrasings.

  • verbosity – frequency of status messages.

Returns

phrasing motifs model, as a dict containing each component.
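The deduplication criterion can be sketched as a check on conditional co-occurrence in both directions. This is a simplified illustration under stated assumptions: the function should_merge is hypothetical, each phrasing is represented by the set of sentence IDs it occurs in, and whether the library compares with a strict or non-strict inequality is not specified in this documentation.

```python
def should_merge(ids_a, ids_b, threshold=0.9):
    """Decide whether two phrasings co-occur often enough to be merged.

    ids_a and ids_b are the sets of sentence IDs containing each phrasing.
    Both pr(a | b) and pr(b | a) must reach the threshold (assumed: >=).
    """
    if not ids_a or not ids_b:
        return False
    both = len(ids_a & ids_b)
    return both / len(ids_b) >= threshold and both / len(ids_a) >= threshold


a = set(range(10))  # phrasing a occurs in sentences 0..9
b = set(range(11))  # phrasing b occurs in sentences 0..10
c = set(range(5))   # phrasing c occurs in sentences 0..4

merge_ab = should_merge(a, b)  # pr(a | b) = 10/11, pr(b | a) = 10/10
merge_ac = should_merge(a, c)  # pr(a | c) = 5/5, but pr(c | a) = 5/10
print(merge_ab, merge_ac)
```

Requiring both directions keeps a rare, specific phrasing from being absorbed into a common one merely because the rare phrasing always appears alongside it.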

convokit.phrasing_motifs.phrasingMotifs.get_phrasing_motifs(arcs_per_sent, phrasing_motif_info)

Standalone function that returns phrasings and sink phrasings, given an utterance (a list with one string of space-separated arcs per sentence) and a phrasing motif model.

Parameters
  • arcs_per_sent – input arcs per sentence

  • phrasing_motif_info – phrasing motif model

Returns

phrasings and sink phrasings

class convokit.phrasing_motifs.censorNouns.CensorNouns(output_field, input_field='parsed', input_filter=None, verbosity=0)

Transformer that, given a parse (formatted as the output of a TextParser transformer), returns a parse where nouns and pronouns are replaced with “NN~”. This is a rough heuristic for removing “content-related” tokens. The transformer also collapses constructions with Wh-determiners, e.g., What time [is it] into What [is it].

Parameters
  • output_field – name of attribute to output to.

  • input_field – name of field to use as input. Defaults to ‘parsed’, which stores dependency parses as returned by the TextParser transformer; otherwise expects similarly-formatted input.

  • input_filter – a boolean function of signature input_filter(utterance, aux_input). Censored parses will only be computed for utterances where input_filter returns True. By default, it always returns True, meaning that censored parses will be computed for all utterances.

  • verbosity – frequency of status messages.

convokit.phrasing_motifs.censorNouns.censor_nouns(text_entry)

Standalone function that censors nouns in parsed text.

Parameters

text_entry – parsed text

Returns

parse with nouns censored out.
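A rough picture of the censoring step, under simplifying assumptions: the parse is taken to be a list of sentences, each a dict with a 'toks' list of token dicts carrying 'tok' and 'tag' fields, and a token is treated as a noun or pronoun purely by its part-of-speech tag. The function censor_nouns_sketch and the NOUN_TAGS set are hypothetical, and the Wh-determiner collapsing described above is not reproduced.

```python
# Penn Treebank tags treated as nouns or pronouns in this sketch (an assumption).
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS", "PRP", "PRP$"}


def censor_nouns_sketch(parsed):
    """Return a copy of the parse with noun and pronoun tokens replaced by 'NN~'."""
    censored = []
    for sent in parsed:
        toks = []
        for t in sent["toks"]:
            new_t = dict(t)  # copy so the input parse is left untouched
            if t["tag"] in NOUN_TAGS:
                new_t["tok"] = "NN~"
            toks.append(new_t)
        censored.append(dict(sent, toks=toks))
    return censored


# Hypothetical parse of "What do you think?"
parsed = [
    {
        "toks": [
            {"tok": "What", "tag": "WP"},
            {"tok": "do", "tag": "VBP"},
            {"tok": "you", "tag": "PRP"},
            {"tok": "think", "tag": "VB"},
            {"tok": "?", "tag": "."},
        ]
    }
]

censored = censor_nouns_sketch(parsed)
censored_tokens = [t["tok"] for t in censored[0]["toks"]]
print(censored_tokens)
```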

class convokit.phrasing_motifs.questionSentences.QuestionSentences(output_field, input_field, use_caps=True, filter_field='parsed', input_filter=None, verbosity=0)

Transformer that, given a list of sentences, returns a list containing only sentences which are questions (determined, as a rough heuristic, by whether they end in question marks). Returns an empty list if there are no questions.

Parameters
  • output_field – name of attribute to output to.

  • input_field – name of field to use as input. Expects a list where each entry corresponds to a sentence in filter_field.

  • use_caps – whether to only use sentences which start with a capital letter. Defaults to True.

  • filter_field – name of field to check for question marks in. Defaults to the output of the TextParser transformer. The entries of input_field and filter_field should correspond exactly.

  • input_filter – a boolean function of signature input_filter(utterance, aux_input). Question sentences will only be extracted for utterances where input_filter returns True. By default, it always returns True, meaning that all utterances will be processed.

  • verbosity – frequency of status messages.
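The filtering heuristic can be sketched over two parallel lists: the entries to keep (e.g., arcs per sentence) and the sentences used to decide whether each entry is a question. For simplicity this sketch assumes the filter entries are raw sentence strings rather than parses, and the function name question_sentences_sketch is hypothetical.

```python
def question_sentences_sketch(sentences, filter_texts, use_caps=True):
    """Keep entries of sentences whose matching filter text looks like a question.

    A text counts as a question if it ends in a question mark and, when
    use_caps is True, also starts with a capital letter.
    """
    kept = []
    for sent, text in zip(sentences, filter_texts):
        text = text.strip()
        if not text.endswith("?"):
            continue
        if use_caps and not text[:1].isupper():
            continue
        kept.append(sent)
    return kept


# Hypothetical arcs per sentence, with the sentence texts they came from.
arcs = ["agree_* agree_do", "fine_*", "think_* think_what"]
texts = ["Do you agree?", "That is fine.", "what do you think?"]

with_caps = question_sentences_sketch(arcs, texts)
without_caps = question_sentences_sketch(arcs, texts, use_caps=False)
no_questions = question_sentences_sketch(["fine_*"], ["That is fine."])
print(with_caps, without_caps, no_questions)
```

With use_caps enabled, the lowercase-initial third sentence is dropped even though it ends in a question mark, and an utterance without questions yields an empty list.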