Phrasing Motifs
Extracts phrasing motifs from text, as described in this paper.
Example usage: using phrasing motifs as features in a prompt types model.
class convokit.phrasing_motifs.phrasingMotifs.PhrasingMotifs(output_field, fit_field, min_support, fit_filter=None, transform_field=None, transform_filter=None, deduplication_threshold=0.9, max_naive_itemset_size=5, max_itemset_size=10, verbosity=0)
A model that extracts a set of “phrasings” from a Corpus in the fit step, and that identifies which phrasings in this set are present in an utterance in the transform step. Phrasings intuitively correspond to frequently-occurring structures in the dependency trees of utterances, and are operationalized as frequently-occurring sets of dependency-parse arcs (though any other token-like input could work).
The model expects as input utterances with a field consisting of either a string with space-separated tokens or arcs, or a list of such strings.
- As output in the transform step it produces:
a list, one per sentence, of space-separated phrasings contained in each sentence, where each phrasing is represented as a string where components (e.g., arcs) are separated by double underscores ‘__’.
a list of sink phrasings in each sentence, i.e., the most finely-specified phrasings encapsulated by the sentence (e.g., “do you agree…” is more finely-specified than “…agree…”).
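The output format described above can be unpacked with plain string operations. The following is a minimal sketch, not part of ConvoKit: the helper name `split_phrasings` and the arc strings in the example are made up for illustration.

```python
def split_phrasings(sentence_output):
    """Split one sentence's space-separated phrasings into lists of components.

    Each phrasing is a string whose components (e.g., arcs) are joined by '__'.
    """
    return [phrasing.split("__") for phrasing in sentence_output.split()]


# Hypothetical transform output for a two-sentence utterance.
# The arc strings here are illustrative, not real model output.
example_output = ["agree_do agree_do__agree_you", "think_*"]

parsed = [split_phrasings(sentence) for sentence in example_output]
for sentence_phrasings in parsed:
    print(sentence_phrasings)
```

Each inner list holds the components of one phrasing, so a single-arc phrasing becomes a one-element list.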
- Internally the model contains the following elements:
itemset_counts: a dictionary of phrasings to frequencies in the training data
downlinks: the graph structure representing the relationships between phrasings. Used later to determine which phrasings are contained in a sentence in the transform step.
itemset_to_ids: maps phrasings to their de-duplicated forms.
min_support: the minimum frequency of a subset that is to be considered a phrasing
- Parameters
output_field – name of attribute to write phrasings to in transform step. Sink phrasings will be written to field <output_field>__sink.
fit_field – name of attribute to use as input in fit.
min_support – the minimum frequency of phrasings to return
fit_filter – a boolean function of signature fit_filter(utterance). During the fit step, phrasings will only be computed over utterances where fit_filter returns True. By default, it always returns True, meaning that all utterances will be used.
transform_field – name of attribute to use as input in transform; defaults to the same field used in fit.
transform_filter – a boolean function of signature transform_filter(utterance). During the transform step, phrasings will only be computed over utterances where transform_filter returns True. Defaults to the filter used in the fit step.
deduplication_threshold – merges phrasings into a single phrasing if they co-occur above this frequency (i.e., if both Pr(phrasing 1 | phrasing 2) and Pr(phrasing 2 | phrasing 1) exceed the threshold).
max_naive_itemset_size – maximum size of subsets to enumerate exhaustively. Above this size, a variant of the a-priori algorithm will be used in lieu of enumerating all possible subsets.
max_itemset_size – maximum size of subsets to consider as phrasings. Setting this lower will run faster but miss more complex phrasings.
verbosity – frequency of status messages.
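To make the role of min_support and the naive enumeration concrete, here is a minimal sketch of counting frequent arc subsets by brute force. It is not ConvoKit's implementation: the function name `count_frequent_itemsets` is hypothetical, min_support is assumed to be a raw count of sentences, and the a-priori variant used for larger subsets is not shown.

```python
from collections import Counter
from itertools import combinations


def count_frequent_itemsets(set_dict, min_support, max_size=5):
    """Count every arc subset of up to max_size items, keeping the frequent ones.

    set_dict maps a sentence ID to the collection of arcs in that sentence.
    Assumption: min_support is a minimum number of sentences, not a fraction.
    """
    counts = Counter()
    for items in set_dict.values():
        unique = sorted(set(items))  # canonical order so equal subsets match
        for size in range(1, min(max_size, len(unique)) + 1):
            for subset in combinations(unique, size):
                counts[subset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Illustrative input; the arc strings are made up.
sentences = {
    "s1": ["agree_do", "agree_you", "agree_*"],
    "s2": ["agree_do", "agree_you", "think_*"],
    "s3": ["agree_*"],
}
frequent = count_frequent_itemsets(sentences, min_support=2)
for itemset, n in sorted(frequent.items()):
    print("__".join(itemset), n)
```

Exhaustive enumeration grows combinatorially with subset size, which is presumably why the model switches strategies above max_naive_itemset_size.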
dump_model(model_dir)
Writes the model to disk.
Will output one json file per model component.
- Parameters
model_dir – directory to write to.
- Returns
None
fit(corpus, y=None)
Fits a PhrasingMotifs model for a corpus – that is, extracts all phrasings from the corpus.
- Parameters
corpus – Corpus
- Returns
None
get_model()
Returns the PhrasingMotifs model. See class docstring for description of fields.
- Returns
PhrasingMotifs model
load_model(model_dir)
Loads a saved PhrasingMotifs model from disk.
- Parameters
model_dir – directory to read model from.
- Returns
None
print_top_phrasings(k)
Prints the k most frequent phrasings.
- Parameters
k – number of phrasings to print
- Returns
None
convokit.phrasing_motifs.phrasingMotifs.extract_phrasing_motifs(set_dict, min_support, deduplication_threshold=0.9, max_naive_itemset_size=5, max_itemset_size=10, verbosity=0)
Standalone function to extract phrasings – i.e., frequently-occurring collections of dependency-parse arcs (or other token-like objects).
- Parameters
set_dict – dictionary mapping an ID (e.g., an utterance-sentence ID) to the collection of arcs in the corresponding object.
min_support – minimum frequency of phrasings to return
deduplication_threshold – merges phrasings into a single phrasing if phrasings co-occur above this frequency
max_naive_itemset_size – maximum size of subsets to compute.
max_itemset_size – maximum size of subsets to consider as phrasings.
verbosity – frequency of status messages.
- Returns
phrasing motifs model, as a dict containing each component.
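The deduplication criterion can be illustrated with a small sketch. This is an assumption-laden reading of the parameter description, not the library's code: the helper `should_merge` is hypothetical, each phrasing is represented by the set of sentence IDs that contain it, and "above this frequency" is taken to mean a strict inequality in both directions.

```python
def should_merge(ids_a, ids_b, threshold=0.9):
    """Decide whether two phrasings co-occur often enough to be merged.

    ids_a and ids_b are the sets of sentence IDs containing each phrasing.
    Assumption: both Pr(a | b) and Pr(b | a) must exceed the threshold.
    """
    if not ids_a or not ids_b:
        return False
    both = len(ids_a & ids_b)
    pr_b_given_a = both / len(ids_a)
    pr_a_given_b = both / len(ids_b)
    return pr_b_given_a > threshold and pr_a_given_b > threshold


# Illustrative ID sets.
always_together = should_merge(set(range(10)), set(range(10)))
one_sided = should_merge(set(range(10)), set(range(5)))
print(always_together, one_sided)
```

In the second case the smaller phrasing always appears with the larger one, but not the reverse, so under this reading the two would stay separate.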
convokit.phrasing_motifs.phrasingMotifs.get_phrasing_motifs(arcs_per_sent, phrasing_motif_info)
Standalone function that returns phrasings and sink phrasings given an utterance (consisting of a list with one string of space-separated arcs per sentence) and a phrasing motif model.
- Parameters
arcs_per_sent – input arcs per sentence
phrasing_motif_info – phrasing motif model
- Returns
phrasings and sink phrasings
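The relationship between phrasings and sink phrasings can be sketched with set containment. The code below is a brute-force stand-in under stated assumptions: `find_phrasings` is a hypothetical helper, phrasings are represented as tuples of arcs, and the real model's downlinks graph is replaced by direct subset checks.

```python
def find_phrasings(sentence_arcs, known_phrasings):
    """Return (contained, sinks) for one sentence of space-separated arcs.

    contained: known phrasings whose arcs all appear in the sentence.
    sinks: contained phrasings that are not a proper subset of another
           contained phrasing, i.e., the most finely-specified ones.
    """
    arcs = set(sentence_arcs.split())
    contained = [p for p in known_phrasings if set(p) <= arcs]
    sinks = [
        p for p in contained
        if not any(set(p) < set(q) for q in contained)
    ]
    return contained, sinks


# Illustrative phrasing set; the arc strings are made up.
known = [
    ("agree_*",),
    ("agree_*", "agree_do"),
    ("agree_*", "agree_do", "agree_you"),
    ("think_*",),
]
contained, sinks = find_phrasings("agree_* agree_do agree_you", known)
print(contained)
print(sinks)
```

Here all three "agree" phrasings are contained in the sentence, but only the three-arc phrasing is a sink, mirroring the “do you agree…” versus “…agree…” example.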
class convokit.phrasing_motifs.censorNouns.CensorNouns(output_field, input_field='parsed', input_filter=None, verbosity=0)
Transformer that, given a parse (formatted as the output of a TextParser transformer), returns a parse where nouns and pronouns are replaced with “NN~”. A rough heuristic for removing “content-related” tokens. This transformer also collapses constructions with Wh-determiners like What time [is it] into What [is it].
- Parameters
output_field – name of attribute to output to.
input_field – name of field to use as input. Defaults to ‘parsed’, which stores dependency parses as returned by the TextParser transformer; otherwise expects similarly-formatted input.
input_filter – a boolean function of signature input_filter(utterance, aux_input). Censored parses will only be computed for utterances where input_filter returns True. By default, it always returns True, meaning that all utterances will be processed.
verbosity – frequency of status messages.
convokit.phrasing_motifs.censorNouns.censor_nouns(text_entry)
Standalone function that removes nouns from parsed text.
- Parameters
text_entry – parsed text
- Returns
parse with nouns censored out.
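The noun-censoring heuristic can be approximated as follows. This is a simplified sketch rather than the library's logic: the parse layout (a list of sentences, each a dict with a 'toks' list of {'tok', 'tag'} entries) and the choice of Penn Treebank tag prefixes are assumptions, `censor_noun_tokens` is a hypothetical name, and the collapsing of Wh-determiner constructions is not attempted.

```python
import copy

# Assumption: nouns and pronouns are identified by Penn Treebank tag prefixes.
NOUN_TAG_PREFIXES = ("NN", "PRP")


def censor_noun_tokens(parsed_sentences):
    """Return a copy of the parse with noun and pronoun tokens set to 'NN~'."""
    censored = copy.deepcopy(parsed_sentences)  # leave the input untouched
    for sentence in censored:
        for token in sentence["toks"]:
            if token["tag"].startswith(NOUN_TAG_PREFIXES):
                token["tok"] = "NN~"
    return censored


# Illustrative, simplified parse of "Do you like cats?"
parse = [{
    "toks": [
        {"tok": "Do", "tag": "VBP"},
        {"tok": "you", "tag": "PRP"},
        {"tok": "like", "tag": "VB"},
        {"tok": "cats", "tag": "NNS"},
        {"tok": "?", "tag": "."},
    ]
}]
censored_parse = censor_noun_tokens(parse)
censored_tokens = [t["tok"] for t in censored_parse[0]["toks"]]
print(" ".join(censored_tokens))
```

Replacing content-bearing tokens this way leaves the function words and sentence structure intact, which is what phrasings are meant to capture.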
class convokit.phrasing_motifs.questionSentences.QuestionSentences(output_field, input_field, use_caps=True, filter_field='parsed', input_filter=None, verbosity=0)
Transformer that, given a list of sentences, returns a list containing only sentences which are questions (determined, as a rough heuristic, by whether they end in question marks). Returns an empty list if there are no questions.
- Parameters
output_field – name of attribute to output to.
input_field – name of field to use as input. Expects a list where each sentence corresponds to a sentence in filter_field.
use_caps – whether to only use sentences which start with a capital letter. Defaults to True.
filter_field – name of field to check for question marks in. Defaults to the output of the TextParser transformer. The entries of input_field and filter_field should exactly correspond.
input_filter – a boolean function of signature input_filter(utterance, aux_input). Question sentences will only be extracted for utterances where input_filter returns True. By default, it always returns True, meaning that all utterances will be processed.
verbosity – frequency of status messages.
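The question heuristic described for QuestionSentences can be sketched on plain strings. This is an illustration under assumptions, not the transformer itself: `question_sentences` is a hypothetical helper, and it checks the same strings it returns, whereas the transformer reads the sentences from input_field and checks for question marks in a separate filter_field.

```python
def question_sentences(sentences, use_caps=True):
    """Keep only sentences that end in '?', optionally requiring a capital start."""
    kept = []
    for sentence in sentences:
        text = sentence.strip()
        if not text.endswith("?"):
            continue  # not a question under this heuristic
        if use_caps and not text[0].isupper():
            continue  # skip sentences that do not start with a capital letter
        kept.append(sentence)
    return kept


# Illustrative input.
utterance_sentences = ["Do you agree?", "I think so.", "really?"]
with_caps = question_sentences(utterance_sentences)
without_caps = question_sentences(utterance_sentences, use_caps=False)
print(with_caps)
print(without_caps)
```

With use_caps enabled, the lowercase fragment "really?" is dropped, and an input with no questions yields an empty list.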