Text Processing

Various helpers to perform per-utterance computations

Example usage: demonstrating various text processing functionality.

class convokit.text_processing.textProcessor.TextProcessor(proc_fn, output_field, input_field=None, aux_input=None, input_filter=None, verbosity=0)

A base class for Transformers that perform per-utterance computations, i.e., computing utterance-by-utterance features or representations.

Parameters
  • proc_fn – function to compute per utterance. Supports one of two function signatures: proc_fn(input) and proc_fn(input, auxiliary_info).

  • input_field – If set to a string, the attribute of the utterance that proc_fn will take as input. If set to None, will default to reading utt.text. If set to a list of attributes, proc_fn will expect a dict of {attribute name: attribute value}.

  • output_field – If set to a string, the name of the attribute that the output of proc_fn will be written to. If set to a list, proc_fn will return a tuple where each entry in the tuple corresponds to a field in the list.

  • aux_input – any auxiliary input that proc_fn needs (e.g., a pre-loaded model); passed in as a dict.

  • input_filter – a boolean function of signature input_filter(utterance, aux_input). Attributes will only be computed for utterances where input_filter returns True; by default it always returns True, meaning that attributes are computed for all utterances.

  • verbosity – frequency at which to print status messages when computing attributes.

transform(corpus: convokit.model.corpus.Corpus) → convokit.model.corpus.Corpus

Computes per-utterance attributes for each utterance in the Corpus, storing these values in the output_field of each utterance as specified in the constructor. Utterances that lack any of the input_field attributes specified in the constructor, or for which input_filter returns False, are left unannotated.

Parameters

corpus – Corpus

Returns

the corpus

transform_utterance(utt, override_input_filter=False)

Computes per-utterance attributes of an individual utterance or string. Utterances that lack any of the input_field attributes specified in the constructor, or for which input_filter returns False, are left unannotated. Strings are converted to utterances and returned; the resulting utterance is annotated if input_field was set to None at initialization (a freshly converted string only carries its text, so no other input field is available).

Parameters
  • utt – utterance or a string

  • override_input_filter – ignore input_filter and compute attribute for all utterances

Returns

the utterance
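As a rough illustration of the two supported proc_fn signatures, the sketch below emulates the dispatch with plain Python. The names apply_processor, word_count, and count_keywords are hypothetical, chosen for this example; they are not part of ConvoKit.

```python
# Illustration of the two supported proc_fn signatures, using plain strings
# as stand-ins for utterance text. The dispatch below mirrors the documented
# behavior: proc_fn(input) when no aux_input is given, and
# proc_fn(input, aux_input) otherwise. This is a simplified emulation, not
# ConvoKit's internals.

def word_count(text):
    # signature 1: proc_fn(input)
    return len(text.split())

def count_keywords(text, aux_input):
    # signature 2: proc_fn(input, aux_input); aux_input is a dict
    return sum(text.lower().count(k) for k in aux_input['keywords'])

def apply_processor(proc_fn, text, aux_input=None):
    # simplified emulation of how a TextProcessor might call proc_fn
    if aux_input:
        return proc_fn(text, aux_input)
    return proc_fn(text)

utt_text = "I agree, and I think you are right."
print(apply_processor(word_count, utt_text))  # 8
print(apply_processor(count_keywords, utt_text,
                      {'keywords': ['agree', 'think']}))  # 2
```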

class convokit.text_processing.textParser.TextParser(output_field='parsed', input_field=None, mode='parse', input_filter=<function TextParser.<lambda>>, spacy_nlp=None, sent_tokenizer=None, verbosity=0)

Transformer that dependency-parses each Utterance in a Corpus. This parsing step is a prerequisite for some of the models included in ConvoKit.

By default, will perform the following:

  • tokenize words and sentences

  • POS-tags words

  • dependency-parses sentences

However, the transformer also supports tokenizing only, or tokenizing and tagging only. In those modes, sentence splitting is performed with nltk's sentence tokenizer rather than spaCy (since spaCy requires dependency parses in order to split sentences).

Parses are stored as json-serializable objects, consisting of a list of parses of each sentence, where each sentence-level parse is a dict containing:

  • toks: a list of tokens in the sentence.

  • rt: the index of the root of the dependency parse, in the list of tokens.

Each token, in turn, is a dict containing:

  • tok: the token's text.

  • tag: the POS tag (present if tagging is on).

  • dep: the dependency between the token and its parent ('ROOT' if the token is the root; present if parsing is on).

  • up: the index of the token's parent in the sentence (absent for the root token).

  • dn: the indices of the token's children in the sentence.

Note that in principle, this data structure is readily extensible – arbitrary fields could be added to sentences and tokens (e.g., to support NER).
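As a concrete illustration, here is a hand-built parse for the one-sentence utterance "You are right ." in the format described above. The tags and dependency labels shown are plausible but hand-constructed; actual values depend on the spaCy model used.

```python
# Hand-built example of the stored parse format for "You are right ." --
# a list with one sentence-level dict. Tags and dependency labels are
# illustrative, not captured spaCy output.
parse = [
    {
        'rt': 1,  # index of the root token ("are") in toks
        'toks': [
            {'tok': 'You',   'tag': 'PRP', 'dep': 'nsubj', 'up': 1, 'dn': []},
            {'tok': 'are',   'tag': 'VBP', 'dep': 'ROOT',  'dn': [0, 2, 3]},
            {'tok': 'right', 'tag': 'JJ',  'dep': 'acomp', 'up': 1, 'dn': []},
            {'tok': '.',     'tag': '.',   'dep': 'punct', 'up': 1, 'dn': []},
        ],
    }
]

sent = parse[0]
root = sent['toks'][sent['rt']]          # note: root has no 'up' key
print(root['tok'])                       # are
print([sent['toks'][i]['tok'] for i in root['dn']])  # ['You', 'right', '.']
```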

Parameters
  • output_field – name of the attribute to write the parse to; defaults to 'parsed'.

  • input_field – name of the field to use as input. The field must contain a string; defaults to utterance.text.

  • mode – by default set to 'parse', which runs the entire parsing pipeline. If set to 'tag', only tokenizing and tagging are run; if set to 'tokenize', only tokenizing is run.

  • input_filter – a boolean function of signature input_filter(utterance, aux_input). Parses will only be computed for utterances where input_filter returns True; by default it always returns True, meaning that parses are computed for all utterances.

  • spacy_nlp – if provided, will use this spaCy object to do parsing; otherwise will initialize one via spacy.load('en').

  • sent_tokenizer – if provided, will use this sentence tokenizer; otherwise will initialize nltk’s sentence tokenizer.

  • verbosity – frequency of status messages.

convokit.text_processing.textParser.process_text(text, mode='parse', sent_tokenizer=None, spacy_nlp=None)

Stand-alone function that computes the dependency parse of a string.

Parameters
  • text – string to parse

  • mode – 'parse', 'tag', or 'tokenize'

  • sent_tokenizer – if provided, use this sentence tokenizer

  • spacy_nlp – if provided, use this spacy object

Returns

the parse, in json-serializable form.
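To make the mode differences concrete, the hand-built token dicts below follow the field descriptions above (tag present only when tagging is on; dep, up, and dn only when parsing is on). They are illustrative, not captured output, and the exact keys emitted per mode should be verified against the installed version.

```python
# Hand-built sketches of token-level output for each mode, per the field
# descriptions in this document; illustrative only, not captured output.
tok_tokenize = {'tok': 'right'}                               # mode='tokenize'
tok_tag      = {'tok': 'right', 'tag': 'JJ'}                  # mode='tag'
tok_parse    = {'tok': 'right', 'tag': 'JJ', 'dep': 'acomp',
                'up': 1, 'dn': []}                            # mode='parse'

for tok in (tok_tokenize, tok_tag, tok_parse):
    print(sorted(tok.keys()))
```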

class convokit.text_processing.textToArcs.TextToArcs(output_field, input_field='parsed', use_start=True, root_only=False, follow_deps=('conj', ), filter_fn=<function _use_text>, input_filter=<function TextToArcs.<lambda>>, verbosity=0)

Transformer that outputs a collection of arcs in the dependency parses of each sentence of an utterance. The returned collection is a list where each element corresponds to a sentence in the utterance. Each sentence is represented in terms of its arcs, in a space-separated string.

Each arc, in turn, can be read as follows:

  • x_y means that x is the parent and y is the child token (e.g., agree_does = agree --> does)

  • x_* means that x is a token with at least one descendant, which we do not resolve (this is analogous to bigrams backing off to unigrams)

  • x>y means that x and y are the first two tokens in the sentence

  • x>* means that x is the first token in the sentence.
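As a hand-worked example of this notation, consider the sentence "I agree it does", assuming a dependency parse in which agree is the root, I and does attach to agree, and it attaches to does. The set below is written out by hand to show the notation (with tokens lowercased); the exact string the transformer produces, and its ordering, depend on the parse and the transformer's settings.

```python
# Hand-worked arcs for "I agree it does", assuming the parse:
# agree = root; I -> agree (nsubj); does -> agree (ccomp); it -> does (nsubj).
# Tokens are lowercased here for illustration.
arcs = {
    'agree_i', 'agree_does', 'does_it',  # x_y: parent_child pairs
    'agree_*', 'does_*',                 # x_*: parents with unresolved descendants
    'i>agree',                           # x>y: first two tokens of the sentence
    'i>*',                               # x>*: first token of the sentence
}
print(' '.join(sorted(arcs)))
```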

Parameters
  • output_field – name of attribute to write arcs to.

  • input_field – name of field to use as input. defaults to ‘parsed’, which stores dependency parses as returned by the TextParser transformer; otherwise expects similarly-formatted input.

  • use_start – whether to also output the sentence-start arcs x>* and x>y (covering the first and first two tokens of the sentence). Defaults to True.

  • root_only – whether to return only the arcs from the root of the dependency parse. defaults to False.

  • follow_deps – if root_only is set to True, will nonetheless examine subtrees coming out of any dependency listed in follow_deps; by default follows 'conj' dependencies (hence examining the parts of a sentence following conjunctions like "and").

  • filter_fn – a boolean function determining which tokens to use; an arc is included only if filter_fn returns True for every token in the arc. The function has signature filter_fn(token, sent), where token and sent are formatted as in the output of TextParser. By default, uses tokens that consist only of alphabetic characters, or that are alphabetic after the first character (allowing for apostrophe-initial contraction tokens like 're): i.e., tok['tok'].isalpha() or tok['tok'][1:].isalpha().

  • input_filter – a boolean function of signature input_filter(utterance, aux_input). Arcs will only be computed for utterances where input_filter returns True; by default it always returns True, meaning that arcs are computed for all utterances.

  • verbosity – frequency of status messages.

convokit.text_processing.textToArcs.get_arcs_per_message(message, use_start=True, root_only=False, follow_deps=('conj', ), filter_fn=<function _use_text>)

Stand-alone function that returns the arcs of parsed text.

Parameters
  • message – parse to extract arcs from

  • use_start – whether to also output the sentence-start arcs x>* and x>y (covering the first and first two tokens of the sentence). Defaults to True.

  • root_only – whether to return only the arcs from the root of the dependency parse. defaults to False.

  • follow_deps – if root_only is set to True, will nonetheless examine subtrees coming out of any dependency listed in follow_deps; by default follows 'conj' dependencies (hence examining the parts of a sentence following conjunctions like "and").

  • filter_fn – a boolean function determining which tokens to use; an arc is included only if filter_fn returns True for every token in the arc. The function has signature filter_fn(token, sent), where token and sent are formatted as in the output of TextParser. By default, uses tokens that consist only of alphabetic characters, or that are alphabetic after the first character (allowing for apostrophe-initial contraction tokens like 're): i.e., tok['tok'].isalpha() or tok['tok'][1:].isalpha().

Returns

a list where each element corresponds to a sentence in the input message. Each sentence is represented in terms of its arcs, in a space-separated string.
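As a rough sketch of what this extraction looks like over the parse format documented above (with use_start=True and root_only=False; follow_deps and filter_fn are omitted for brevity, and tokens are lowercased as an assumption), the simplified function below is an illustration, not ConvoKit's implementation:

```python
# Simplified, pure-Python sketch of arc extraction over the documented parse
# format (use_start=True, root_only=False; follow_deps/filter_fn omitted).
# Illustration only -- not ConvoKit's implementation.
def get_arcs_simplified(message):
    sentences = []
    for sent in message:
        toks = [t['tok'].lower() for t in sent['toks']]  # lowercasing assumed
        arcs = set()
        for i, t in enumerate(sent['toks']):
            if t['dn']:                        # token with at least one child
                arcs.add(toks[i] + '_*')       # x_* back-off arc
                for j in t['dn']:
                    arcs.add(toks[i] + '_' + toks[j])  # x_y parent-child arc
        arcs.add(toks[0] + '>*')               # x>*: first token
        if len(toks) > 1:
            arcs.add(toks[0] + '>' + toks[1])  # x>y: first two tokens
        sentences.append(' '.join(sorted(arcs)))
    return sentences

# Hand-built parse of "I agree it does" (agree = root).
parse = [{'rt': 1, 'toks': [
    {'tok': 'I',     'dep': 'nsubj', 'up': 1, 'dn': []},
    {'tok': 'agree', 'dep': 'ROOT',           'dn': [0, 3]},
    {'tok': 'it',    'dep': 'nsubj', 'up': 3, 'dn': []},
    {'tok': 'does',  'dep': 'ccomp', 'up': 1, 'dn': [2]},
]}]
print(get_arcs_simplified(parse))
# ['agree_* agree_does agree_i does_* does_it i>* i>agree']
```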