TextParser

class convokit.text_processing.textParser.TextParser(output_field='parsed', input_field=None, mode='parse', input_filter=<function TextParser.<lambda>>, spacy_nlp=None, sent_tokenizer=None, verbosity=0)

Transformer that dependency-parses each Utterance in a Corpus. This parsing step is a prerequisite for some of the models included in ConvoKit.

By default, the transformer will:

  • tokenize words and sentences

  • POS-tag words

  • dependency-parse sentences

However, the transformer also supports lighter-weight processing: tokenizing only, or tokenizing and tagging only (see the sketch below). Processing is performed using SpaCy, together with nltk’s sentence tokenizer in the lighter-weight modes, since SpaCy requires a dependency parse in order to segment sentences.
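
For concreteness, a rough sketch of selecting each mode at construction time (the output field names below are illustrative, not required):

  from convokit import TextParser

  # default: run the full pipeline (tokenize, POS-tag, dependency-parse)
  full_parser = TextParser(output_field='parsed')

  # lighter-weight variants
  tagger = TextParser(output_field='tagged', mode='tag')          # tokenize and POS-tag only
  tokenizer = TextParser(output_field='tokens', mode='tokenize')  # tokenize only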

Parses are stored as json-serializable objects: a list of sentence-level parses, one per sentence, where each sentence-level parse is a dict containing:

  • toks: a list of tokens in the sentence.

  • rt: the index of the root of the dependency parse, in the list of tokens.

Each token, in turn, is a dict containing:

  • tok: the text

  • tag: the POS tag (if tagging is on)

  • dep: the dependency relation between the token and its parent (‘ROOT’ if the token is the root); only present if parsing is on.

  • up: the index of the token’s parent in the sentence; not present for root tokens.

  • dn: the indices of the children of the token in the sentence

Note that in principle, this data structure is readily extensible – arbitrary fields could be added to sentences and tokens (e.g., to support NER).
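
For concreteness, a parse of the one-sentence utterance “I saw the dog” might look roughly as follows; exact POS tags and dependency labels depend on the SpaCy model used, so treat this as illustrative:

  [
    {
      'rt': 1,
      'toks': [
        {'tok': 'I',   'tag': 'PRP', 'dep': 'nsubj', 'up': 1, 'dn': []},
        {'tok': 'saw', 'tag': 'VBD', 'dep': 'ROOT',           'dn': [0, 3]},
        {'tok': 'the', 'tag': 'DT',  'dep': 'det',   'up': 3, 'dn': []},
        {'tok': 'dog', 'tag': 'NN',  'dep': 'dobj',  'up': 1, 'dn': [2]}
      ]
    }
  ]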

Parameters
  • output_field – name of attribute to write parse to, defaults to ‘parsed’.

  • input_field – name of the field to use as input. The field must point to a string; defaults to the utterance text.

  • mode – by default set to “parse”, meaning the entire pipeline (tokenizing, tagging, parsing) is run. If set to “tag”, only tokenizing and tagging are run; if set to “tokenize”, only tokenizing is run.

  • input_filter – a boolean function of signature input_filter(utterance, aux_input). Parses will only be computed for utterances where input_filter returns True. By default, it always returns True, meaning parses are computed for all utterances.

  • spacy_nlp – if provided, will use this SpaCy object to do parsing; otherwise will initialize a default object via spacy.load(‘en’).

  • sent_tokenizer – if provided, will use this sentence tokenizer; otherwise will initialize nltk’s sentence tokenizer.

  • verbosity – frequency of status messages, i.e., a status update is printed every verbosity utterances processed; defaults to 0, which disables status messages.
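
A minimal end-to-end sketch of applying the transformer to a corpus (the toy corpus construction below assumes a recent ConvoKit version in which Utterance takes a speaker argument; adjust as needed):

  from convokit import Corpus, Speaker, Utterance, TextParser

  # build a toy corpus in memory
  corpus = Corpus(utterances=[
      Utterance(id='u0', speaker=Speaker(id='alice'),
                text='I saw the dog. It barked.')
  ])

  parser = TextParser()                 # writes to the default 'parsed' field
  corpus = parser.transform(corpus)

  # each utterance now carries a list of sentence-level parse dicts
  print(corpus.get_utterance('u0').meta['parsed'])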

convokit.text_processing.textParser.process_text(text, mode='parse', sent_tokenizer=None, spacy_nlp=None)

Stand-alone function that computes the dependency parse of a string.

Parameters
  • text – string to parse

  • mode – ‘parse’, ‘tag’, or ‘tokenize’

  • sent_tokenizer – if provided, use this sentence tokenizer

  • spacy_nlp – if provided, use this spacy object

Returns

the parse, in json-serializable form.
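
A minimal sketch of calling the stand-alone function; passing an explicit SpaCy object here is a conservative assumption, and the ‘en_core_web_sm’ model name depends on what is installed locally:

  import spacy
  from convokit.text_processing.textParser import process_text

  nlp = spacy.load('en_core_web_sm', disable=['ner'])
  parse = process_text('I saw the dog. It barked.', mode='parse', spacy_nlp=nlp)

  # print the root token of each sentence, e.g. 'saw' and 'barked'
  for sent in parse:
      print(sent['toks'][sent['rt']]['tok'])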