TextParser
class convokit.text_processing.textParser.TextParser(output_field='parsed', input_field=None, mode='parse', input_filter=<function TextParser.<lambda>>, spacy_nlp=None, sent_tokenizer=None, verbosity=0)
Transformer that dependency-parses each Utterance in a Corpus. This parsing step is a prerequisite for some of the models included in ConvoKit.
By default, the transformer performs the following steps:
tokenizes words and sentences
POS-tags words
dependency-parses sentences
It also supports tokenizing only, or tokenizing and tagging only. These steps are performed using SpaCy and nltk’s sentence tokenizer (since SpaCy requires dependency parses in order to tokenize sentences).
Parses are stored as JSON-serializable objects: a list with one parse per sentence, where each sentence-level parse is a dict containing:
toks: a list of tokens in the sentence.
rt: the index of the root of the dependency parse, in the list of tokens.
Each token, in turn, is a dict containing:
tok: the text of the token.
tag: the POS tag (available if tagging is on).
dep: the dependency between the token and its parent (‘ROOT’ if the token is the root); available if parsing is on.
up: the index of the parent of the token in the sentence; does not exist for root tokens.
dn: the indices of the children of the token in the sentence.
Note that in principle, this data structure is readily extensible – arbitrary fields could be added to sentences and tokens (e.g., to support NER).
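The structure described above can be illustrated with a small hand-written example. The sentence, tags, and dependency labels below are assumptions chosen for illustration, not actual parser output, and the helper functions `root_text` and `dependency_edges` are hypothetical names that are not part of ConvoKit.

```python
import json

# Hand-written illustration of the documented structure; NOT real parser output.
parsed = [
    {
        "rt": 2,  # index of the root token ("sleeps") within "toks"
        "toks": [
            {"tok": "The", "tag": "DT", "dep": "det", "up": 1, "dn": []},
            {"tok": "cat", "tag": "NN", "dep": "nsubj", "up": 2, "dn": [0]},
            # The root token has no "up" key.
            {"tok": "sleeps", "tag": "VBZ", "dep": "ROOT", "dn": [1, 3]},
            {"tok": ".", "tag": ".", "dep": "punct", "up": 2, "dn": []},
        ],
    }
]


def root_text(sentence):
    """Return the text of the root token of one sentence-level parse."""
    return sentence["toks"][sentence["rt"]]["tok"]


def dependency_edges(sentence):
    """Return (child, relation, parent) triples, skipping the root (no 'up' key)."""
    toks = sentence["toks"]
    return [(t["tok"], t["dep"], toks[t["up"]]["tok"]) for t in toks if "up" in t]


for sent in parsed:
    print(root_text(sent))
    print(dependency_edges(sent))

# The structure uses only lists, dicts, strings, and ints,
# so it survives a JSON round trip unchanged.
assert json.loads(json.dumps(parsed)) == parsed
```

Because root tokens lack an `up` key, code that walks parent links should test for the key rather than index it directly.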
- Parameters
output_field – name of the attribute to write the parse to. Defaults to ‘parsed’.
input_field – name of the field to use as input. The field must point to a string. Defaults to utterance.text.
mode – defaults to “parse”, which runs the entire parsing pipeline. If set to “tag”, only tokenizing and tagging are run; if set to “tokenize”, only tokenizing is run.
input_filter – a boolean function with signature input_filter(utterance, aux_input). Parses are computed only for utterances where input_filter returns True. The default always returns True, so parses are computed for all utterances.
spacy_nlp – if provided, this SpaCy object is used for parsing; otherwise an object is initialized via load(‘en’).
sent_tokenizer – if provided, this sentence tokenizer is used; otherwise nltk’s sentence tokenizer is initialized.
verbosity – frequency of status messages.
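As a rough sketch of how these parameters fit together, the following defines a filter with the documented input_filter(utterance, aux_input) signature and a helper that builds a tagging-only transformer. The names `has_min_words` and `build_tagging_parser`, the output field name 'tagged', and the treatment of aux_input as an optional dict are all assumptions made for illustration. The ConvoKit import is deferred into the helper so that the filter can be tried on stand-in objects without ConvoKit installed.

```python
from types import SimpleNamespace


def has_min_words(utterance, aux_input=None):
    """Boolean filter with the documented signature input_filter(utterance, aux_input).

    Assumption: aux_input, when given, is a dict that may carry 'min_words'.
    """
    min_words = (aux_input or {}).get("min_words", 3)
    return len(utterance.text.split()) >= min_words


def build_tagging_parser():
    """Build a TextParser that tokenizes and tags only the utterances passing the filter.

    Imported lazily; calling this requires ConvoKit and its SpaCy/nltk dependencies.
    """
    from convokit.text_processing.textParser import TextParser

    return TextParser(
        output_field="tagged",       # hypothetical field name for this sketch
        mode="tag",                  # tokenize and POS-tag, skip dependency parsing
        input_filter=has_min_words,
        verbosity=100,
    )


# Stand-in objects exposing only a .text attribute, used to try out the filter.
short = SimpleNamespace(text="ok")
longer = SimpleNamespace(text="this one has enough words")
print(has_min_words(short), has_min_words(longer))
```

Skipping short utterances in this way can reduce parsing time on large corpora, at the cost of leaving those utterances without the output field.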
convokit.text_processing.textParser.process_text(text, mode='parse', sent_tokenizer=None, spacy_nlp=None)
Stand-alone function that computes the dependency parse of a string.
- Parameters
text – string to parse
mode – ‘parse’, ‘tag’, or ‘tokenize’
sent_tokenizer – if provided, this sentence tokenizer is used
spacy_nlp – if provided, this SpaCy object is used
- Returns
The parse, in JSON-serializable form.
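A minimal sketch of calling process_text on several strings might look as follows. The wrapper `parse_strings` and the constant `VALID_MODES` are hypothetical names, not part of ConvoKit. The sketch assumes the caller supplies an already loaded SpaCy object (and, for the non-parsing modes, a sentence tokenizer) and passes both through explicitly rather than relying on defaults.

```python
VALID_MODES = ("parse", "tag", "tokenize")  # the three modes listed in the parameters


def parse_strings(texts, spacy_nlp, sent_tokenizer=None, mode="parse"):
    """Apply process_text to each string and return the list of parses.

    The mode is checked before importing ConvoKit so that a typo fails fast.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")

    # Imported lazily; this line requires ConvoKit to be installed.
    from convokit.text_processing.textParser import process_text

    return [
        process_text(
            text,
            mode=mode,
            sent_tokenizer=sent_tokenizer,
            spacy_nlp=spacy_nlp,
        )
        for text in texts
    ]
```

Each element of the returned list is expected to follow the sentence and token layout described for TextParser, so it can be written out with `json.dumps` without further conversion.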