TextCleaner

class convokit.text_processing.textCleaner.TextCleaner(text_cleaner: Optional[Callable[[str], str]] = None, input_field=None, input_filter=<function TextCleaner.<lambda>>, verbosity: int = 100, replace_text: bool = True, save_original: bool = True)

Transformer that cleans the text of utterances in an input Corpus. By default, the text cleaner assumes the text is in English. It fixes unicode errors, transliterates text to the closest ASCII representation, lowercases text, removes line breaks, and replaces URLs, emails, phone numbers, numbers, and currency symbols with special tokens.

This transformer can be configured with any custom text cleaning function that takes a string as input and returns the cleaned version of that string.
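As a minimal sketch of such a custom function, the block below defines a cleaner that only normalizes whitespace. The names `collapse_whitespace` and `build_cleaner` are hypothetical and chosen for illustration; the import path is taken from the class signature above and assumes ConvoKit is installed.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Hypothetical custom cleaner: collapse whitespace runs and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def build_cleaner():
    """Wrap the custom function in a TextCleaner (requires convokit)."""
    # Import path as shown in the class signature above.
    from convokit.text_processing.textCleaner import TextCleaner

    return TextCleaner(text_cleaner=collapse_whitespace)


# The cleaning function itself is plain Python and can be checked in isolation.
print(collapse_whitespace("  Hello,\n\n  world!  "))  # prints "Hello, world!"
```

Because the function matches the `Callable[[str], str]` type in the signature, it can be tested on its own before it is passed as `text_cleaner`.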

Parameters

text_cleaner – an optional function for cleaning text. If not provided, ConvoKit’s default text cleaner is used, as described above.
input_field – name of attribute to use as input. This attribute must point to a string, and defaults to utterance.text.
input_filter – a boolean function of signature input_filter(utterance, aux_input). Text cleaning will only be applied to utterances where input_filter returns True. The default filter always returns True, so all utterances are cleaned.
verbosity – frequency of status messages printed during processing. Defaults to 100.
replace_text – whether to replace the text being cleaned with the cleaned version. True by default. If False, the cleaned text is stored under attribute ‘cleaned’.
save_original – if replacing text, whether to save the original version of the text. If True, saves it under the ‘original’ attribute.
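The parameters above can be combined as in the following sketch. The filter `long_enough` and the helper `build_filtered_cleaner` are hypothetical names, the word-count threshold is an arbitrary choice, and the filter assumes only that an utterance exposes a `text` attribute. `SimpleNamespace` serves here as a stand-in for a real utterance so the filter can be tried without a corpus.

```python
from types import SimpleNamespace


def long_enough(utterance, aux_input=None) -> bool:
    """Hypothetical input_filter: accept utterances with more than three words."""
    return len(utterance.text.split()) > 3


def build_filtered_cleaner():
    """Configure a TextCleaner with the filter above (requires convokit)."""
    # Import path as shown in the class signature above.
    from convokit.text_processing.textCleaner import TextCleaner

    # replace_text=False keeps utterance text untouched; per the parameter
    # description, the cleaned text is then stored under 'cleaned'.
    return TextCleaner(input_filter=long_enough, replace_text=False, verbosity=1000)


# Stand-in objects with a .text attribute, used only to exercise the filter.
short_utt = SimpleNamespace(text="ok thanks")
long_utt = SimpleNamespace(text="this one has enough words")
print(long_enough(short_utt, None), long_enough(long_utt, None))  # prints "False True"
```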

transform(corpus: convokit.model.corpus.Corpus) → convokit.model.corpus.Corpus

Cleans the text of each utterance in the Corpus and stores the result as determined by the replace_text and save_original settings given in the constructor. Utterances that lack the input_field attribute specified in the constructor, or for which input_filter returns False, are left unchanged.

Parameters: corpus – Corpus
Returns: the corpus
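The documented interaction between input_filter, replace_text and save_original can be modelled in plain Python. This is a toy sketch and not ConvoKit's implementation: `ToyUtterance` and `toy_transform` are hypothetical names, and it assumes the 'original' and 'cleaned' attributes live in a per-utterance metadata dict.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyUtterance:
    """Hypothetical stand-in for an utterance: text plus a metadata dict."""
    text: str
    meta: Dict[str, str] = field(default_factory=dict)


def toy_transform(
    utterances: List[ToyUtterance],
    text_cleaner: Callable[[str], str],
    input_filter: Callable[[ToyUtterance, object], bool] = lambda utt, aux: True,
    replace_text: bool = True,
    save_original: bool = True,
) -> List[ToyUtterance]:
    """Model of the documented behaviour, under the assumptions stated above."""
    for utt in utterances:
        if not input_filter(utt, None):
            continue  # filtered-out utterances are left untouched
        cleaned = text_cleaner(utt.text)
        if replace_text:
            if save_original:
                utt.meta["original"] = utt.text  # keep the uncleaned text
            utt.text = cleaned
        else:
            utt.meta["cleaned"] = cleaned  # text itself is not modified
    return utterances


replaced = toy_transform([ToyUtterance("Hello World")], str.lower)
kept = toy_transform([ToyUtterance("Hello World")], str.lower, replace_text=False)
print(replaced[0].text, replaced[0].meta)  # hello world {'original': 'Hello World'}
print(kept[0].text, kept[0].meta)          # Hello World {'cleaned': 'hello world'}
```

With the defaults, the cleaned text overwrites the utterance text and the uncleaned version remains recoverable; with replace_text set to False, the utterance text is preserved and the cleaned version is stored alongside it.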