TextCleaner

class convokit.text_processing.textCleaner.TextCleaner(text_cleaner: Optional[Callable[[str], str]] = None, input_field=None, input_filter=<function TextCleaner.<lambda>>, verbosity: int = 100, replace_text: bool = True, save_original: bool = True)

Transformer that cleans the text of utterances in an input Corpus. By default, the text cleaner assumes the text is in English. It fixes Unicode errors, transliterates text to the closest ASCII representation, lowercases text, removes line breaks, and replaces URLs, emails, phone numbers, numbers, and currency symbols with special tokens.

This transformer can be configured with any custom text-cleaning function that takes a string as input and returns the cleaned version of that string.
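As an illustration of the kind of function that fits this contract, here is a minimal sketch of a custom cleaner. The function name and the cleaning rules (dropping HTML-like tags, collapsing whitespace) are assumptions chosen for the example, not part of ConvoKit.

```python
import re


def strip_tags_and_squeeze(text: str) -> str:
    """Hypothetical cleaner: drop HTML-like tags, then collapse whitespace."""
    # Remove anything that looks like an HTML tag, e.g. <p> or </b>.
    without_tags = re.sub(r"<[^>]+>", "", text)
    # Collapse runs of spaces, tabs, and line breaks into single spaces.
    return re.sub(r"\s+", " ", without_tags).strip()


print(strip_tags_and_squeeze("<p>Hello,\n  <b>world</b>!</p>"))  # prints: Hello, world!
```

A function like this could then be supplied through the text_cleaner argument, for example TextCleaner(text_cleaner=strip_tags_and_squeeze).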

Parameters
  • text_cleaner – an optional function for cleaning text. If omitted, ConvoKit’s default text cleaner is used, as described above.

  • input_field – name of attribute to use as input. This attribute must point to a string, and defaults to utterance.text.

  • input_filter – a boolean function of signature input_filter(utterance, aux_input). Text cleaning is applied only to utterances for which input_filter returns True. By default, the filter always returns True, so all utterances are cleaned.

  • verbosity – frequency of status messages, given as the number of utterances processed between messages

  • replace_text – whether to replace the text being cleaned with the cleaned version. True by default. If False, the cleaned text is stored under attribute ‘cleaned’.

  • save_original – if replacing text, whether to save the original version of the text. If True, saves it under the ‘original’ attribute.
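To make the input_filter signature concrete, the sketch below defines a filter that accepts only utterances with at least three words. The function name, the three-word threshold, and the SimpleNamespace stand-in for an utterance are assumptions made for the example. The second argument is accepted to match the documented signature but is not used.

```python
from types import SimpleNamespace


def has_enough_words(utterance, aux_input=None) -> bool:
    """Hypothetical filter: True only for utterances with at least 3 words."""
    # aux_input is part of the documented signature; this sketch ignores it.
    return len(utterance.text.split()) >= 3


# Stand-in objects with a .text attribute, used here in place of real utterances.
short_utt = SimpleNamespace(text="hi there")
long_utt = SimpleNamespace(text="this one is long enough")

print(has_enough_words(short_utt, None))  # prints: False
print(has_enough_words(long_utt, None))   # prints: True
```

Passed as input_filter=has_enough_words, a filter of this shape would leave shorter utterances uncleaned.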

transform(corpus: convokit.model.corpus.Corpus) → convokit.model.corpus.Corpus

Cleans the text of each utterance in the Corpus. If replace_text is True, the cleaned text replaces the input text, and the original is saved under the ‘original’ attribute when save_original is True. Otherwise, the cleaned text is stored under the ‘cleaned’ attribute. Utterances that lack the input_field attribute specified in the constructor, or for which input_filter returns False, are left unchanged.

Parameters

corpus – Corpus

Returns

the corpus
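The interaction between replace_text and save_original can be pictured with a small plain-Python model. This is a sketch of the documented behavior, not ConvoKit's implementation. The dict layout (a 'text' key plus a 'meta' dict holding the attributes) and the function name apply_cleaning are assumptions made for the example.

```python
def apply_cleaning(utterance: dict, cleaner, replace_text=True, save_original=True) -> dict:
    """Model the documented options on a dict with 'text' and 'meta' keys (assumed layout)."""
    cleaned = cleaner(utterance["text"])
    if replace_text:
        if save_original:
            # Keep the uncleaned text under the 'original' attribute.
            utterance["meta"]["original"] = utterance["text"]
        utterance["text"] = cleaned
    else:
        # Leave the text untouched and store the result under 'cleaned'.
        utterance["meta"]["cleaned"] = cleaned
    return utterance


def simple_cleaner(text: str) -> str:
    """Hypothetical cleaner used only for this demonstration."""
    return text.strip().lower()


replaced = apply_cleaning({"text": "  Hi THERE ", "meta": {}}, simple_cleaner)
kept = apply_cleaning({"text": "  Hi THERE ", "meta": {}}, simple_cleaner, replace_text=False)

print(replaced)
print(kept)
```

With the defaults, the model overwrites the text and keeps the original alongside it. With replace_text=False, the text is untouched and the cleaned version sits under ‘cleaned’.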