Bag-of-words Transformer

class convokit.bag_of_words.bow_transformer.BoWTransformer(obj_type: str, vector_name='bow_vector', text_func: Callable[[convokit.model.corpusComponent.CorpusComponent], str] = None, vectorizer=None)

Bag-of-Words Transformer for annotating a Corpus’s objects with the bag-of-words vectorization of some textual element of the Corpus components.

Runs on the Corpus’s Speakers, Utterances, or Conversations (as specified by obj_type). By default, the text used depends on the object type:

  • For utterances, the utterance text.

  • For conversations, the joined texts of all the utterances in the conversation.

  • For speakers, the joined texts of all the utterances by the speaker.

Custom text extraction can be configured using the text_func argument.

Compatible with any type of vectorizer (e.g. bag-of-words, TF-IDF, etc.).
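To make the default text selection and the role of text_func concrete, here is a minimal pure-Python sketch. It does not call ConvoKit: Utt, Convo, default_text, first_utterance_text and get_text are hypothetical stand-ins, and joining texts with a single space is an assumption rather than something stated above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Utt:
    """Hypothetical stand-in for an utterance: a speaker id and its text."""
    speaker: str
    text: str


@dataclass
class Convo:
    """Hypothetical stand-in for a conversation: an ordered list of utterances."""
    utterances: List[Utt] = field(default_factory=list)


def default_text(obj) -> str:
    """Mirror the documented defaults: own text for an utterance,
    joined utterance texts for a conversation (space separator assumed)."""
    if isinstance(obj, Utt):
        return obj.text
    return " ".join(u.text for u in obj.utterances)


def first_utterance_text(convo: Convo) -> str:
    """A custom text function: use only the opening utterance."""
    return convo.utterances[0].text if convo.utterances else ""


def get_text(obj, text_func: Optional[Callable] = None) -> str:
    """Use the custom function when given, otherwise fall back to the default."""
    return (text_func or default_text)(obj)


convo = Convo([Utt("alice", "hello there"), Utt("bob", "hi alice")])
print(get_text(convo))                        # joined text of both utterances
print(get_text(convo, first_utterance_text))  # text of the first utterance only
```

With the real transformer, the analogous customization would be a callable passed as text_func that takes a Corpus component and returns a string, matching the signature shown at the top of this page.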

Parameters
  • obj_type – “speaker”, “utterance”, or “conversation”

  • vectorizer – a sklearn vectorizer object; default is CountVectorizer(min_df=10, max_df=.5, ngram_range=(1, 1), binary=False, max_features=15000)

  • vector_name – name for the vector matrix generated in the transform() step

  • text_func – function for getting text from the Corpus component object. By default, this is configured based on the obj_type.
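The default vectorizer settings above filter the vocabulary by document frequency. The sketch below illustrates that filtering in pure Python, so it approximates rather than reproduces scikit-learn's CountVectorizer: build_vocabulary and vectorize are hypothetical helpers, the tie-breaking used for max_features is an assumption, and the thresholds are lowered (min_df=2) so that a four-document toy corpus keeps some terms.

```python
import re
from collections import Counter
from typing import Dict, List

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into unigram tokens."""
    return TOKEN.findall(text.lower())


def build_vocabulary(docs: List[str], min_df: int, max_df: float,
                     max_features: int) -> Dict[str, int]:
    """Keep terms that appear in at least min_df documents and in at most
    a max_df fraction of documents, capped at max_features terms."""
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term
    for doc in docs:
        tokens = tokenize(doc)
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    max_count = max_df * len(docs)
    kept = [t for t, df in doc_freq.items() if min_df <= df <= max_count]
    # Keep the most frequent terms; ties are broken alphabetically (assumption).
    kept = sorted(kept, key=lambda t: (-term_freq[t], t))[:max_features]
    return {term: i for i, term in enumerate(sorted(kept))}


def vectorize(doc: str, vocab: Dict[str, int], binary: bool = False) -> List[int]:
    """Count vocabulary terms in doc; with binary=True, record presence only."""
    counts = Counter(tokenize(doc))
    row = [0] * len(vocab)
    for term, idx in vocab.items():
        row[idx] = min(counts[term], 1) if binary else counts[term]
    return row


docs = ["the cat sat", "the dog sat", "the cat ran", "the bird flew away"]
vocab = build_vocabulary(docs, min_df=2, max_df=0.5, max_features=15000)
print(vocab)  # "the" is dropped (in every document); rare terms are dropped too
print(vectorize("the cat and the other cat sat", vocab))
```

On a small corpus, the default min_df=10 can remove every term, so passing a vectorizer with lower thresholds may be necessary.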

fit(corpus: convokit.model.corpus.Corpus, y=None, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function BoWTransformer.<lambda>>)

Fit the Transformer’s internal vectorizer on the Corpus objects’ texts, with an optional selector that chooses which objects to fit on.

Parameters
  • corpus – the target Corpus

  • selector – a (lambda) function that takes a Corpus component object and returns True or False (i.e. include / exclude). By default, the selector includes all objects of the specified type in the Corpus.

Returns

the fitted BoWTransformer
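The effect of the selector during fitting can be sketched as follows. This is a pure-Python illustration, not ConvoKit code: fit_vocabulary is a hypothetical helper, and the dictionaries with a "split" key are assumed stand-ins for Corpus components.

```python
from typing import Callable, Dict, List, Set


def fit_vocabulary(objects: List[Dict[str, str]],
                   selector: Callable[[Dict[str, str]], bool] = lambda obj: True) -> Set[str]:
    """Collect vocabulary only from objects for which selector returns True."""
    vocab: Set[str] = set()
    for obj in objects:
        if selector(obj):
            vocab.update(obj["text"].lower().split())
    return vocab


utterances = [
    {"id": "u1", "text": "great point", "split": "train"},
    {"id": "u2", "text": "thanks everyone", "split": "train"},
    {"id": "u3", "text": "unseen words", "split": "test"},
]

# Default selector: every object contributes to the vocabulary.
all_vocab = fit_vocabulary(utterances)

# Restrict fitting to the training split so test-only words stay out.
train_vocab = fit_vocabulary(utterances, selector=lambda u: u["split"] == "train")

print(sorted(all_vocab))
print(sorted(train_vocab))
```

Fitting on a training subset and transforming the remaining objects later keeps held-out text from influencing the vocabulary.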

fit_transform(corpus: convokit.model.corpus.Corpus, y=None, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function BoWTransformer.<lambda>>) → convokit.model.corpus.Corpus

Fit the Transformer’s internal vectorizer on the Corpus component objects’ texts, then compute vector representations for them and store them in the Corpus under vector_name.

Parameters
  • corpus – the target Corpus

  • selector – a (lambda) function that takes a Corpus component object and returns True or False (i.e. include / exclude). By default, the selector includes all objects of the specified type in the Corpus.

Returns

the Corpus with the computed vector matrix stored in it

get_vocabulary()

Get the vocabulary of the vectorizer object

transform(corpus: convokit.model.corpus.Corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function BoWTransformer.<lambda>>) → convokit.model.corpus.Corpus

Computes the vector matrix for the Corpus component objects and then stores it in a ConvoKitMatrix object, which is saved in the Corpus as vector_name.

Parameters
  • corpus – the target Corpus

  • selector – a (lambda) function that takes a Corpus component object and returns True or False (i.e. include / exclude). By default, the selector includes all objects of the specified type in the Corpus.

Returns

the target Corpus, annotated with the vector matrix
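As a rough picture of the transform step, the sketch below computes one count row per selected object and stores the rows, their object ids and the column names together under vector_name. Everything here is a hypothetical stand-in: the dictionary-based corpus, the "vectors" key and the stored layout only imitate the idea of a matrix object saved in the Corpus, and do not reflect ConvoKitMatrix's actual structure.

```python
from collections import Counter
from typing import Callable, Dict, List


def transform(corpus: Dict, vocab: List[str], vector_name: str = "bow_vector",
              selector: Callable[[Dict], bool] = lambda obj: True) -> Dict:
    """Vectorize the selected objects and store the result under vector_name."""
    ids, rows = [], []
    for obj in corpus["utterances"]:
        if not selector(obj):
            continue  # excluded objects get no row in the matrix
        counts = Counter(obj["text"].lower().split())
        ids.append(obj["id"])
        rows.append([counts[term] for term in vocab])
    # Stand-in for the stored matrix: rows plus their row and column labels.
    corpus["vectors"][vector_name] = {"ids": ids, "columns": vocab, "matrix": rows}
    return corpus


corpus = {
    "utterances": [
        {"id": "u1", "text": "good point good"},
        {"id": "u2", "text": "bad point"},
    ],
    "vectors": {},
}
vocab = ["bad", "good", "point"]  # assumed to come from an earlier fit step

transform(corpus, vocab)
print(corpus["vectors"]["bow_vector"]["matrix"])

transform(corpus, vocab, vector_name="bow_u1_only", selector=lambda u: u["id"] == "u1")
print(corpus["vectors"]["bow_u1_only"]["ids"])
```

Keeping the object ids alongside the rows is what lets a row be traced back to the component it describes.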