Bag-of-words Transformer
class convokit.bag_of_words.bow_transformer.BoWTransformer(obj_type: str, vector_name='bow_vector', text_func: Callable[[convokit.model.corpusComponent.CorpusComponent], str] = None, vectorizer=None)
Bag-of-Words Transformer for annotating a Corpus’s objects with the bag-of-words vectorization of some textual element of the Corpus components.
Runs on the Corpus’s Speakers, Utterances, or Conversations (as specified by obj_type). By default, the text used for each object type is:
For utterances, the utterance text.
For conversations, the joined texts of all the utterances in the conversation.
For speakers, the joined texts of all the utterances by the speaker.
Other text configurations can be specified using the text_func argument.
Compatible with any type of vectorizer (e.g. bag-of-words, TF-IDF, etc.).
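The default text choices above can be illustrated with a minimal plain-Python sketch. It does not use ConvoKit: the dictionaries are hypothetical stand-ins for Utterance objects, the helper name default_texts is made up for illustration, and joining with a single space is an assumption rather than something this page states.

```python
from collections import defaultdict

# Hypothetical stand-in records; real ConvoKit Utterances are richer objects.
utterances = [
    {"id": "u1", "speaker": "alice", "conversation_id": "c1", "text": "hello there"},
    {"id": "u2", "speaker": "bob", "conversation_id": "c1", "text": "hi alice"},
    {"id": "u3", "speaker": "alice", "conversation_id": "c2", "text": "good morning"},
]


def default_texts(utterances, obj_type):
    """Return {object id: text} following the defaults described above."""
    if obj_type == "utterance":
        return {u["id"]: u["text"] for u in utterances}
    # Conversations and speakers both collect the texts of their utterances.
    key = {"conversation": "conversation_id", "speaker": "speaker"}[obj_type]
    grouped = defaultdict(list)
    for u in utterances:
        grouped[u[key]].append(u["text"])
    # Assumption: texts are joined with a single space.
    return {obj_id: " ".join(texts) for obj_id, texts in grouped.items()}


speaker_texts = default_texts(utterances, "speaker")
conversation_texts = default_texts(utterances, "conversation")
```

A custom text_func would replace this grouping step with any function that maps one component to a string.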
- Parameters
obj_type – “speaker”, “utterance”, or “conversation”
vectorizer – a sklearn vectorizer object; default is CountVectorizer(min_df=10, max_df=.5, ngram_range=(1, 1), binary=False, max_features=15000)
vector_name – name for the vector matrix generated in the transform() step
text_func – function for getting text from the Corpus component object. By default, this is configured based on the obj_type.
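To give a feel for what the default vectorizer settings do, here is a rough plain-Python sketch of vocabulary selection by document frequency. It is not scikit-learn's implementation. The function build_vocabulary and its tie-breaking rule are assumptions made for illustration. It follows the usual reading of these options: an integer min_df is a minimum document count, a float max_df is a maximum proportion of documents, and max_features keeps the most frequent terms.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters


def tokenize(text):
    return TOKEN.findall(text.lower())


def build_vocabulary(docs, min_df=1, max_df=1.0, max_features=None):
    """Pick vocabulary terms using document-frequency cutoffs.

    min_df is an absolute document count; max_df is a proportion of documents.
    """
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term
    for doc in docs:
        tokens = tokenize(doc)
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    limit = max_df * len(docs)
    kept = [t for t, df in doc_freq.items() if min_df <= df <= limit]
    if max_features is not None:
        # Assumption: ties in term frequency are broken alphabetically.
        kept = sorted(kept, key=lambda t: (-term_freq[t], t))[:max_features]
    return sorted(kept)


docs = ["the cat sat", "the dog sat", "the bird flew", "the cat flew"]
vocab = build_vocabulary(docs, min_df=2, max_df=0.5)
```

Here "the" is dropped for appearing in more than half of the documents, and "dog" and "bird" for appearing in fewer than two. With the documented defaults (min_df=10), a corpus this small would end up with an empty vocabulary, so small corpora likely need a custom vectorizer.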
fit(corpus: convokit.model.corpus.Corpus, y=None, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function BoWTransformer.<lambda>>)
Fit the Transformer’s internal vectorizer on the Corpus objects’ texts, with an optional selector that chooses which objects to fit on.
- Parameters
corpus – the target Corpus
selector – a (lambda) function that takes a Corpus component object and returns True or False (i.e. include / exclude). By default, the selector includes all objects of the specified type in the Corpus.
- Returns
the fitted BoWTransformer
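A selector is simply a predicate over components. The sketch below shows the idea in plain Python without ConvoKit. The SimpleNamespace objects are hypothetical stand-ins that model only a text attribute and a meta dictionary, and the helper names are invented for illustration.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for Corpus components; only the attributes used here are modelled.
utts = [
    SimpleNamespace(id="u1", text="thanks", meta={"score": 5}),
    SimpleNamespace(id="u2", text="I think this proposal needs more work", meta={"score": 1}),
    SimpleNamespace(id="u3", text="could you clarify the second point", meta={"score": 3}),
]


def select_all(obj):
    """Mirror the described default: include every object."""
    return True


def long_enough(utt):
    """Example selector: keep utterances with at least three words."""
    return len(utt.text.split()) >= 3


def select(objects, selector=select_all):
    """Return the objects for which the selector returns True."""
    return [obj for obj in objects if selector(obj)]


fit_ids = [u.id for u in select(utts, long_enough)]
```

An equivalent inline form would be a lambda such as `lambda utt: len(utt.text.split()) >= 3`, passed as the selector argument.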
fit_transform(corpus: convokit.model.corpus.Corpus, y=None, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function BoWTransformer.<lambda>>) → convokit.model.corpus.Corpus
Fit the Transformer’s internal vectorizer on the Corpus component objects’ texts, then compute vector representations for them and store these in the Corpus as vector_name.
- Parameters
corpus – the target Corpus
selector – a (lambda) function that takes a Corpus component object and returns True or False (i.e. include / exclude). By default, the selector includes all objects of the specified type in the Corpus.
- Returns
the Corpus with the computed vector matrix stored in it
get_vocabulary()
Get the vocabulary of the vectorizer object.
transform(corpus: convokit.model.corpus.Corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function BoWTransformer.<lambda>>) → convokit.model.corpus.Corpus
Computes the vector matrix for the Corpus component objects and then stores it in a ConvoKitMatrix object, which is saved in the Corpus as vector_name.
- Parameters
corpus – the target Corpus
selector – a (lambda) function that takes a Corpus component object and returns True or False (i.e. include / exclude). By default, the selector includes all objects of the specified type in the Corpus.
- Returns
the annotated target Corpus
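Conceptually, the transform step produces one row per selected object and one column per vocabulary term. The following plain-Python sketch builds such a count matrix. It is only an illustration: count_matrix is a made-up helper, whitespace tokenization is a simplification, and the row and column layout is an assumption about how the stored matrix is organised, not a description of ConvoKitMatrix internals.

```python
from collections import Counter


def count_matrix(texts_by_id, vocabulary):
    """Return (ids, rows), where rows[i][j] counts vocabulary[j] in the text of ids[i]."""
    ids = list(texts_by_id)
    rows = []
    for obj_id in ids:
        # Simplification: lowercase and split on whitespace.
        counts = Counter(texts_by_id[obj_id].lower().split())
        rows.append([counts[term] for term in vocabulary])
    return ids, rows


vocabulary = ["cat", "flew", "sat"]
texts_by_id = {"u1": "the cat sat", "u2": "the cat flew and flew"}
ids, matrix = count_matrix(texts_by_id, vocabulary)
```

Terms outside the fitted vocabulary (here "the" and "and") contribute nothing to the matrix, which is why transforming objects that were excluded from fitting still yields vectors of the same width.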