Column normalized Tf-Idf

Implements a modified Tf-Idf transformer that normalizes by columns (i.e., term-wise).

class convokit.expected_context_framework.col_normed_tfidf.ColNormedTfidf(**kwargs)

Model that derives tf-idf reweighted representations of utterances, which are normalized by column. Can be used in ConvoKit through the ColNormedTfidfTransformer transformer; see the documentation of that transformer for further details.

class convokit.expected_context_framework.col_normed_tfidf.ColNormedTfidfTransformer(input_field, output_field='col_normed_tfidf', model=None, **kwargs)

Transformer that derives tf-idf reweighted representations of utterances, which are normalized by column, i.e., per term. This may be helpful in deriving downstream representations that are less sensitive to relative term frequency; for instance, it could be used to derive input representations to ExpectedContextModelWrapper.
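To make "normalized by column" concrete, the following is a minimal pure-Python sketch and not ConvoKit's implementation. It assumes a smoothed idf formula and unit L2 scaling per term; the helper name col_normed_tfidf and the toy documents are hypothetical. The point it illustrates is that each term's column is scaled independently, so a very frequent term such as "the" ends up on the same scale as a rare one.

```python
import math
from collections import Counter


def col_normed_tfidf(docs):
    """Return (vocabulary, matrix) where every column (term) has unit L2 norm.

    Illustrative only: the idf formula and the choice of L2 norm are assumptions.
    """
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed idf (assumed formulation).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    # Raw tf-idf matrix: one row per document, one column per term.
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        rows.append([tf[t] * idf[t] for t in vocab])

    # Normalize each column (term) rather than each row (document).
    col_norms = [
        math.sqrt(sum(row[j] ** 2 for row in rows)) for j in range(len(vocab))
    ]
    normed = [
        [row[j] / col_norms[j] if col_norms[j] else 0.0 for j in range(len(vocab))]
        for row in rows
    ]
    return vocab, normed


docs = ["the cat sat", "the dog sat", "the the the bird"]
vocab, matrix = col_normed_tfidf(docs)
```

By contrast, standard tf-idf pipelines usually normalize each row, which keeps documents comparable but leaves frequent terms with larger overall weight.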

Parameters

input_field – the name of the attribute of utterances to use as input to fit. Note that unless token_pattern is specified as an additional argument, this attribute must be a string consisting of whitespace-separated features.
output_field – the name of the attribute to write to in the transform step.
model – optional, an existing ColNormedTfidfTransformer
kwargs – other keyword arguments used to initialize the underlying TfidfVectorizer from scikit-learn; see that documentation for details.
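The requirement on input_field matters because features often contain punctuation. The sketch below uses only the standard re module to compare two tokenization patterns on one feature string. The first pattern is scikit-learn's documented default for TfidfVectorizer; the whitespace pattern and the example features are assumptions chosen for illustration, not necessarily what ConvoKit uses internally.

```python
import re

# scikit-learn's documented default token_pattern for TfidfVectorizer.
SKLEARN_DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Hypothetical whitespace-based pattern, used here only for comparison.
WHITESPACE_PATTERN = r"\S+"

# A feature string in the expected format: features separated by whitespace.
feature_string = "do_you you>* i"

whitespace_tokens = re.findall(WHITESPACE_PATTERN, feature_string)
default_tokens = re.findall(SKLEARN_DEFAULT_PATTERN, feature_string)

print(whitespace_tokens)  # every whitespace-separated feature is kept intact
print(default_tokens)     # punctuation is stripped and one-character tokens are dropped
```

If the features are ordinary words, a custom token_pattern passed through kwargs may be more appropriate than pre-joining tokens with spaces.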

dump(dirname)

Dumps model to disk.

Parameters: dirname – directory to write to
Returns: None

fit(corpus, y=None, selector=<function ColNormedTfidfTransformer.<lambda>>)

Fits a transformer over training data.

Parameters

corpus – Corpus
selector – which utterances to fit the transformer over. A boolean function of the form filter(utterance); by default it returns True for every utterance (i.e., all utterances are used).
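A selector is simply a function that takes an utterance and returns a boolean. The sketch below applies one to hypothetical stand-in objects built with SimpleNamespace rather than real ConvoKit utterances; the "split" metadata key and the three-token threshold are assumptions made for illustration.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for utterances, exposing only id, text and meta.
utterances = [
    SimpleNamespace(id="u1", text="how are you doing today", meta={"split": "train"}),
    SimpleNamespace(id="u2", text="ok", meta={"split": "train"}),
    SimpleNamespace(id="u3", text="see you tomorrow then", meta={"split": "test"}),
]


def train_selector(utt):
    """Keep training utterances with at least three whitespace-separated tokens."""
    return utt.meta.get("split") == "train" and len(utt.text.split()) >= 3


def default_selector(utt):
    """Equivalent in spirit to the default: accept every utterance."""
    return True


selected_ids = [utt.id for utt in utterances if train_selector(utt)]
all_ids = [utt.id for utt in utterances if default_selector(utt)]
```

A function like train_selector could then be passed as the selector argument so that only the chosen utterances contribute to fitting.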

Returns

None

fit_transform(corpus, y=None, selector=<function ColNormedTfidfTransformer.<lambda>>)

Fit and run the Transformer on a single Corpus.

Parameters: corpus – the Corpus to use
Returns: same as transform

get_vocabulary()

Returns: array of feature names

load(dirname)

Loads model from disk.

Parameters: dirname – directory to load from
Returns: None

transform(corpus, selector=<function ColNormedTfidfTransformer.<lambda>>)

Computes column-normalized tf-idf representations for utterances in a corpus, stored in the corpus as <output_field>. Also annotates each utterance with a metadata field, <output_field>__n_feats, indicating the number of terms in the vocabulary that utterance contains.

Parameters

corpus – Corpus
selector – which utterances to transform

Returns

corpus, with per-utterance representations and vocabulary counts
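As a rough illustration of the vocabulary count annotation, the sketch below counts how many distinct vocabulary terms a whitespace-separated feature string contains. The helper count_vocab_terms is hypothetical, and counting distinct terms rather than total occurrences is an assumption about how the count is defined.

```python
def count_vocab_terms(feature_string, vocabulary):
    """Count distinct vocabulary terms present in a whitespace-separated string."""
    vocab = set(vocabulary)
    return len({tok for tok in feature_string.split() if tok in vocab})


# Hypothetical fitted vocabulary and input utterance text.
vocabulary = ["cat", "dog", "sat", "the"]
n_feats = count_vocab_terms("the cat sat on the mat", vocabulary)
empty_feats = count_vocab_terms("", vocabulary)
```

A count of zero signals an utterance with no in-vocabulary terms, whose representation would be all zeros and which downstream steps may want to filter out.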

transform_utterance(utt)

Computes tf-idf representations for a single utterance. The representation is stored in the utterance as <output_field>__vect; the number of vocabulary terms that the utterance contains is stored as <output_field>__n_feats.

Parameters: utt – Utterance
Returns: utterance, with representation and vocabulary count