VectorClassifier

Example usage: bag-of-words classification.

class convokit.classifier.vectorClassifier.VectorClassifier(obj_type: str, vector_name: str, columns: List[str] = None, labeller: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function VectorClassifier.<lambda>>, clf=None, clf_attribute_name: str = 'prediction', clf_prob_attribute_name: str = 'pred_score')

Transformer that trains a classifier on a vector representation of the Corpus components’ text (e.g. bag-of-words or TF-IDF).

Corpus must have a vector with the specified vector_name.

Inherits from Classifier and has access to its methods.

Parameters

obj_type – “speaker”, “utterance”, or “conversation”
vector_name – the metadata key where the Corpus object text vector is stored
columns – list of column names of vector matrix to use; uses all columns by default.
labeller – a (lambda) function that takes a Corpus component object and returns True (y=1) or False (y=0); i.e. the labeller defines the y value of each object for fitting
clf – an sklearn classifier. By default, clf is a Pipeline with StandardScaler and LogisticRegression
clf_attribute_name – the metadata attribute name to store the classifier prediction value under; default: “prediction”
clf_prob_attribute_name – the metadata attribute name to store the classifier prediction score under; default: “pred_score”
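
As a rough illustration of the labeller contract described above, the sketch below maps each object to y=1 or y=0. It does not construct a real VectorClassifier: `FakeComponent` is a hypothetical stand-in class (not part of ConvoKit), and the metadata key `is_question` is an assumption made only for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FakeComponent:
    """Hypothetical stand-in for a Corpus component (not a ConvoKit class)."""
    id: str
    meta: Dict[str, object] = field(default_factory=dict)


# A labeller takes one object and returns True (y=1) or False (y=0).
# The metadata key "is_question" is an assumption made for this sketch.
labeller: Callable[[FakeComponent], bool] = lambda obj: bool(obj.meta.get("is_question", False))

objs: List[FakeComponent] = [
    FakeComponent("u1", {"is_question": True}),
    FakeComponent("u2", {"is_question": False}),
    FakeComponent("u3"),  # a missing key is treated as False in this sketch
]

# Convert the boolean labels into the 0/1 targets used for fitting.
y = [int(labeller(obj)) for obj in objs]
print(y)
```

Treating a missing key as False is a choice made for the sketch; a labeller for real data may need to handle missing metadata differently.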

accuracy(corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function Classifier.<lambda>>)

Calculate the accuracy of the classification

Parameters

corpus – target Corpus
selector – (lambda) function selecting objects to include in this accuracy calculation; uses all objects by default

Returns

float value

base_accuracy(corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function Classifier.<lambda>>)

Get the base accuracy, i.e. the larger of the proportions of objects with y=1 and with y=0. This is the accuracy of always predicting the more common label.

Parameters

corpus – the classified Corpus
selector – (lambda) function selecting objects to include in this accuracy calculation; uses all objects by default

Returns

float value
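
The relationship between base accuracy and accuracy can be sketched in plain Python. This is a minimal illustration, not the library's code: it assumes the baseline is computed over the true labels, and the helper names and sample labels are made up for the example.

```python
from typing import List


def base_accuracy(y_true: List[int]) -> float:
    """Return the larger of the shares of y=1 and y=0 labels."""
    if not y_true:
        raise ValueError("y_true must not be empty")
    share_positive = sum(y_true) / len(y_true)
    return max(share_positive, 1 - share_positive)


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Return the fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Made-up labels: two positives out of eight objects.
y_true = [1, 0, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]

print(base_accuracy(y_true))     # share of the majority class
print(accuracy(y_true, y_pred))  # compare against the baseline above
```

A classifier is only informative if its accuracy exceeds the baseline, which is why the two values are usually reported together.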

classification_report(corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function Classifier.<lambda>>)

Generate classification report for transformed corpus using labeller for y_true and clf_attribute_name as y_pred

Parameters

corpus – target Corpus
selector – (lambda) function selecting objects to include in this classification report; uses all objects by default

Returns

classification report

confusion_matrix(corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function Classifier.<lambda>>)

Generate confusion matrix for transformed corpus using labeller for y_true and clf_attribute_name as y_pred

Parameters

corpus – target Corpus
selector – (lambda) function selecting objects to include in this confusion_matrix; uses all objects by default

Returns

sklearn confusion matrix
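
For readers unfamiliar with the layout of a binary confusion matrix, here is a small pure-Python sketch. It follows the sklearn convention of rows for true labels and columns for predicted labels, with label 0 before label 1; the helper name and sample labels are assumptions made for the example.

```python
from typing import List


def binary_confusion_matrix(y_true: List[int], y_pred: List[int]) -> List[List[int]]:
    """Count outcomes in a 2x2 grid: rows are true labels, columns are predictions."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Made-up labels standing in for labeller output (y_true) and predictions (y_pred).
y_true = [1, 0, 0, 1, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0]

cm = binary_confusion_matrix(y_true, y_pred)
tn, fp = cm[0]  # row 0: objects whose true label is 0
fn, tp = cm[1]  # row 1: objects whose true label is 1
print(cm)
```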

evaluate_with_cv(corpus: convokit.model.corpus.Corpus, cv=sklearn.model_selection.KFold, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function VectorClassifier.<lambda>>)

Evaluate the performance of predictive features (Classifier.pred_feats) in predicting the label, using cross-validation for data splitting.

Parameters

corpus – target Corpus
cv – cross-validation model to use: KFold(n_splits=5, shuffle=True) by default.
selector – a (lambda) function that takes a Corpus object and returns True or False (i.e. include / exclude). By default, the selector includes all objects of the specified type in the Corpus.

Returns

cross-validated accuracy score
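
To make the data splitting concrete, the sketch below generates shuffled 5-fold train/test index sets in plain Python. It is a simplified stand-in for what a shuffled k-fold splitter does, not the sklearn implementation; the function name and the `seed` parameter are assumptions made for the example.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, n_splits: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs for shuffled k-fold splitting."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("need 2 <= n_splits <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first (n_samples % n_splits) folds get one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=12, n_splits=5))
for train, test in folds:
    print(len(train), len(test))
```

Each object appears in exactly one test fold, so every object is predicted once by a model that did not see it during fitting.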

evaluate_with_train_test_split(corpus: convokit.model.corpus.Corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function VectorClassifier.<lambda>>, test_size: float = 0.2)

Evaluate the performance of predictive features (Classifier.pred_feats) in predicting the label, using a train-test split.

Run either on a Corpus (with Classifier labeller, selector, obj_type settings) or a list of Corpus objects

Parameters

corpus – target Corpus
selector – a (lambda) function that takes a Corpus object and returns True or False (i.e. include / exclude). By default, the selector includes all objects of the specified type in the Corpus.
test_size – proportion of the selected objects to hold out as the test set; default: 0.2

Returns

accuracy and confusion matrix
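
The effect of test_size can be sketched with a simple shuffled split. This is an illustration only: the function name, the `seed` parameter, and the choice to round the test-set size up are assumptions made for the example, not a description of the library's internals.

```python
import math
import random
from typing import List, Sequence, Tuple


def split_train_test(
    items: Sequence[str], test_size: float = 0.2, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle items and hold out a test_size fraction as the test set."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be strictly between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Rounding up is a choice made for this sketch.
    n_test = math.ceil(test_size * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical object ids standing in for the selected Corpus objects.
object_ids = [f"obj_{i}" for i in range(10)]
train_ids, test_ids = split_train_test(object_ids, test_size=0.2)
print(len(train_ids), len(test_ids))
```

The model is fit on the training portion only, and the reported accuracy and confusion matrix describe the held-out portion.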

fit(corpus: convokit.model.corpus.Corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function VectorClassifier.<lambda>>, y=None)

Fit the Transformer’s internal classifier model on the vector matrix representing the Corpus components of the configured obj_type, with an optional selector that chooses which objects to fit on.

Parameters

corpus – the target Corpus
selector – a (lambda) function that takes a Corpus object and returns True or False (i.e. include / exclude). By default, the selector includes all objects of the specified type in the Corpus.

Returns

the fitted VectorClassifier

fit_transform(corpus: convokit.model.corpus.Corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function VectorClassifier.<lambda>>) → convokit.model.corpus.Corpus

Runs the fit() and transform() steps in order, with the specified selector.

Parameters

corpus – the target Corpus
selector – a (lambda) function that takes a Corpus object and returns True or False (i.e. include / exclude). By default, the selector includes all objects of the specified type in the Corpus.

Returns

the annotated target Corpus

get_coefs(feature_names: List[str], coef_func=None)

Get dataframe of classifier coefficients

Parameters

feature_names – list of feature names to get coefficients for
coef_func – function for accessing the list of coefficients from the classifier model; by default, assumes it is a pipeline with a logistic regression component

Returns

DataFrame of features and coefficients, indexed by feature names

get_model(): Gets the Classifier’s internal model

get_y_true_pred(corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function Classifier.<lambda>>)

Get lists of true and predicted labels

Parameters

corpus – target Corpus
selector – (lambda) function selecting objects to get labels for; uses all objects by default

Returns

list of true labels, and list of predicted labels

set_model(clf): Sets the Classifier’s internal model

summarize(corpus: convokit.model.corpus.Corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function VectorClassifier.<lambda>>)

Generate a DataFrame indexed by object id with the classifier predictions and scores.

Parameters

corpus – the annotated Corpus
selector – a (lambda) function that takes a Corpus object and returns True or False (i.e. include / exclude). By default, the selector includes all objects of the specified type in the Corpus.

Returns

a pandas DataFrame

summarize_objs(objs: List[convokit.model.corpusComponent.CorpusComponent]): Not implemented for VectorClassifier.

transform(corpus: convokit.model.corpus.Corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function VectorClassifier.<lambda>>) → convokit.model.corpus.Corpus

Annotate the corpus components with the classifier prediction and prediction score, with an optional selector that chooses which objects to classify. Objects that are not selected get a metadata value of ‘None’ instead of the classifier prediction.

Parameters

corpus – the target Corpus
selector – a (lambda) function that takes a Corpus object and returns True or False (i.e. include / exclude). By default, the selector includes all objects of the specified type in the Corpus.

Returns

the annotated target Corpus
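
The selector behaviour of transform can be illustrated with stand-in objects. Everything in this sketch is hypothetical: `FakeComponent` and `fake_predict` are not ConvoKit or sklearn names, the toy scoring rule replaces a fitted classifier, and writing None for the score of unselected objects (as well as for the prediction) is an assumption. Only the default attribute names "prediction" and "pred_score" come from the documentation above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class FakeComponent:
    """Hypothetical stand-in for a Corpus component (not a ConvoKit class)."""
    id: str
    vector: List[float]
    meta: Dict[str, Optional[float]] = field(default_factory=dict)


def fake_predict(vector: List[float]) -> Tuple[int, float]:
    """Toy scorer standing in for a fitted classifier: score is the clipped mean."""
    score = min(1.0, max(0.0, sum(vector) / len(vector)))
    return int(score >= 0.5), score


def annotate(
    objs: List[FakeComponent],
    selector: Callable[[FakeComponent], bool],
    pred_name: str = "prediction",
    score_name: str = "pred_score",
) -> List[FakeComponent]:
    """Write predictions for selected objects and None for the rest."""
    for obj in objs:
        if selector(obj):
            obj.meta[pred_name], obj.meta[score_name] = fake_predict(obj.vector)
        else:
            # Assumption: unselected objects get None for both attributes.
            obj.meta[pred_name], obj.meta[score_name] = None, None
    return objs


objs = [
    FakeComponent("u1", [0.9, 0.7]),
    FakeComponent("u2", [0.1, 0.2]),
    FakeComponent("u3", [0.8, 0.8]),
]
# Only classify u1 and u2; u3 is left unselected.
annotate(objs, selector=lambda obj: obj.id != "u3")
for obj in objs:
    print(obj.id, obj.meta)
```

Because unselected objects still receive the attribute (with a None value), downstream code should check for None before treating the prediction as a label.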

transform_objs(objs: List[convokit.model.corpusComponent.CorpusComponent]) → List[convokit.model.corpusComponent.CorpusComponent]: Not implemented for VectorClassifier.