VectorClassifier

Example usage: bag-of-words classification.

class convokit.classifier.vectorClassifier.VectorClassifier(obj_type: str, vector_name: str, columns: List[str] = None, labeller: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function VectorClassifier.<lambda>>, clf=None, clf_attribute_name: str = 'prediction', clf_prob_attribute_name: str = 'pred_score')

Transformer that trains a classifier on vector representations of the Corpus components’ text (e.g. bag-of-words or TF-IDF).

The Corpus must already contain a vector matrix stored under the specified vector_name.

Inherits from Classifier and has access to its methods.

Parameters:
  • obj_type – “speaker”, “utterance”, or “conversation”
  • vector_name – the metadata key where the Corpus object text vector is stored
  • columns – list of column names of vector matrix to use; uses all columns by default.
  • labeller – a (lambda) function that takes a Corpus object and returns True (y=1) or False (y=0) - i.e. labeller defines the y value of the object for fitting
  • clf – an sklearn classifier. By default, clf is a Pipeline with StandardScaler and LogisticRegression
  • clf_attribute_name – the metadata attribute name to store the classifier prediction value under; default: “prediction”
  • clf_prob_attribute_name – the metadata attribute name to store the classifier prediction score under; default: “pred_score”
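
The labeller, like the selector arguments of the methods on this page, is a plain callable that takes a Corpus component and returns a bool. The following is a minimal sketch of that contract only. It assumes a hypothetical stand-in class FakeUtterance with a meta dict (not a real ConvoKit object) and a hypothetical metadata key is_question.

```python
from typing import Callable, Dict, List


class FakeUtterance:
    """Hypothetical stand-in for a Corpus component; it only carries metadata."""

    def __init__(self, id: str, meta: Dict[str, object]):
        self.id = id
        self.meta = meta


# labeller: True maps to y=1, False maps to y=0
labeller: Callable[[FakeUtterance], bool] = lambda obj: bool(obj.meta["is_question"])

# selector: include only objects that carry the (hypothetical) label key
selector: Callable[[FakeUtterance], bool] = lambda obj: "is_question" in obj.meta

objs: List[FakeUtterance] = [
    FakeUtterance("u0", {"is_question": True}),
    FakeUtterance("u1", {"is_question": False}),
    FakeUtterance("u2", {}),  # no label, so the selector excludes it
]

# Apply the selector first, then derive the y values from the labeller
selected = [obj for obj in objs if selector(obj)]
y = [int(labeller(obj)) for obj in selected]
print([obj.id for obj in selected], y)
```

Callables of this shape would be passed as the labeller argument of the constructor and as the selector argument of the methods.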

accuracy(corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function Classifier.<lambda>>)

Calculate the accuracy of the classification

Parameters:
  • corpus – target Corpus
  • selector – (lambda) function selecting objects to include in this accuracy calculation; uses all objects by default
Returns:

float value

base_accuracy(corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function Classifier.<lambda>>)

Get the base accuracy, i.e. the accuracy of always predicting the majority class: the larger of the proportions of true labels that are y=1 and y=0

Parameters:
  • corpus – the classified Corpus
  • selector – (lambda) function selecting objects to include in this accuracy calculation; uses all objects by default
Returns:

float value
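
To make the difference between accuracy and base accuracy concrete, here is a small worked sketch on hand-written label lists. The lists are invented for illustration, and the base accuracy is computed from the true labels, which is an assumption about how the library derives it.

```python
# Invented example labels: 3 positives and 7 negatives
y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0, 0, 1]

# Accuracy: fraction of objects whose prediction matches the label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Base accuracy: the score of always guessing the more frequent class
frac_positive = sum(y_true) / len(y_true)
base_accuracy = max(frac_positive, 1 - frac_positive)

print(accuracy, base_accuracy)
```

A classifier is only informative to the extent that its accuracy exceeds the base accuracy on the same selection of objects.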

classification_report(corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function Classifier.<lambda>>)

Generate a classification report for the transformed corpus, using the labeller for y_true and the value stored under clf_attribute_name for y_pred

Parameters:
  • corpus – target Corpus
  • selector – (lambda) function selecting objects to include in this classification report; uses all objects by default
Returns:

classification report

confusion_matrix(corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function Classifier.<lambda>>)

Generate a confusion matrix for the transformed corpus, using the labeller for y_true and the value stored under clf_attribute_name for y_pred

Parameters:
  • corpus – target Corpus
  • selector – (lambda) function selecting objects to include in this confusion_matrix; uses all objects by default
Returns:

sklearn confusion matrix
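
For reading the result: sklearn confusion matrices put true labels on the rows and predicted labels on the columns, with labels in sorted order. The sketch below rebuilds that layout in plain Python for binary labels. The helper name binary_confusion_matrix and the label lists are hypothetical, not part of ConvoKit.

```python
from typing import List


def binary_confusion_matrix(y_true: List[int], y_pred: List[int]) -> List[List[int]]:
    """Rows are true labels (0, 1); columns are predicted labels (0, 1)."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Invented example labels
y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0, 0, 1]

cm = binary_confusion_matrix(y_true, y_pred)
# Unpack in the sklearn layout: [[TN, FP], [FN, TP]]
(tn, fp), (fn, tp) = cm
print(cm)
```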

evaluate_with_cv(corpus: convokit.model.corpus.Corpus, cv=KFold(n_splits=5, random_state=None, shuffle=True), selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function VectorClassifier.<lambda>>)

Evaluate how well the vector features predict the label, using cross-validation to split the data.

Parameters:
  • corpus – target Corpus
  • cv – cross-validation model to use: KFold(n_splits=5, shuffle=True) by default.
  • selector – a (lambda) function that takes a Corpus object and returns True or False (i.e. include / exclude). By default, the selector includes all objects of the specified type in the Corpus.
Returns:

cross-validated accuracy score
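
The cross-validation setup described above can be sketched directly in scikit-learn. This is an illustration of the technique, not the method's implementation. The data is synthetic, the pipeline step names "scaler" and "logreg" are made up, and random_state is fixed here only for reproducibility (the documented default leaves it as None).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the vector matrix: 100 objects, 4 feature columns
X = rng.normal(size=(100, 4))
# Synthetic labels that depend on the first column, plus some noise
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

# A pipeline of the same kind as the documented default classifier
clf = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression()),
])

# 5 shuffled folds; each fold is held out once while the rest is used for fitting
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)  # one accuracy score per fold
print(scores.mean())
```

Any other sklearn splitter (for example StratifiedKFold) could be passed as cv in the same way.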

evaluate_with_train_test_split(corpus: convokit.model.corpus.Corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function VectorClassifier.<lambda>>, test_size: float = 0.2)

Evaluate how well the vector features predict the label, using a train-test split.

Runs on a Corpus, using the classifier’s labeller and obj_type settings together with the given selector.

Parameters:
  • corpus – target Corpus
  • selector – a (lambda) function that takes a Corpus object and returns True or False (i.e. include / exclude). By default, the selector includes all objects of the specified type in the Corpus.
  • test_size – fraction of the selected objects to hold out as the test set; default: 0.2
Returns:

accuracy and confusion matrix
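
As a rough picture of what test_size controls, the sketch below holds out a fraction of object ids after shuffling. The helper holdout_split, its seed parameter, and the ids are hypothetical. The library's own split may differ in details such as rounding and randomization.

```python
import math
import random
from typing import List, Sequence, Tuple


def holdout_split(
    ids: Sequence[str], test_size: float = 0.2, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle the ids, then reserve a test_size fraction of them for testing."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


ids = [f"u{i}" for i in range(10)]
train_ids, test_ids = holdout_split(ids, test_size=0.2)
print(len(train_ids), len(test_ids))
```

The classifier would be fit on the training portion only, with accuracy and the confusion matrix computed on the held-out portion.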

fit(corpus: convokit.model.corpus.Corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function VectorClassifier.<lambda>>, y=None)

Fit the Transformer’s internal classifier model on the vector matrix that represents one of the Corpus components, with an optional selector that chooses which objects to fit on.

Parameters:
  • corpus – the target Corpus
  • selector – a (lambda) function that takes a Corpus object and returns True or False (i.e. include / exclude). By default, the selector includes all objects of the specified type in the Corpus.
Returns:

the fitted VectorClassifier

fit_transform(corpus: convokit.model.corpus.Corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function VectorClassifier.<lambda>>) → convokit.model.corpus.Corpus

Runs the fit() and transform() steps in order, with the specified selector.

Parameters:
  • corpus – the target Corpus
  • selector – a (lambda) function that takes a Corpus object and returns True or False (i.e. include / exclude). By default, the selector includes all objects of the specified type in the Corpus.
Returns:

the annotated target Corpus

get_coefs(feature_names: List[str], coef_func=None)

Get dataframe of classifier coefficients

Parameters:
  • feature_names – list of feature names to get coefficients for
  • coef_func – function for accessing the list of coefficients from the classifier model; by default, assumes it is a pipeline with a logistic regression component
Returns:

DataFrame of features and coefficients, indexed by feature names
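
A coef_func is a callable that pulls the coefficient list out of the model. The sketch below shows that contract with a hypothetical FakeModel stand-in and invented feature names and values. It returns a sorted list of pairs rather than a DataFrame, and the descending sort order is an assumption.

```python
from typing import List, Tuple


class FakeModel:
    """Hypothetical stand-in for a fitted classifier exposing coefficients."""

    coef_ = [[0.8, -1.2, 0.1]]


def coef_func(model: FakeModel) -> List[float]:
    # Access the coefficient row for the positive class
    return list(model.coef_[0])


feature_names = ["please", "never", "the"]  # invented column names
coefs = coef_func(FakeModel())

# One coefficient is expected per feature name
assert len(coefs) == len(feature_names)

# Pair each feature with its coefficient, largest coefficient first
table: List[Tuple[str, float]] = sorted(
    zip(feature_names, coefs), key=lambda pair: pair[1], reverse=True
)
print(table)
```

The feature names passed in need to line up, in order, with the columns the classifier was trained on.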

get_model()

Gets the Classifier’s internal model

get_y_true_pred(corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function Classifier.<lambda>>)

Get lists of true and predicted labels

Parameters:
  • corpus – target Corpus
  • selector – (lambda) function selecting objects to get labels for; uses all objects by default
Returns:

list of true labels, and list of predicted labels

set_model(clf)

Sets the Classifier’s internal model

summarize(corpus: convokit.model.corpus.Corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function VectorClassifier.<lambda>>)

Generate a DataFrame indexed by object id with the classifier predictions and scores.

Parameters:
  • corpus – the annotated Corpus
  • selector – a (lambda) function that takes a Corpus object and returns True or False (i.e. include / exclude). By default, the selector includes all objects of the specified type in the Corpus.
Returns:

a pandas DataFrame

summarize_objs(objs: List[convokit.model.corpusComponent.CorpusComponent])

Not implemented for VectorClassifier.

transform(corpus: convokit.model.corpus.Corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function VectorClassifier.<lambda>>) → convokit.model.corpus.Corpus

Annotate the corpus components with the classifier prediction and prediction score, with an optional selector that chooses which objects to classify. Objects that are not selected get a metadata value of None instead of the classifier prediction.

Parameters:
  • corpus – the target Corpus
  • selector – a (lambda) function that takes a Corpus object and returns True or False (i.e. include / exclude). By default, the selector includes all objects of the specified type in the Corpus.
Returns:

the annotated target Corpus
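
The annotation behaviour described above, including the None value for unselected objects, can be pictured with the following sketch. FakeObj, annotate, and the threshold-based predict function are hypothetical stand-ins. Writing None for the score of unselected objects as well is an assumption.

```python
from typing import Callable, Dict, List, Optional, Tuple


class FakeObj:
    """Hypothetical stand-in for a Corpus component with a metadata dict."""

    def __init__(self, id: str, score: float):
        self.id = id
        self.score = score  # invented value standing in for a model output
        self.meta: Dict[str, Optional[float]] = {}


def predict(obj: FakeObj) -> Tuple[int, float]:
    # Stand-in for the classifier: threshold the score at 0.5
    return int(obj.score >= 0.5), obj.score


def annotate(
    objs: List[FakeObj],
    selector: Callable[[FakeObj], bool],
    clf_attribute_name: str = "prediction",
    clf_prob_attribute_name: str = "pred_score",
) -> List[FakeObj]:
    for obj in objs:
        if selector(obj):
            label, prob = predict(obj)
            obj.meta[clf_attribute_name] = label
            obj.meta[clf_prob_attribute_name] = prob
        else:
            # Unselected objects still get the keys, but with None values
            obj.meta[clf_attribute_name] = None
            obj.meta[clf_prob_attribute_name] = None
    return objs


objs = [FakeObj("u0", 0.9), FakeObj("u1", 0.2), FakeObj("u2", 0.7)]
annotate(objs, selector=lambda obj: obj.id != "u2")
print([(obj.id, obj.meta) for obj in objs])
```

Downstream code that reads the prediction metadata therefore has to handle None for objects the selector excluded.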

transform_objs(objs: List[convokit.model.corpusComponent.CorpusComponent]) → List[convokit.model.corpusComponent.CorpusComponent]

Not implemented for VectorClassifier.