Paired Prediction

At a high level, Paired Prediction is a quasi-experimental method that controls for certain priors. See Cheng et al. 2014 for an illustrated example of Paired Prediction in research.

As an illustrative example, consider the Friends TV series, where we might want to examine how Rachel talks to Monica and to Chandler differently. At one level, we might just compare the utterances where Rachel speaks to Monica with those where Rachel speaks to Chandler. But this inadvertently surfaces differences that arise from Rachel interacting with Monica and Chandler separately, in different settings and scenarios, and thus highlights only uninteresting differences in the topics discussed.

Instead, we might want to look for subtler differences in speech, controlling for topic, for example. One way to do this is to look only at Conversations where Rachel, Monica, and Chandler are all present. We would then compare the utterances where Rachel speaks to Monica with those where Rachel speaks to Chandler within the same Conversation, and look for differences between these paired sets of utterances.
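The pairing idea above can be sketched in plain Python, independent of ConvoKit. The toy records, the field names (convo, speaker, target, text), and the helper paired_utterances are all hypothetical and exist only for illustration. As a simplification, the sketch keeps conversations in which the speaker addressed both targets, which stands in for "all three are present".

```python
from collections import defaultdict

# Toy utterance records; the data and field names are hypothetical.
utterances = [
    {"convo": "c1", "speaker": "Rachel", "target": "Monica", "text": "Can I borrow your sweater?"},
    {"convo": "c1", "speaker": "Rachel", "target": "Chandler", "text": "Could you stop joking for a second?"},
    {"convo": "c2", "speaker": "Rachel", "target": "Monica", "text": "I quit my job today."},
    {"convo": "c3", "speaker": "Rachel", "target": "Chandler", "text": "Did you see the game?"},
    {"convo": "c3", "speaker": "Monica", "target": "Chandler", "text": "Dinner is ready."},
]


def paired_utterances(records, speaker, target_a, target_b):
    """Group the speaker's utterances by conversation and keep only the
    conversations in which the speaker addressed both targets."""
    by_convo = defaultdict(lambda: {target_a: [], target_b: []})
    for rec in records:
        if rec["speaker"] == speaker and rec["target"] in (target_a, target_b):
            by_convo[rec["convo"]][rec["target"]].append(rec["text"])
    # Drop conversations where one side of the comparison is empty.
    return {
        convo: sides
        for convo, sides in by_convo.items()
        if sides[target_a] and sides[target_b]
    }


pairs = paired_utterances(utterances, "Rachel", "Monica", "Chandler")
```

Only conversation c1 survives the filter, so any difference found between its two sets of utterances cannot be attributed to a difference in setting.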

Documentation for the two transformers that perform the paired prediction task is presented below. The PairedPrediction transformer uses the metadata features of corpus objects for prediction, while the PairedVectorPrediction transformer uses vector data associated with the objects. Also see the documentation for the Pairer transformer, which sets up the pairs needed for paired prediction analysis.

Example usage: Using Hyperconvo features to predict conversation growth on Reddit in a paired setting

class convokit.paired_prediction.pairedPrediction.PairedPrediction(obj_type: str, pred_feats: List[str], clf=None, pair_id_attribute_name: str = 'pair_id', pair_id_feat_name=None, label_attribute_name: str = 'pair_obj_label', label_feat_name=None, pair_orientation_attribute_name: str = 'pair_orientation', pair_orientation_feat_name=None)

At a high level, Paired Prediction is a quasi-experimental method that controls for certain priors. See Cheng et al. 2014 for an illustrated example of Paired Prediction in research. (https://cs.stanford.edu/people/jure/pubs/disqus-icwsm14.pdf)

See Pairer’s documentation for more information about pairing.

Parameters:
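  • obj_type – corpus component type being used for analysis: ‘utterance’, ‘speaker’, or ‘conversation’
  • pred_feats – list of metadata attributes (i.e. predictive features) to be used in prediction. Each feature can be either a single value or a dictionary of key-value pairs.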
  • clf – optional classifier to be used in the paired prediction
  • pair_id_attribute_name – metadata attribute name to use in annotating object with pair id, default: “pair_id”
  • label_attribute_name – metadata attribute name to use in annotating object with predicted label, default: “pair_obj_label”
  • pair_orientation_attribute_name – metadata attribute name to use in annotating object with pair orientation, default: “pair_orientation”
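To show how the pair id, label, and orientation attributes can fit together, here is a minimal sketch in plain Python. It assumes one common setup for paired prediction, not confirmed against the ConvoKit source: each pair yields one training row built from the difference of the two objects' feature values, and the orientation decides the order of subtraction and hence the class label. The toy objects, the ‘pos’/‘neg’ values, and the helper build_paired_rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical objects carrying the three pairing attributes and two features.
objects = [
    {"pair_id": "p1", "pair_obj_label": "pos", "pair_orientation": "pos",
     "feats": {"n_words": 12.0, "n_questions": 2.0}},
    {"pair_id": "p1", "pair_obj_label": "neg", "pair_orientation": "pos",
     "feats": {"n_words": 7.0, "n_questions": 0.0}},
    {"pair_id": "p2", "pair_obj_label": "pos", "pair_orientation": "neg",
     "feats": {"n_words": 10.0, "n_questions": 1.0}},
    {"pair_id": "p2", "pair_obj_label": "neg", "pair_orientation": "neg",
     "feats": {"n_words": 7.0, "n_questions": 0.0}},
]


def build_paired_rows(objs, feat_names):
    """Return (X, y) with one feature-difference row per complete pair."""
    pairs = defaultdict(dict)
    for obj in objs:
        pairs[obj["pair_id"]][obj["pair_obj_label"]] = obj

    X, y = [], []
    for pair_id in sorted(pairs):
        pair = pairs[pair_id]
        if "pos" not in pair or "neg" not in pair:
            continue  # skip pairs that are missing one member
        pos, neg = pair["pos"], pair["neg"]
        # Orientation "pos" means (pos - neg) with label 1;
        # any other orientation means (neg - pos) with label 0.
        if pos["pair_orientation"] == "pos":
            first, second, label = pos, neg, 1
        else:
            first, second, label = neg, pos, 0
        X.append([first["feats"][f] - second["feats"][f] for f in feat_names])
        y.append(label)
    return X, y


X, y = build_paired_rows(objects, ["n_words", "n_questions"])
```

Varying the orientation across pairs keeps both classes present in y; if every row were computed as positive minus negative, every label would be 1 and a classifier would have nothing to learn.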
fit(corpus: convokit.model.corpus.Corpus, y=None, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function PairedPrediction.<lambda>>)

Fit the internal classifier on the paired object features, with an optional selector selecting for which corpus objects to include in the analysis

Parameters:
  • corpus – target Corpus
  • selector – a (lambda) function that takes a corpus component object and returns a bool: True if the object is to be included in the paired prediction. By default, includes all objects.
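A selector is simply a predicate over corpus components. The sketch below uses a hypothetical stand-in class, ToyComponent, with a dictionary-like meta attribute, and assumes that real components expose their metadata in a similar way. The predicate has_pair_info is also hypothetical; a function of this shape could be passed as the selector argument.

```python
from dataclasses import dataclass, field


@dataclass
class ToyComponent:
    """Hypothetical stand-in for a corpus component with a metadata dict."""
    id: str
    meta: dict = field(default_factory=dict)


def has_pair_info(obj) -> bool:
    """Selector: keep only objects that carry a pair id in their metadata."""
    return obj.meta.get("pair_id") is not None


components = [
    ToyComponent("a", {"pair_id": "p1"}),
    ToyComponent("b", {}),
    ToyComponent("c", {"pair_id": "p2"}),
]

# Apply the selector the way a transformer would when filtering objects.
selected = [c.id for c in components if has_pair_info(c)]
```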
Returns:

fitted PairedPrediction Transformer

get_coefs(feature_names: List[str], coef_func=None)

Get dataframe of classifier coefficients.

Parameters:
  • feature_names – list of feature names to get coefficients for
  • coef_func – function for accessing the list of coefficients from the classifier model; by default, assumes it is a pipeline with a logistic regression component
Returns:

DataFrame of features and coefficients, indexed by feature names
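When the classifier is not a pipeline with a logistic regression component, a custom coef_func is needed. The sketch below assumes, without confirmation from the library source, that coef_func receives the fitted classifier and returns a flat list of coefficients in the same order as feature_names. The classifier here is a hypothetical stand-in object, and first_row_coefs and the feature names are illustrative.

```python
from types import SimpleNamespace

# Hypothetical stand-in for a fitted linear classifier that exposes its
# coefficients as a 2D array-like, one row per decision function.
toy_clf = SimpleNamespace(coef_=[[0.8, -1.5, 0.1]])


def first_row_coefs(clf):
    """Assumed coef_func shape: take the classifier, return a flat list."""
    return list(clf.coef_[0])


feature_names = ["n_words", "n_questions", "n_replies"]
coefs = first_row_coefs(toy_clf)

# Pair each feature with its coefficient and sort from most positive to
# most negative, which is how such a table is usually read.
ranked = sorted(zip(feature_names, coefs), key=lambda kv: kv[1], reverse=True)
```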

summarize(corpus: convokit.model.corpus.Corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function PairedPrediction.<lambda>>, cv=KFold(n_splits=5, random_state=None, shuffle=True))

Run PairedPrediction on the corpus with cross-validation and return the mean cross-validation score.

Parameters:
  • corpus – target Corpus (must already be annotated with pair information, e.g. by Pairer.transform())
  • selector – a (lambda) function that takes a corpus component object and returns a bool: True if the object is to be included in the summary. By default, includes all objects.
  • cv – optional CV model: default is KFold(n_splits=5, shuffle=True)
Returns:

cross-validation accuracy score
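To make the returned score concrete, the sketch below computes a mean k-fold accuracy in plain Python. It is a simplified stand-in for scikit-learn's KFold and for the default classifier, not the library's own procedure: the helpers kfold_indices and mean_cv_accuracy, the midpoint-threshold classifier, and the toy data are all hypothetical.

```python
import random


def kfold_indices(n_items, n_splits=5, seed=0):
    """Shuffle indices 0..n_items-1 and deal them into n_splits folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_splits] for i in range(n_splits)]


def mean_cv_accuracy(X, y, fit, predict, n_splits=5, seed=0):
    """Train on all folds but one, score on the held-out fold, and average."""
    scores = []
    for held_out in kfold_indices(len(X), n_splits, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


def fit_threshold(X_train, y_train):
    """Toy classifier: threshold halfway between the two class means."""
    pos = [x[0] for x, label in zip(X_train, y_train) if label == 1]
    neg = [x[0] for x, label in zip(X_train, y_train) if label == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_threshold(threshold, x):
    return 1 if x[0] > threshold else 0


# Toy feature-difference rows: positive differences are labeled 1.
X = [[d] for d in (1.0, 2.0, 3.0, 4.0, 5.0, -1.0, -2.0, -3.0, -4.0, -5.0)]
y = [1] * 5 + [0] * 5

score = mean_cv_accuracy(X, y, fit_threshold, predict_threshold)
```

Because the toy data is perfectly separable, every fold is classified correctly here; on real paired features the score would typically be compared against the 0.5 chance level of a balanced binary task.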

transform(corpus: convokit.model.corpus.Corpus) → convokit.model.corpus.Corpus

PairedPrediction does not add any annotations to the Corpus.

class convokit.paired_prediction.pairedVectorPrediction.PairedVectorPrediction(obj_type: str, vector_name: str, clf=None, pair_id_attribute_name: str = 'pair_id', label_attribute_name: str = 'pair_obj_label', pair_orientation_attribute_name: str = 'pair_orientation')

Transformer for doing a Paired Prediction with vectors.

Parameters:
  • obj_type – corpus component type being used for analysis: ‘utterance’, ‘speaker’, or ‘conversation’
  • vector_name – name of the vector matrix containing the bag-of-words vectors
  • clf – classifier to be used in the paired prediction; by default: standard-scaled logistic regression
  • pair_id_attribute_name – metadata attribute name to use in annotating object with pair id, default: “pair_id”
  • label_attribute_name – metadata attribute name to use in annotating object with predicted label, default: “pair_obj_label”
  • pair_orientation_attribute_name – metadata attribute name to use in annotating object with pair orientation, default: “pair_orientation”
fit(corpus: convokit.model.corpus.Corpus, y=None, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function PairedVectorPrediction.<lambda>>)

Fit the internal classifier to the Corpus component objects.

Parameters:
  • corpus – the target Corpus
  • selector – selector (lambda) function for which objects should be included in the analysis
Returns:

this Transformer object with a fitted internal classifier

get_coefs(feature_names: List[str], coef_func=None)

Get dataframe of classifier coefficients. By default, assumes it is a pipeline with a logistic regression component. For other setups, the user should define a custom coef_func.

Parameters:
  • feature_names – list of feature names to get coefficients for
  • coef_func – (optional) function for accessing the list of coefficients from the classifier model
Returns:

DataFrame of features and coefficients, indexed by feature names

summarize(corpus: convokit.model.corpus.Corpus, selector: Callable[[convokit.model.corpusComponent.CorpusComponent], bool] = <function PairedVectorPrediction.<lambda>>, cv=KFold(n_splits=5, random_state=None, shuffle=True))

Run PairedVectorPrediction on the corpus with cross-validation and return the mean cross-validation score.

Parameters:
  • corpus – annotated Corpus (with pair information, e.g. from Pairer.transform())
  • selector – selector (lambda) function for which objects should be included in the analysis
  • cv – optional CV model: default is KFold(n_splits=5, shuffle=True)
Returns:

cross-validation accuracy score