Tools¶

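ConvoKit provides the analysis tools listed below for extracting conversational features and studying social phenomena. Each entry names the transformer class and its tags, along with the related research and an example where available.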

Bag-of-Words¶

Vectorizes corpus objects (utterances, speakers, or conversations) using bag-of-words representations. Compatible with any sklearn-style vectorizer including TF-IDF. Stores representations as ConvoKitMatrix objects for downstream use with classifiers.

Transformer: BoWTransformer
Tags: feature extraction, vectorization, speaker, utterance, conversation

Example: Bag-of-Words classification
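
To make the representation concrete, the following is a minimal, library-free sketch of a bag-of-words count matrix. It is not ConvoKit's implementation, which delegates to an sklearn-style vectorizer; the function name `bag_of_words` and the whitespace tokenizer are assumptions made for illustration.

```python
from collections import Counter


def bag_of_words(texts):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocabulary[j] in texts[i]."""
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    tokenized = [text.lower().split() for text in texts]
    # A sorted vocabulary gives every term a stable column index.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


vocab, bow = bag_of_words(["thanks so much", "so so sorry"])
```

Each row corresponds to one corpus object and each column to one vocabulary term, which is the shape a downstream classifier expects.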

Classifier¶

Trains and applies a classifier on corpus object metadata features. Uses an sklearn-compatible classifier and exposes evaluation utilities including cross-validation and train-test split scoring.

Transformer: Classifier
Tags: classification, modeling, utterance, conversation, speaker, labeling

Example: Politeness

Column Normalized Tf-Idf¶

A modified TF-IDF transformer that normalizes each column (term) rather than each row (corpus object), producing more balanced term representations across a corpus. Often used as input to the Expected Conversational Context Framework.

Transformer: ColNormedTfidfTransformer
Tags: feature extraction, utterance, representation
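
The difference between row and column normalization can be sketched in plain Python. This is an illustration under stated assumptions, not the transformer's actual code: the smoothed idf formula follows scikit-learn's default, the column norm is assumed to be L2, and the helper names `tfidf_matrix` and `normalize_columns` are hypothetical.

```python
import math


def tfidf_matrix(docs):
    """Unnormalized tf-idf with smoothed idf: tf * (ln((1 + n) / (1 + df)) + 1)."""
    vocab = sorted({token for doc in docs for token in doc})
    n = len(docs)
    df = {t: sum(1 for doc in docs if t in doc) for t in vocab}
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    return vocab, [[doc.count(t) * idf[t] for t in vocab] for doc in docs]


def normalize_columns(matrix):
    """Scale each column (term) to unit L2 norm; all-zero columns stay zero."""
    n_cols = len(matrix[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_cols)]
    return [
        [row[j] / norms[j] if norms[j] else 0.0 for j in range(n_cols)]
        for row in matrix
    ]


# Each document is a list of tokens.
docs = [["why", "not"], ["why", "why", "so"], ["so", "what"]]
vocab, X = tfidf_matrix(docs)
X_col = normalize_columns(X)
```

After this step every term column has the same overall magnitude, so a term's weight in one object is expressed relative to that term's use across the corpus.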

Community Embedder¶

Embeds community-level hypergraph statistics from HyperConvo into a low-dimensional space, enabling comparison and visualization of structural differences across communities or subreddits.

Transformer: CommunityEmbedder
Research: Patterns of Participant Interactions
Tags: measurement, feature extraction, statistical, corpus, pattern

Conversation Dynamics Similarity (ConDynS)¶

Measures the similarity between two conversations with respect to their conversational dynamics, using representations derived from the Summary of Conversation Dynamics (SCD) and sequence alignment. This enables topic-independent comparison of how conversations unfold.

Transformer: ConvoDynamicsSimilarity
Research: A Similarity Measure for Comparing Conversational Dynamics
Tags: measurement, feature extraction, LLM, corpus, pattern, conversation-flow, context, comparison

Expected Conversational Context Framework¶

Derives representations of utterances and terms based on their expected conversational context — the replies they tend to elicit or the utterances they tend to appear near. Supports both forward and backward context modeling.

Transformer: ExpectedContextModel
Research: Expected Context Framework
Tags: structural, modeling, utterance, exchange, linguistic, context

Examples:

Fighting Words¶

Identifies the n-gram features that most distinguish two groups of corpus objects, using Monroe et al.’s Dirichlet-multinomial method. Annotates objects with the top fighting words of each class.

Transformer: FightingWords
Tags: measurement, statistical, utterance, conversation, speaker, power, influence, social, pattern, comparison

Examples: r/atheism vs r/Christianity
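
As a rough illustration of the scoring idea, the sketch below computes a z-scored log-odds ratio with a Dirichlet prior for each word, following the formulation in Monroe et al. It is a simplified stand-in, not the FightingWords transformer itself: it handles unigrams only, and the uniform prior value of 0.1 and the function name `fighting_words_z` are assumptions.

```python
import math
from collections import Counter


def fighting_words_z(tokens_a, tokens_b, prior=0.1):
    """z-scored log-odds ratio per word with a uniform Dirichlet prior.

    Positive scores mark words favored by group A, negative scores by group B.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = set(counts_a) | set(counts_b)
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    prior_total = prior * len(vocab)
    scores = {}
    for word in vocab:
        # Smoothed counts: observed count plus the prior pseudo-count.
        y_a, y_b = counts_a[word] + prior, counts_b[word] + prior
        log_odds_a = math.log(y_a / (n_a + prior_total - y_a))
        log_odds_b = math.log(y_b / (n_b + prior_total - y_b))
        variance = 1.0 / y_a + 1.0 / y_b
        scores[word] = (log_odds_a - log_odds_b) / math.sqrt(variance)
    return scores


scores = fighting_words_z("faith faith the".split(), "proof proof the".split())
```

Words used equally by both groups, such as "the" in this toy input, score zero, while group-specific words receive large positive or negative scores.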

Forecaster¶

A framework for forecasting the eventual outcome of a conversation while it is still unfolding. Wraps any ForecasterModel (e.g. CRAFT or a BERT-based model) and feeds it a chronological stream of context tuples, enabling a prediction at each utterance.

Transformer: Forecaster
Research: Trouble on the Horizon
Tags: prediction, machine learning, neural, utterance, forecasting, LLM

Example: CRAFT on CGA
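
The chronological stream of contexts described above can be pictured with a small sketch. Everything here is hypothetical and chosen for illustration: `ContextWindow`, `stream_contexts`, and `toy_model` are not ConvoKit classes or functions, and the scoring rule is a placeholder for a real forecasting model.

```python
from typing import Iterator, List, NamedTuple


class ContextWindow(NamedTuple):
    """Hypothetical container: what a model may see when scoring one utterance."""
    conversation_id: str
    context: List[str]  # utterances so far, oldest first, including the current one
    current_utterance: str


def stream_contexts(conversation_id: str, utterances: List[str]) -> Iterator[ContextWindow]:
    """Yield one growing context per utterance, in chronological order."""
    for i, utterance in enumerate(utterances):
        yield ContextWindow(conversation_id, utterances[: i + 1], utterance)


def toy_model(window: ContextWindow) -> float:
    """Stand-in scorer: fraction of context utterances containing '!'."""
    return sum("!" in u for u in window.context) / len(window.context)


forecasts = [toy_model(w) for w in stream_contexts("c1", ["hi", "no!", "stop!"])]
```

The key property is that each prediction only sees utterances up to the current point, never the future of the conversation.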

Hypergraph Conversation Representation¶

Extracts structural features of conversations through a hypergraph model, computing degree distribution statistics and motif counts for both full conversations and mid-threads. Forms the basis for ThreadEmbedder and CommunityEmbedder.

Transformer: HyperConvo
Research: Patterns of Participant Interactions
Tags: structural, graph, conversation, pattern

Example: Reddit hypergraph analysis

Linguistic Coordination¶

Measures the propensity of a speaker to echo the function words used by another speaker in a conversation, serving as a proxy for linguistic coordination and relative power dynamics between individuals or groups.

Transformer: Coordination
Research: Echoes of Power
Tags: measurement, statistical, conversation, linguistic, power, influence

Example: Power balance in U.S. Supreme Court
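
One way to read the measure: for a given class of function words, coordination is how much more likely a reply is to use that class when the prompt used it, compared with the reply's baseline rate. The sketch below implements that difference on toy data. It is a simplification and not the Coordination transformer's code; the function name `coordination` and the marker labels are assumptions.

```python
def coordination(exchanges, marker):
    """exchanges: list of (prompt_markers, reply_markers) pairs, each a set of marker names.

    Returns P(reply has marker | prompt has marker) - P(reply has marker),
    or None when the marker never occurs in a prompt.
    """
    triggered = [reply for prompt, reply in exchanges if marker in prompt]
    if not exchanges or not triggered:
        return None
    p_conditional = sum(marker in reply for reply in triggered) / len(triggered)
    p_baseline = sum(marker in reply for _, reply in exchanges) / len(exchanges)
    return p_conditional - p_baseline


exchanges = [
    ({"article"}, {"article"}),
    ({"article", "pronoun"}, {"article"}),
    ({"pronoun"}, set()),
    (set(), {"pronoun"}),
]
article_score = coordination(exchanges, "article")
pronoun_score = coordination(exchanges, "pronoun")
```

A positive value means the replier echoes the marker more than usual after seeing it; a negative value means the marker is used less than usual in that situation.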

Linguistic Diversity¶

Computes the linguistic divergence between a speaker’s language in each conversation and a reference language model trained on other conversations or speakers, measuring how a speaker’s voice develops over time.

Transformer: SpeakerConvoDiversity
Research: Finding Your Voice
Tags: measurement, statistical, speaker, corpus, diversity, development

Example: Linguistic diversity on ChangeMyView

LLM Prompt Transformer¶

Applies custom LLM prompts to corpus objects at any level — utterances, conversations, speakers, or the entire corpus — and stores responses as metadata. Supports multiple LLM providers including OpenAI GPT, Google Gemini, and local models.

Transformer: LLMPromptTransformer
Tags: LLM, feature extraction, utterance, conversation, speaker, corpus, pragmatics

Example: GenAI module

Pairer¶

Annotates corpus objects with pairing information needed for paired prediction analyses. Controls for conversational context by pairing objects from the same conversation, enabling comparisons that isolate the variable of interest.

Transformer: Pairer
Tags: pre-processing, prediction, statistical, utterance, conversation, speaker, representation, pattern
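
The pairing step can be sketched as grouping objects by conversation and matching one positive object with one negative object from the same group. This is an illustrative simplification, not the Pairer API: the tuple layout, the "pos"/"neg" labels, and the function name `pair_within_conversations` are assumptions.

```python
from collections import defaultdict


def pair_within_conversations(objects):
    """objects: iterable of (object_id, conversation_id, label), label 'pos' or 'neg'.

    Returns (pos_id, neg_id) pairs, at most one pair per conversation.
    """
    by_convo = defaultdict(lambda: {"pos": [], "neg": []})
    for obj_id, convo_id, label in objects:
        by_convo[convo_id][label].append(obj_id)
    pairs = []
    for convo_id in sorted(by_convo):
        group = by_convo[convo_id]
        # Conversations lacking either label cannot contribute a pair.
        if group["pos"] and group["neg"]:
            pairs.append((group["pos"][0], group["neg"][0]))
    return pairs


pairs = pair_within_conversations([
    ("u1", "c1", "pos"), ("u2", "c1", "neg"),
    ("u3", "c2", "pos"),
    ("u4", "c3", "neg"), ("u5", "c3", "pos"),
])
```

Because both members of a pair share a conversation, conversation-level factors such as topic are held constant within each comparison.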

Paired Prediction¶

A quasi-experimental prediction method that controls for confounds by comparing matched pairs of corpus objects from the same conversation, so that differences within a pair are less likely to reflect conversation-level factors.

Transformer: PairedPrediction
Research: Antisocial Behavior in Online Discussion Communities
Tags: prediction, classification, machine learning, corpus, detection

Example: Predicting conversation growth on Reddit

Phrasing Motifs¶

Extracts arc-based phrasing patterns from dependency-parsed utterances by abstracting away content words, capturing common syntactic structures independently of topic. Used as input to the Prompt Types framework.

Transformer: PhrasingMotifs
Research: Asking Too Much?
Tags: feature extraction, utterance, linguistic, pragmatics

Example: Phrasing motifs in prompt type models

Pivotal Moment Measure¶

Identifies pivotal moments in a conversation where simulated alternative responses would most change the predicted outcome. Combines a simulator model and a forecaster model to score each conversational position.

Transformer: PivotalMomentMeasure
Tags: prediction, modeling, conversation, turning-points, simulation

Example: Pivotal moments in conversations gone awry

Politeness Strategies¶

Detects lexical and parse-based politeness and impoliteness strategies in utterances based on the Brown and Levinson politeness framework, producing binary feature vectors over a set of validated linguistic markers.

Transformer: PolitenessStrategies
Research: A Computational Approach to Politeness
Tags: measurement, statistical, conversation, linguistic, social, politeness, pragmatics

Example: Extracting politeness features and markers

Prompt Types¶

Infers latent types of conversational prompts based on how they are phrased, using SVD-based embeddings of phrasing motifs relative to their response contexts. Assigns each utterance a prompt type and vector representation.

Transformer: PromptTypes
Research: Asking Too Much?
Tags: feature extraction, vectorization, utterance, context, pragmatics

Examples:

Ranker¶

Sorts and annotates corpus objects with rankings based on a user-defined scoring function. Supports ranking of utterances, speakers, or conversations by any derived or metadata feature.

Transformer: Ranker
Tags: sorting, statistical, utterance, conversation, speaker, representation

Example: Ranking users in r/Cornell by comment count
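
The behavior described above amounts to scoring each object, sorting by score, and recording the resulting position. A minimal sketch, with the function name `rank_objects` and the dictionary layout chosen for illustration rather than taken from the Ranker API:

```python
def rank_objects(objects, score_fn):
    """Return {object_id: (score, rank)} with rank 1 for the highest score.

    Ties keep their input order because sorted() is stable.
    """
    scored = [(obj_id, score_fn(obj)) for obj_id, obj in objects.items()]
    ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return {obj_id: (score, rank) for rank, (obj_id, score) in enumerate(ordered, start=1)}


# Toy data: each speaker maps to a list of their comments; score = comment count.
speakers = {"ana": ["hi", "ok", "bye"], "bo": ["hello"], "cy": ["yes", "no"]}
rankings = rank_objects(speakers, score_fn=len)
```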

Redirection and Utterance Likelihood¶

Measures the extent to which an utterance redirects conversational flow away from its context, and computes utterance log-likelihoods given surrounding context using a language model.

Transformers: Redirection, UtteranceLikelihood
Research: Conversational Redirection in Therapy
Tags: structural, modeling, utterance, exchange, conversation-flow, detection, LLM, simulation

Example: Redirection in Supreme Court

Summary of Conversation Dynamics (SCD)¶

Generates structured natural-language summaries of conversational dynamics using the LLM Prompt Transformer. Summaries describe how interactions unfold over time, capturing turn-by-turn shifts in tone, topic, and social dynamics.

Transformer: SCD
Research: How Did We Get Here? Summarizing Conversation Dynamics
Tags: measurement, feature extraction, LLM, corpus, pattern, conversation-flow, context

Example: SCD on conversations gone awry

TextCleaner¶

Cleans utterance text by fixing unicode errors, lowercasing, removing line breaks, and replacing URLs, emails, phone numbers, and currency symbols with special tokens. Supports custom cleaning functions.

Transformer: TextCleaner
Tags: pre-processing, utterance, linguistic
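
A few of the cleaning steps listed above can be approximated with regular expressions. The sketch below is an assumption-laden illustration rather than TextCleaner's implementation: the patterns are deliberately simple, the replacement tokens `<url>` and `<email>` are hypothetical, and phone numbers, currency symbols, and unicode repair are omitted.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def clean_text(text):
    """Lowercase, flatten line breaks, and mask URLs and email addresses."""
    text = URL_RE.sub("<url>", text)
    text = EMAIL_RE.sub("<email>", text)
    # Collapse each run of line breaks (and surrounding whitespace) into one space.
    text = re.sub(r"\s*[\r\n]+\s*", " ", text)
    return text.lower().strip()


cleaned = clean_text("See https://example.com/a\nor mail Bob@Example.org NOW")
```

A custom cleaning function of this shape, taking a string and returning a string, is the kind of callable the "custom cleaning functions" option refers to.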

TextParser¶

Dependency-parses each utterance in a corpus using spaCy. This parsing step is a prerequisite for several other ConvoKit transformers, including TextToArcs.

Transformer: TextParser
Tags: pre-processing, parsing, utterance, linguistic

Example: Text Preprocessing

TextProcessor (base class)¶

Abstract base class for text processing transformers in ConvoKit. Provides a shared interface for transformers that read and write utterance text fields, enabling pipeline composition of text processing steps.

Transformer: TextProcessor
Tags: pre-processing, utterance, linguistic

Example: Text Preprocessing

TextToArcs¶

Converts dependency parse output into arc-based representations, where each sentence is expressed as a collection of dependency arc strings. Requires TextParser to be run first.

Transformer: TextToArcs
Tags: pre-processing, structural, parsing, utterance, linguistic, pattern

Thread Embedder¶

Embeds thread-level hypergraph statistics from HyperConvo into a low-dimensional space using SVD or other dimensionality reduction. Useful for visualizing and comparing thread structure across a corpus.

Transformer: ThreadEmbedder
Research: Patterns of Participant Interactions
Tags: measurement, feature extraction, statistical, corpus, pattern

Vector Classifier¶

Trains and applies a classifier on corpus object vector representations (e.g. bag-of-words, TF-IDF). Inherits from Classifier. Requires a ConvoKitMatrix with the specified vector name to be present on the corpus.

Transformer: VectorClassifier
Tags: classification, modeling, vectorization, utterance, conversation, speaker, labeling

Example: Bag-of-Words classification
