Features & APIs¶
ConvoKit provides a comprehensive set of analysis tools for extracting conversational features and studying social phenomena.
TextParser¶
Dependency-parses each utterance in a corpus using spaCy. This parsing step is a prerequisite for several other ConvoKit transformers.
API: TextParser
Tags: pre-processing, parsing, utterance, linguistic
Example: Text Preprocessing
TextToArcs¶
Converts dependency parse output into arc-based representations, where each sentence is expressed as a collection of dependency arc strings. Requires TextParser to be run first.
API: TextToArcs
Tags: pre-processing, structural, parsing, utterance, linguistic, pattern
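As a rough illustration of what an arc-based representation looks like, the sketch below turns a hand-written dependency parse into arc strings. The token dictionary layout, the `to_arcs` helper, and the `parent_child` / `root_*` string format are assumptions made for this example; they are not taken from the TextToArcs implementation.

```python
def to_arcs(sentence):
    """Return one arc string per token of a parsed sentence.

    `sentence` is a list of token dicts with keys "tok" (the word) and
    "head" (index of the parent token, or None for the root). This layout
    is hypothetical and chosen only to keep the example self-contained.
    """
    arcs = []
    for token in sentence:
        word = token["tok"].lower()
        if token["head"] is None:
            # Root token: no parent, so mark it with a wildcard.
            arcs.append(f"{word}_*")
        else:
            parent = sentence[token["head"]]["tok"].lower()
            arcs.append(f"{parent}_{word}")
    return arcs


# "Could you help?" with "help" as the root of the parse.
parsed = [
    {"tok": "Could", "head": 2},
    {"tok": "you", "head": 2},
    {"tok": "help", "head": None},
]
arcs = to_arcs(parsed)
print(" ".join(arcs))
```

Representing a sentence as a bag of such strings lets downstream steps treat syntactic relations like ordinary vocabulary items.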
TextCleaner¶
Cleans utterance text by fixing unicode errors, lowercasing, removing line breaks, and replacing URLs, emails, phone numbers, and currency symbols with special tokens. Supports custom cleaning functions.
API: TextCleaner
Tags: pre-processing, utterance, linguistic
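The kind of cleaning described above can be sketched with the standard library alone. This is a minimal sketch, not the TextCleaner implementation: the `clean_text` function, the regular expressions, and the `<url>` and `<email>` replacement tokens are all assumptions made for illustration, and only a subset of the listed steps is covered.

```python
import re

# Deliberately simple patterns; real-world URL and email matching is messier.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")


def clean_text(text):
    """Replace URLs and emails with tokens, drop line breaks, lowercase."""
    text = URL_RE.sub("<url>", text)
    text = EMAIL_RE.sub("<email>", text)
    text = text.replace("\n", " ")
    # Collapse any runs of whitespace left behind by the steps above.
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()


raw = "See https://example.com\nor mail Bob@Example.org now"
cleaned = clean_text(raw)
print(cleaned)
```

A custom cleaning step would take the same shape: any function from string to string can stand in for `clean_text`.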
TextProcessor (base class)¶
Abstract base class for text processing transformers in ConvoKit. Provides a shared interface for transformers that read and write utterance text fields, enabling pipeline composition of text processing steps.
Tags: pre-processing, utterance, linguistic
Example: Text Preprocessing
Bag-of-Words¶
Vectorizes corpus objects (utterances, speakers, or conversations) using bag-of-words representations. Compatible with any sklearn-style vectorizer including TF-IDF. Stores representations as ConvoKitMatrix objects for downstream use with classifiers.
API: BoWTransformer
Tags: feature extraction, vectorization, speaker, utterance, conversation
Example: Bag-of-Words classification
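For readers unfamiliar with the representation, a bag-of-words matrix can be built in a few lines of plain Python. The sketch below assumes whitespace tokenization and uses a hypothetical `bow_matrix` helper; it does not use BoWTransformer or ConvoKitMatrix and only shows the shape of the data they work with.

```python
from collections import Counter


def bow_matrix(texts):
    """Return (vocab, rows), where rows[i][j] counts vocab[j] in texts[i]."""
    tokenized = [text.lower().split() for text in texts]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


texts = ["thank you so much", "no thank you"]
vocab, rows = bow_matrix(texts)
print(vocab)
for row in rows:
    print(row)
```

Each row is one corpus object and each column one vocabulary term, which is the layout a downstream classifier expects.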
Column Normalized Tf-Idf¶
A modified Tf-Idf transformer that normalizes by columns (term-wise) rather than rows, producing more balanced term representations across a corpus. Often used as input to the Expected Conversational Context Framework.
Tags: feature extraction, utterance, representation
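The difference between row-wise and column-wise normalization can be seen on a small weighted document-term matrix. The sketch below assumes L2 normalization and uses hypothetical helper names; it is not the library's transformer, only an illustration of which axis gets scaled.

```python
import math


def normalize_columns(matrix):
    """Scale each column (term) of the matrix to unit L2 norm."""
    n_cols = len(matrix[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_cols)]
    return [
        [row[j] / norms[j] if norms[j] else 0.0 for j in range(n_cols)]
        for row in matrix
    ]


def normalize_rows(matrix):
    """Scale each row (document) of the matrix to unit L2 norm."""
    result = []
    for row in matrix:
        norm = math.sqrt(sum(value ** 2 for value in row))
        result.append([value / norm if norm else 0.0 for value in row])
    return result


# Rows are documents, columns are terms; the values stand in for tf-idf weights.
weights = [[3.0, 0.0], [4.0, 2.0]]
by_column = normalize_columns(weights)
by_row = normalize_rows(weights)
print(by_column)
print(by_row)
```

After column normalization every term has the same overall magnitude across the corpus, so frequent terms no longer dominate simply because they occur in many documents.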
Hypergraph Conversation Representation¶
Extracts structural features of conversations through a hypergraph model, computing degree distribution statistics and motif counts for both full conversations and mid-threads. Forms the basis for ThreadEmbedder and CommunityEmbedder.
API: HyperConvo
Research: Patterns of Participant Interactions
Tags: structural, graph, conversation, pattern
Example: Reddit hypergraph analysis
Phrasing Motifs¶
Extracts arc-based phrasing patterns from dependency-parsed utterances by abstracting away content words, capturing common syntactic structures independently of topic. Used as input to the Prompt Types framework.
API: PhrasingMotifs
Research: Asking Too Much?
Tags: feature extraction, utterance, linguistic, pragmatics
Politeness Strategies¶
Detects lexical and parse-based politeness and impoliteness strategies in utterances based on the Brown and Levinson politeness framework, producing binary feature vectors over a set of validated linguistic markers.
API: PolitenessStrategies
Research: A Computational Approach to Politeness
Tags: measurement, statistical, conversation, linguistic, social, politeness, pragmatics
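To make the idea of a binary feature vector over linguistic markers concrete, the sketch below checks an utterance against a tiny hand-made lexicon. The strategy names, the word lists, and the `politeness_features` helper are hypothetical and much cruder than the lexical and parse-based markers the transformer relies on.

```python
import re

# Hypothetical toy lexicon: strategy name -> words that signal it.
MARKERS = {
    "gratitude": {"thanks", "thank"},
    "please": {"please"},
    "apologizing": {"sorry", "apologies"},
}


def politeness_features(text):
    """Return a dict mapping each strategy to 1 if any marker word is present."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return {name: int(bool(tokens & words)) for name, words in MARKERS.items()}


features = politeness_features("Thanks, could you please check?")
print(features)
```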
Prompt Types¶
Infers latent types of conversational prompts based on how they are phrased, using SVD-based embeddings of phrasing motifs relative to their response contexts. Assigns each utterance a prompt type and vector representation.
API: PromptTypes
Research: Asking Too Much?
Tags: feature extraction, vectorization, utterance, context, pragmatics
Expected Conversational Context Framework¶
Derives representations of utterances and terms based on their expected conversational context — the replies they tend to elicit or the utterances they tend to appear near. Supports both forward and backward context modeling.
API: ExpectedContextModel
Research: Expected Context Framework
Tags: structural, modeling, utterance, exchange, linguistic, context
Redirection and Utterance Likelihood¶
Measures the extent to which an utterance redirects conversational flow away from its context, and computes utterance log-likelihoods given surrounding context using a language model.
Research: Conversational Redirection in Therapy
Tags: structural, modeling, utterance, exchange, conversation-flow, detection, LLM, simulation
Example: Redirection in Supreme Court
Pivotal Moment Measure¶
Identifies pivotal moments in a conversation where simulated alternative responses would most change the predicted outcome. Combines a simulator model and a forecaster model to score each conversational position.
API: PivotalMomentMeasure
Tags: prediction, modeling, conversation, turning-points, simulation
LLM Prompt Transformer¶
Applies custom LLM prompts to corpus objects at any level — utterances, conversations, speakers, or the entire corpus — and stores responses as metadata. Supports multiple LLM providers including OpenAI GPT, Google Gemini, and local models.
API: LLMPromptTransformer
Tags: feature extraction, utterance, conversation, speaker, corpus, pragmatics
Example: GenAI module
Classifier¶
Trains and applies a classifier on corpus object metadata features. Uses an sklearn-compatible classifier and exposes evaluation utilities including cross-validation and train-test split scoring.
API: Classifier
Tags: classification, modeling, utterance, conversation, speaker, labeling
Example: Politeness
Vector Classifier¶
Trains and applies a classifier on corpus object vector representations (e.g. bag-of-words, TF-IDF). Inherits from Classifier. Requires a ConvoKitMatrix with the specified vector name to be present on the corpus.
API: VectorClassifier
Tags: classification, modeling, vectorization, utterance, conversation, speaker, labeling
Example: Bag-of-Words classification
Linguistic Coordination¶
Measures the propensity of a speaker to echo the function words used by another speaker in a conversation, serving as a proxy for linguistic coordination and relative power dynamics between individuals or groups.
API: Coordination
Research: Echoes of Power
Tags: measurement, statistical, conversation, linguistic, power, influence
Example: Power balance in U.S. Supreme Court
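One common way to quantify this kind of echoing is to compare how often a reply contains a marker when the preceding utterance contained it against how often replies contain it overall. The sketch below follows that formulation under simplifying assumptions: a single marker category, whitespace tokenization, and hypothetical helper names. It is not the Coordination transformer itself.

```python
ARTICLES = {"a", "an", "the"}  # hypothetical stand-in for one function-word category


def has_marker(text, marker_words):
    return any(tok in marker_words for tok in text.lower().split())


def coordination(exchanges, marker_words):
    """P(reply has marker | prompt has marker) - P(reply has marker).

    `exchanges` is a list of (prompt_text, reply_text) pairs. Returns None
    when no prompt contains the marker, since the score is then undefined.
    """
    reply_has = [has_marker(reply, marker_words) for _, reply in exchanges]
    triggered = [
        has_marker(reply, marker_words)
        for prompt, reply in exchanges
        if has_marker(prompt, marker_words)
    ]
    if not triggered:
        return None
    baseline = sum(reply_has) / len(reply_has)
    conditional = sum(triggered) / len(triggered)
    return conditional - baseline


exchanges = [
    ("pass the salt", "the salt is here"),
    ("is it late", "yes very"),
    ("take a seat", "thanks for the seat"),
    ("hello there", "a pleasure"),
]
score = coordination(exchanges, ARTICLES)
print(score)
```

A positive score means the replier uses the marker more often right after the other speaker has used it than they do in general.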
Fighting Words¶
Identifies the n-gram features that most distinguish two groups of corpus objects, using Monroe et al.’s Dirichlet-multinomial method. Annotates objects with the top fighting words of each class.
API: FightingWords
Tags: measurement, statistical, utterance, conversation, speaker, power, influence, social, pattern, comparison
Examples: r/atheism vs r/Christianity
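The scoring behind this kind of comparison can be sketched as a log-odds ratio with a Dirichlet prior, scaled by an estimate of its variance. The sketch below assumes unigrams, whitespace tokenization, and a uniform prior; the `fighting_words` function and its `prior` parameter are hypothetical names, and the FightingWords transformer may differ in its details.

```python
import math
from collections import Counter


def fighting_words(texts_a, texts_b, prior=0.1):
    """Return {word: z-score}; positive scores favor group A, negative group B."""
    counts_a = Counter(tok for text in texts_a for tok in text.lower().split())
    counts_b = Counter(tok for text in texts_b for tok in text.lower().split())
    vocab = set(counts_a) | set(counts_b)
    if len(vocab) < 2:
        raise ValueError("need at least two distinct words to compare groups")
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    prior_total = prior * len(vocab)  # uniform prior mass over the vocabulary
    scores = {}
    for word in vocab:
        y_a, y_b = counts_a[word], counts_b[word]
        log_odds_a = math.log((y_a + prior) / (n_a + prior_total - y_a - prior))
        log_odds_b = math.log((y_b + prior) / (n_b + prior_total - y_b - prior))
        variance = 1.0 / (y_a + prior) + 1.0 / (y_b + prior)
        scores[word] = (log_odds_a - log_odds_b) / math.sqrt(variance)
    return scores


group_a = ["god is great", "god bless"]
group_b = ["science is great", "science works"]
scores = fighting_words(group_a, group_b)
for word in sorted(scores, key=scores.get, reverse=True):
    print(f"{word:10s} {scores[word]:+.3f}")
```

Dividing by the standard deviation keeps rare words from dominating the ranking just because their raw log-odds are extreme.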
Forecaster¶
A framework for forecasting the eventual outcome of a conversation while it is still unfolding. Wraps any ForecasterModel (e.g. CRAFT, BERT-based models) and feeds it a chronological stream of context tuples to enable per-utterance prediction.
API: Forecaster
Research: Trouble on the Horizon
Tags: prediction, machine learning, neural, utterance, forecasting, LLM
Example: CRAFT on CGA
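The chronological stream mentioned above can be pictured as a generator that, for each utterance, yields everything said so far. This is a simplified sketch: the `ContextTuple` fields, the utterance dict layout, and the `context_stream` function are assumptions for illustration and do not reproduce the Forecaster's actual data structures.

```python
from collections import namedtuple

# Hypothetical, reduced context tuple: only the history and the newest utterance.
ContextTuple = namedtuple("ContextTuple", ["context", "current_utterance"])


def context_stream(utterances):
    """Yield one ContextTuple per utterance, in timestamp order.

    The context never includes utterances later than the current one, so a
    model consuming the stream cannot peek at the future.
    """
    ordered = sorted(utterances, key=lambda utt: utt["timestamp"])
    for i, utt in enumerate(ordered):
        yield ContextTuple(context=ordered[: i + 1], current_utterance=utt)


utterances = [
    {"id": "u2", "timestamp": 2, "text": "I disagree."},
    {"id": "u1", "timestamp": 1, "text": "Here is my edit."},
    {"id": "u3", "timestamp": 3, "text": "Let's discuss."},
]
stream = list(context_stream(utterances))
for item in stream:
    print(item.current_utterance["id"], [utt["id"] for utt in item.context])
```

A forecasting model would score each tuple in turn, giving one prediction per point in the conversation.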
Thread Embedder¶
Embeds thread-level hypergraph statistics from HyperConvo into a low-dimensional space using SVD or other dimensionality reduction. Useful for visualizing and comparing thread structure across a corpus.
API: ThreadEmbedder
Research: Patterns of Participant Interactions
Tags: measurement, feature extraction, statistical, corpus, pattern
Community Embedder¶
Embeds community-level hypergraph statistics from HyperConvo into a low-dimensional space, enabling comparison and visualization of structural differences across communities or subreddits.
API: CommunityEmbedder
Research: Patterns of Participant Interactions
Tags: measurement, feature extraction, statistical, corpus, pattern
Pairer¶
Annotates corpus objects with pairing information needed for paired prediction analyses. Controls for conversational context by pairing objects from the same conversation, enabling comparisons that isolate the variable of interest.
API: Pairer
Tags: pre-processing, prediction, statistical, utterance, conversation, speaker, representation, pattern
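A minimal sketch of within-conversation pairing, assuming each object is a dict with hypothetical `id` and `conversation_id` keys and that a caller-supplied function decides which objects count as positive. It keeps one pair per conversation and is not the Pairer implementation.

```python
def pair_within_conversations(objects, is_positive):
    """Return (positive_id, negative_id) pairs drawn from the same conversation.

    Conversations lacking either a positive or a negative object are skipped,
    because they cannot contribute a controlled comparison.
    """
    by_conversation = {}
    for obj in objects:
        by_conversation.setdefault(obj["conversation_id"], []).append(obj)

    pairs = []
    for members in by_conversation.values():
        positives = [obj for obj in members if is_positive(obj)]
        negatives = [obj for obj in members if not is_positive(obj)]
        if positives and negatives:
            pairs.append((positives[0]["id"], negatives[0]["id"]))
    return pairs


objects = [
    {"id": "a1", "conversation_id": "c1", "label": 1},
    {"id": "a2", "conversation_id": "c1", "label": 0},
    {"id": "b1", "conversation_id": "c2", "label": 1},
]
pairs = pair_within_conversations(objects, lambda obj: obj["label"] == 1)
print(pairs)
```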
Paired Prediction¶
A quasi-experimental prediction method that controls for confounds by comparing matched pairs of corpus objects from the same conversation, so that differences within a pair are less likely to reflect conversation-level factors.
API: PairedPrediction
Research: Antisocial Behavior in Online Discussion Communities
Tags: prediction, classification, machine learning, corpus, detection
Ranker¶
Sorts and annotates corpus objects with rankings based on a user-defined scoring function. Supports ranking of utterances, speakers, or conversations by any derived or metadata feature.
API: Ranker
Tags: sorting, statistical, utterance, conversation, speaker, representation
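Ranking by a user-defined scoring function amounts to a sort plus an annotation step. The sketch below assumes objects are plain dicts and uses a hypothetical `rank_objects` helper with made-up `score` and `rank` field names; it illustrates the idea and is not the Ranker API.

```python
def rank_objects(objects, score_func):
    """Return copies of the objects sorted by descending score, with ranks added."""
    ordered = sorted(objects, key=score_func, reverse=True)
    return [
        {**obj, "score": score_func(obj), "rank": position}
        for position, obj in enumerate(ordered, start=1)
    ]


speakers = [
    {"id": "x", "n_utterances": 3},
    {"id": "y", "n_utterances": 10},
    {"id": "z", "n_utterances": 7},
]
ranked = rank_objects(speakers, lambda speaker: speaker["n_utterances"])
for entry in ranked:
    print(entry["rank"], entry["id"], entry["score"])
```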
Linguistic Diversity¶
Computes the linguistic divergence between a speaker’s language in each conversation and a reference language model trained on other conversations or speakers, measuring how a speaker’s voice develops over time.
Research: Finding Your Voice
Tags: measurement, statistical, speaker, corpus, diversity, development
Example: Linguistic diversity on ChangeMyView
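One simple way to measure divergence from a reference language model is the cross-entropy of a speaker's tokens under that model. The sketch below assumes an add-one smoothed unigram model and whitespace tokens; the `cross_entropy` helper is hypothetical, and the measure used in the research may be computed differently.

```python
import math
from collections import Counter


def cross_entropy(target_tokens, reference_tokens):
    """Average negative log2 probability of target tokens under a unigram
    model estimated from reference_tokens with add-one smoothing.

    Assumes target_tokens is non-empty.
    """
    reference_counts = Counter(reference_tokens)
    vocab = set(reference_tokens) | set(target_tokens)
    total = len(reference_tokens) + len(vocab)
    negative_log_likelihood = 0.0
    for tok in target_tokens:
        negative_log_likelihood -= math.log2((reference_counts[tok] + 1) / total)
    return negative_log_likelihood / len(target_tokens)


reference = "the cat sat on the mat".split()
similar = cross_entropy("the cat sat".split(), reference)
different = cross_entropy("quantum flux capacitor".split(), reference)
print(round(similar, 3), round(different, 3))
```

Higher values indicate language that the reference model finds more surprising, that is, language further from what other speakers or conversations look like.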
Summary of Conversation Dynamics (SCD)¶
Generates structured natural-language summaries of conversational dynamics using the LLM Prompt Transformer. Summaries describe how interactions unfold over time, capturing turn-by-turn shifts in tone, topic, and social dynamics.
API: SCD
Research: How Did We Get Here? Summarizing Conversation Dynamics
Tags: measurement, feature extraction, LLM, corpus, pattern, conversation-flow, context
Example: SCD on conversations gone awry
Conversation Dynamics Similarity (ConDynS)¶
Measures the similarity between two conversations with respect to their conversational dynamics using SCD-based representations and sequence alignment. Enables topic-independent comparison of how conversations unfold.
Research: A Similarity Measure for Comparing Conversational Dynamics
Tags: measurement, feature extraction, LLM, corpus, pattern, conversation-flow, context, comparison
Talk-Time Sharing Dynamics¶
Analyzes how talk-time is distributed and evolves between speakers throughout a conversation, capturing both overall balance and moment-to-moment dynamics in participation patterns.
Tags: measurement, feature extraction, statistical, corpus, pattern, conversation-flow, social, comparison
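Both the overall balance and its evolution can be approximated by counting words per speaker, first over the whole conversation and then over consecutive windows. The sketch below assumes word counts as a proxy for talk-time and fixed-size, non-overlapping windows of utterances; the helper names are hypothetical and the actual analysis may define windows differently.

```python
from collections import Counter


def talk_time_shares(utterances):
    """Return {speaker: fraction of all words spoken} for (speaker, text) pairs."""
    words = Counter()
    for speaker, text in utterances:
        words[speaker] += len(text.split())
    total = sum(words.values())
    return {speaker: n / total for speaker, n in words.items()} if total else {}


def windowed_shares(utterances, window_size):
    """Compute talk-time shares over consecutive windows of utterances."""
    return [
        talk_time_shares(utterances[start : start + window_size])
        for start in range(0, len(utterances), window_size)
    ]


conversation = [
    ("A", "hello there how are you"),
    ("B", "fine"),
    ("A", "good"),
    ("B", "let me tell you about my day"),
]
overall = talk_time_shares(conversation)
windows = windowed_shares(conversation, window_size=2)
print(overall)
print(windows)
```

In this toy conversation the overall split is fairly even, while the windows show speaker A dominating early and speaker B dominating later, which is the kind of shift an aggregate measure hides.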