Features & APIs

ConvoKit provides a comprehensive set of analysis tools for extracting conversational features and studying social phenomena.

TextParser

Dependency-parses each utterance in a corpus using SpaCy. This parsing step is a prerequisite for several other ConvoKit transformers.

  • API: TextParser

  • Tags: pre-processing, parsing, utterance, linguistic

Example: Text Preprocessing

TextToArcs

Converts dependency parse output into arc-based representations, where each sentence is expressed as a collection of dependency arc strings. Requires TextParser to be run first.

  • API: TextToArcs

  • Tags: pre-processing, structural, parsing, utterance, linguistic, pattern
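
The arc representation can be illustrated without SpaCy. The following is a minimal sketch, not the TextToArcs implementation: it assumes a hand-written parse stored as a list of token dictionaries and a hypothetical "head_child" string format, both chosen for illustration. The strings ConvoKit actually emits may differ.

```python
def tokens_to_arcs(tokens):
    """Turn a toy dependency parse into a list of 'head_child' arc strings.

    tokens: list of dicts with 'text' and 'head' (index of the head token,
    or None for the root). This layout is an assumption for illustration.
    """
    arcs = []
    for tok in tokens:
        child = tok["text"].lower()
        if tok["head"] is None:
            # The root has no head token, so mark it with a ROOT prefix.
            arcs.append("ROOT_" + child)
        else:
            head = tokens[tok["head"]]["text"].lower()
            arcs.append(head + "_" + child)
    return arcs


# "She reads books": "reads" is the root; the other tokens attach to it.
parse = [
    {"text": "She", "head": 1},
    {"text": "reads", "head": None},
    {"text": "books", "head": 1},
]
arcs = tokens_to_arcs(parse)
```

Each sentence becomes a flat collection of strings, which is what lets downstream steps treat arcs like ordinary vocabulary items.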

TextCleaner

Cleans utterance text by fixing unicode errors, lowercasing, removing line breaks, and replacing URLs, emails, phone numbers, and currency symbols with special tokens. Supports custom cleaning functions.

  • API: TextCleaner

  • Tags: pre-processing, utterance, linguistic
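
A few of the cleaning steps listed above can be sketched with the standard library alone. This is an illustrative stand-in, not the TextCleaner code: the regular expressions and the "<url>" and "<email>" placeholder tokens are assumptions, and unicode repair, phone numbers, and currency symbols are left out.

```python
import re

# Deliberately simple patterns; real-world URLs and emails are messier.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
LINE_BREAK_RE = re.compile(r"\s*[\r\n]+\s*")


def clean_text(text):
    """Lowercase, replace URLs and emails with tokens, and remove line breaks."""
    text = text.lower()
    text = URL_RE.sub("<url>", text)
    text = EMAIL_RE.sub("<email>", text)
    text = LINE_BREAK_RE.sub(" ", text)
    return text.strip()


cleaned = clean_text("See https://example.com\nor mail Me@Example.org today")
```

A function with this shape (string in, string out) is also the kind of thing one would pass as a custom cleaning function.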

TextProcessor (base class)

Abstract base class for text processing transformers in ConvoKit. Provides a shared interface for transformers that read and write utterance text fields, enabling pipeline composition of text processing steps.

Example: Text Preprocessing

Bag-of-Words

Vectorizes corpus objects (utterances, speakers, or conversations) using bag-of-words representations. Compatible with any sklearn-style vectorizer including TF-IDF. Stores representations as ConvoKitMatrix objects for downstream use with classifiers.

  • API: BoWTransformer

  • Tags: feature extraction, vectorization, speaker, utterance, conversation

Example: Bag-of-Words classification
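
To make the representation concrete, here is a minimal pure-Python sketch of count-based bag-of-words vectors. It does not use BoWTransformer or sklearn; the whitespace tokenizer and the `bag_of_words` helper are assumptions made for illustration.

```python
from collections import Counter


def bag_of_words(texts):
    """Return (vocabulary, count matrix) for a list of texts.

    Tokenization is a plain lowercase whitespace split, which is a
    simplification compared with a real vectorizer.
    """
    vocab = sorted({word for text in texts for word in text.lower().split()})
    rows = []
    for text in texts:
        counts = Counter(text.lower().split())
        rows.append([counts[word] for word in vocab])
    return vocab, rows


vocab, matrix = bag_of_words(["good point", "good good question"])
```

Each row is one object's vector and each column is one vocabulary term, which is the layout a downstream classifier expects.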

Column Normalized Tf-Idf

A modified Tf-Idf transformer that normalizes by columns (term-wise) rather than rows, producing more balanced term representations across a corpus. Often used as input to the Expected Conversational Context Framework.

Hypergraph Conversation Representation

Extracts structural features of conversations through a hypergraph model, computing degree distribution statistics and motif counts for both full conversations and mid-threads. Forms the basis for ThreadEmbedder and CommunityEmbedder.

Example: Reddit hypergraph analysis

Phrasing Motifs

Extracts arc-based phrasing patterns from dependency-parsed utterances by abstracting away content words, capturing common syntactic structures independently of topic. Used as input to the Prompt Types framework.

Example: Phrasing motifs in prompt type models

Politeness Strategies

Detects lexical and parse-based politeness and impoliteness strategies in utterances based on the Brown and Levinson politeness framework, producing binary feature vectors over a set of validated linguistic markers.

Example: Extracting politeness features and markers

Prompt Types

Infers latent types of conversational prompts based on how they are phrased, using SVD-based embeddings of phrasing motifs relative to their response contexts. Assigns each utterance a prompt type and vector representation.

Expected Conversational Context Framework

Derives representations of utterances and terms based on their expected conversational context — the replies they tend to elicit or the utterances they tend to appear near. Supports both forward and backward context modeling.

Redirection and Utterance Likelihood

Measures the extent to which an utterance redirects conversational flow away from its context, and computes utterance log-likelihoods given surrounding context using a language model.

Example: Redirection in Supreme Court

Pivotal Moment Measure

Identifies pivotal moments in a conversation where simulated alternative responses would most change the predicted outcome. Combines a simulator model and a forecaster model to score each conversational position.

Example: Pivotal moments in conversations gone awry

LLM Prompt Transformer

Applies custom LLM prompts to corpus objects at any level — utterances, conversations, speakers, or the entire corpus — and stores responses as metadata. Supports multiple LLM providers including OpenAI GPT, Google Gemini, and local models.

  • API: LLMPromptTransformer

  • Tags: feature extraction, utterance, conversation, speaker, corpus, pragmatics

Example: GenAI module

Classifier

Trains and applies a classifier on corpus object metadata features. Uses an sklearn-compatible classifier and exposes evaluation utilities including cross-validation and train-test split scoring.

  • API: Classifier

  • Tags: classification, modeling, utterance, conversation, speaker, labeling

Example: Politeness

Vector Classifier

Trains and applies a classifier on corpus object vector representations (e.g. bag-of-words, TF-IDF). Inherits from Classifier. Requires a ConvoKitMatrix with the specified vector name to be present on the corpus.

  • API: VectorClassifier

  • Tags: classification, modeling, vectorization, utterance, conversation, speaker, labeling

Example: Bag-of-Words classification

Linguistic Coordination

Measures the propensity of a speaker to echo the function words used by another speaker in a conversation, serving as a proxy for linguistic coordination and relative power dynamics between individuals or groups.

Example: Power balance in U.S. Supreme Court
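
One common way to formalize this kind of echoing is to compare how often a reply uses a marker class when the preceding utterance used it against how often replies use it overall. The sketch below follows that idea under stated assumptions: the three-word marker set, the whitespace tokenizer, and the `coordination` helper are illustrative and are not the ConvoKit API.

```python
ARTICLES = {"the", "a", "an"}  # toy function-word class, chosen for illustration


def uses_marker(text, marker_words):
    """True if any token of the text belongs to the marker class."""
    return any(word in marker_words for word in text.lower().split())


def coordination(exchanges, marker_words):
    """P(reply uses marker | prompt uses marker) - P(reply uses marker).

    exchanges: list of (prompt_text, reply_text) pairs.
    Returns None when no prompt uses the marker.
    """
    reply_uses = [uses_marker(reply, marker_words) for _, reply in exchanges]
    triggered = [
        uses_marker(reply, marker_words)
        for prompt, reply in exchanges
        if uses_marker(prompt, marker_words)
    ]
    if not triggered:
        return None
    return sum(triggered) / len(triggered) - sum(reply_uses) / len(reply_uses)


exchanges = [
    ("the case is closed", "the court agrees"),
    ("read the brief", "a fine brief"),
    ("why now", "because time matters"),
    ("explain yourself", "no"),
]
score = coordination(exchanges, ARTICLES)
```

A positive score means replies use articles more often right after a prompt that used them than they do on average.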

Fighting Words

Identifies the n-gram features that most distinguish two groups of corpus objects, using Monroe et al.’s Dirichlet-multinomial method. Annotates objects with the top fighting words of each class.

  • API: FightingWords

  • Tags: measurement, statistical, utterance, conversation, speaker, power, influence, social, pattern, comparison

Example: r/atheism vs r/Christianity
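
The core of the Monroe et al. approach is a z-scored difference of log-odds under a Dirichlet prior. The sketch below works on unigrams with a uniform prior; the prior value, the whitespace tokenizer, and the `fighting_words` helper are assumptions for illustration and do not reproduce the FightingWords transformer.

```python
import math
from collections import Counter


def fighting_words(texts_a, texts_b, prior=0.1):
    """Return {word: z-score}; positive favors group A, negative group B.

    Uses a uniform Dirichlet prior (`prior` pseudo-counts per word).
    Assumes the combined vocabulary has more than one word.
    """
    counts_a = Counter(w for t in texts_a for w in t.lower().split())
    counts_b = Counter(w for t in texts_b for w in t.lower().split())
    vocab = set(counts_a) | set(counts_b)
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    prior_total = prior * len(vocab)

    z_scores = {}
    for word in vocab:
        y_a, y_b = counts_a[word], counts_b[word]
        # Smoothed log-odds of the word within each group.
        log_odds_a = math.log((y_a + prior) / (n_a + prior_total - y_a - prior))
        log_odds_b = math.log((y_b + prior) / (n_b + prior_total - y_b - prior))
        # Approximate variance of the log-odds difference.
        variance = 1 / (y_a + prior) + 1 / (y_b + prior)
        z_scores[word] = (log_odds_a - log_odds_b) / math.sqrt(variance)
    return z_scores


z = fighting_words(
    ["faith and hope", "faith and grace"],
    ["proof and logic", "proof and reason"],
)
top_a = max(z, key=z.get)
top_b = min(z, key=z.get)
```

Words used equally by both groups, such as "and" here, land near zero, while words concentrated in one group get large positive or negative scores.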

Forecaster

A framework for forecasting future conversation outcomes as they develop in real time. Wraps any ForecasterModel (e.g. CRAFT, BERT-based models) and feeds a chronological stream of context tuples to enable per-utterance prediction.

Example: CRAFT on CGA
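
The idea of feeding a model a chronological stream of contexts can be sketched as a generator. Everything here is hypothetical: the `(timestamp, text)` layout, the `context_stream` helper, and the `toy_forecast` scorer stand in for the real context tuples and ForecasterModel, whose fields and interfaces are not described in this section.

```python
def context_stream(utterances):
    """Yield (context, current) pairs in chronological order.

    utterances: list of (timestamp, text). The context holds every
    utterance up to and including the current one.
    """
    ordered = sorted(utterances, key=lambda u: u[0])
    for i in range(len(ordered)):
        context = [text for _, text in ordered[: i + 1]]
        yield context, context[-1]


def toy_forecast(context):
    """Hypothetical model: share of context utterances containing '!'."""
    return sum("!" in text for text in context) / len(context)


convo = [
    (2, "That is wrong!"),
    (1, "I think the edit is fine."),
    (3, "Stop reverting me!"),
]
forecasts = [
    (current, toy_forecast(context)) for context, current in context_stream(convo)
]
```

Because each prediction sees only the utterances available at that point, the scores mimic what a model could know as the conversation unfolds.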

Thread Embedder

Embeds thread-level hypergraph statistics from HyperConvo into a low-dimensional space using SVD or other dimensionality reduction. Useful for visualizing and comparing thread structure across a corpus.

Community Embedder

Embeds community-level hypergraph statistics from HyperConvo into a low-dimensional space, enabling comparison and visualization of structural differences across communities or subreddits.

Pairer

Annotates corpus objects with pairing information needed for paired prediction analyses. Controls for conversational context by pairing objects from the same conversation, enabling comparisons that isolate the variable of interest.

  • API: Pairer

  • Tags: pre-processing, prediction, statistical, utterance, conversation, speaker, representation, pattern
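
A minimal sketch of within-conversation pairing, assuming each object is a dictionary with hypothetical `id`, `conversation_id`, and boolean `label` fields. It keeps the first positive and first negative object per conversation, which is a simplification; the selection rules used by Pairer are not reproduced here.

```python
from collections import defaultdict


def pair_within_conversations(objects):
    """Return (conversation_id, positive_id, negative_id) tuples.

    Conversations lacking either a positive or a negative object
    produce no pair.
    """
    by_convo = defaultdict(lambda: {"pos": [], "neg": []})
    for obj in objects:
        group = "pos" if obj["label"] else "neg"
        by_convo[obj["conversation_id"]][group].append(obj["id"])

    pairs = []
    for convo_id, groups in by_convo.items():
        if groups["pos"] and groups["neg"]:
            pairs.append((convo_id, groups["pos"][0], groups["neg"][0]))
    return pairs


objects = [
    {"id": "u1", "conversation_id": "c1", "label": True},
    {"id": "u2", "conversation_id": "c1", "label": False},
    {"id": "u3", "conversation_id": "c2", "label": True},
]
pairs = pair_within_conversations(objects)
```

Conversation "c2" is dropped because it has no negative object to pair with, so every retained pair shares its conversational context.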

Paired Prediction

A quasi-experimental prediction method that controls for conversation-level confounds by comparing matched pairs of corpus objects drawn from the same conversation, so that differences within a pair are more plausibly attributable to the variable of interest.

Example: Predicting conversation growth on Reddit

Ranker

Sorts and annotates corpus objects with rankings based on a user-defined scoring function. Supports ranking of utterances, speakers, or conversations by any derived or metadata feature.

  • API: Ranker

  • Tags: sorting, statistical, utterance, conversation, speaker, representation

Example: Ranking users in r/Cornell by comment count
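
Ranking by a user-defined scoring function amounts to a sort plus an annotation step. The sketch below assumes objects are plain dictionaries; the `rank_objects` helper and the `score` and `rank` field names are illustrative rather than the names Ranker uses.

```python
def rank_objects(objects, score_func):
    """Return copies of the objects annotated with 'score' and 'rank'.

    Rank 1 goes to the highest score. Ties keep their input order
    because Python's sort is stable.
    """
    ordered = sorted(objects, key=score_func, reverse=True)
    return [
        {**obj, "score": score_func(obj), "rank": position}
        for position, obj in enumerate(ordered, start=1)
    ]


speakers = [
    {"id": "alice", "n_comments": 12},
    {"id": "bob", "n_comments": 40},
    {"id": "carol", "n_comments": 7},
]
ranked = rank_objects(speakers, lambda s: s["n_comments"])
```

Swapping in a different `score_func` changes the ranking criterion without touching the sorting logic.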

Linguistic Diversity

Computes the linguistic divergence between a speaker’s language in each conversation and a reference language model trained on other conversations or speakers, measuring how a speaker’s voice develops over time.

Example: Linguistic diversity on ChangeMyView

Summary of Conversation Dynamics (SCD)

Generates structured natural-language summaries of conversational dynamics using the LLM Prompt Transformer. Summaries describe how interactions unfold over time, capturing turn-by-turn shifts in tone, topic, and social dynamics.

Example: SCD on conversations gone awry

Conversation Dynamics Similarity (ConDynS)

Measures the similarity between two conversations with respect to their conversational dynamics using SCD-based representations and sequence alignment. Enables topic-independent comparison of how conversations unfold.

Talk-Time Sharing Dynamics

Analyzes how talk-time is distributed and evolves between speakers throughout a conversation, capturing both overall balance and moment-to-moment dynamics in participation patterns.

  • API: TalkTimeSharingDynamics

  • Tags: measurement, feature extraction, statistical, corpus, pattern, conversation-flow, social, comparison

Example: Talk-time in CANDOR and Supreme Court
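
One simple way to look at both overall balance and moment-to-moment shifts is to compute each speaker's share of speaking time within fixed windows. The sketch below assumes turns are given as hypothetical `(speaker, start, end)` tuples in seconds and uses non-overlapping windows; it is not the TalkTimeSharingDynamics implementation.

```python
def talk_time_shares(turns, window, total):
    """Return one {speaker: share} dict per window of `window` seconds.

    turns: list of (speaker, start, end) in seconds.
    total: conversation length in seconds.
    Shares are fractions of the time actually spoken in the window.
    """
    shares = []
    start = 0.0
    while start < total:
        end = min(start + window, total)
        talk = {}
        for speaker, t0, t1 in turns:
            # Portion of this turn that falls inside the current window.
            overlap = min(t1, end) - max(t0, start)
            if overlap > 0:
                talk[speaker] = talk.get(speaker, 0.0) + overlap
        spoken = sum(talk.values())
        shares.append({s: t / spoken for s, t in talk.items()} if spoken else {})
        start = end
    return shares


turns = [("A", 0, 30), ("B", 30, 40), ("A", 40, 60), ("B", 60, 120)]
shares = talk_time_shares(turns, window=60, total=120)
```

In this toy conversation speaker A dominates the first minute and speaker B holds the floor for the entire second minute, a shift that a single whole-conversation ratio would hide.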