Datasets

ConvoKit ships with several datasets ready for use “out-of-the-box”. These datasets can be downloaded using the convokit.download() helper function. Alternatively, you can access them directly here.
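
As a minimal sketch of the download step, the snippet below wraps convokit.download() in a small helper. The helper name load_corpus is hypothetical (it is not part of ConvoKit), and the sketch assumes ConvoKit is installed, network access is available, and that any "Download name" listed on this page can be passed to it unchanged.

```python
def load_corpus(download_name: str):
    """Download a dataset by its download name and load it as a Corpus.

    `load_corpus` is a hypothetical convenience wrapper, not a ConvoKit API.
    """
    if not download_name or not download_name.strip():
        raise ValueError("download_name must be a non-empty string")

    # Imported lazily so the helper can be defined without ConvoKit installed.
    from convokit import Corpus, download

    # download() fetches the dataset and returns the local path to it;
    # Corpus(filename=...) then loads the corpus from that path.
    return Corpus(filename=download(download_name.strip()))


# Example call (requires ConvoKit and network access):
#   corpus = load_corpus("reddit-corpus-small")
#   corpus.print_summary_stats()
```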

Conversations Gone Awry

Three related corpora of conversations that derail into antisocial behavior.

CGA-WIKI: Wikipedia talk page conversations that derail into personal attacks as labeled by crowdworkers.

  • Download name: conversations-gone-awry-corpus

  • Tags: Wikipedia, derailment, online, asynchronous, outcome labels, summaries, persuasion, medium size, debate, medium conversations, timestamps

CGA-CMV: ChangeMyView discussion threads that derail into rule-violating behavior.

  • Download name: conversations-gone-awry-cmv-corpus

  • Tags: Reddit, derailment, online, asynchronous, outcome labels, summaries, persuasion, medium size, debate, medium conversations, timestamps

CGA-CMV-Large: Expanded version of CGA-CMV dataset.

  • Download name: conversations-gone-awry-cmv-corpus-large

  • Tags: Reddit, derailment, online, asynchronous, outcome labels, summaries, persuasion, medium size, debate, medium conversations, timestamps

Documentation

Cornell Movie-Dialogs Corpus

A large metadata-rich collection of fictional conversations extracted from raw movie scripts.

  • Download name: movie-corpus

  • Tags: fictional, speaker info, synchronous, large size, medium conversations

  • Documentation

Parliament Question Time Corpus

Parliamentary question periods from May 1979 to December 2016.

  • Download name: parliament-corpus

  • Tags: politics, speaker info, institutional, asymmetric, synchronous, short conversations, large size

  • Documentation

Supreme Court Corpus

A collection of conversations from U.S. Supreme Court oral arguments.

  • Download name: supreme-corpus

  • Tags: institutional, asymmetric, law, speaker info, outcome labels, in person, synchronous, long conversations, large size

  • Documentation

Wikipedia Talk Pages Corpus

A medium-size collection of conversations from Wikipedia editors’ talk pages.

  • Download name: wiki-corpus

  • Tags: online, asynchronous, Wikipedia, outcome labels, medium size, collaboration, medium conversations, timestamps

  • Documentation

Reddit Corpus

Reddit conversations from over 900k subreddits, arranged by subreddit. A small subset sampled from 100 highly active subreddits is also available.

  • Download name: subreddit-<name_of_subreddit> or reddit-corpus-small

  • Tags: large size, Reddit, online, asynchronous, timestamps

  • Documentation
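
The naming pattern above can be sketched as a small helper that builds the download name for a given subreddit. The function name reddit_download_name is hypothetical, and stripping a leading "r/" is an assumption made for convenience; the subreddit's capitalization is passed through unchanged.

```python
from typing import Optional


def reddit_download_name(subreddit: Optional[str] = None) -> str:
    """Build a Reddit Corpus download name (hypothetical helper).

    With no argument, return the name of the small sampled subset.
    """
    if subreddit is None:
        return "reddit-corpus-small"

    name = subreddit.strip()
    # Assumption: accept "r/Name" as a convenience and drop the prefix.
    if name.lower().startswith("r/"):
        name = name[2:]
    if not name:
        raise ValueError("subreddit name must not be empty")

    # Pattern from this page: subreddit-<name_of_subreddit>
    return f"subreddit-{name}"
```

The returned string can then be passed to convokit.download().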

WikiConv Corpus

Wikipedia talk page conversations from the distinct English, German, Russian, Chinese, and Greek versions of the site, based on the reconstruction described in this paper. Note that due to the large size of the data, every language but Greek is split up by year. We separately provide block data, retrieved directly from the Wikipedia block log, for reproducing the Trajectories of Blocked Community Members paper.

  • Download name: wikiconv-<language>-<year> for the English, German, Russian, and Chinese datasets, where <language> is the lowercase name of the language; wikiconv-greek for the Greek dataset.

  • Tags: large size, Wikipedia, online, asynchronous, timestamps, collaboration

  • Documentation
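
The naming rule for WikiConv can be sketched as follows. The function wikiconv_download_name is hypothetical; it only encodes the pattern stated above and does not check which years are actually available for a given language.

```python
from typing import Optional

# Languages that this page says are split up by year.
SPLIT_BY_YEAR = {"english", "german", "russian", "chinese"}


def wikiconv_download_name(language: str, year: Optional[int] = None) -> str:
    """Build a WikiConv download name (hypothetical helper)."""
    key = language.strip().lower()

    # Greek is the only language distributed as a single dataset,
    # so any year argument is ignored for it.
    if key == "greek":
        return "wikiconv-greek"

    if key not in SPLIT_BY_YEAR:
        raise ValueError(f"unsupported WikiConv language: {language!r}")
    if year is None:
        raise ValueError(f"a year is required for {key!r}")

    # Pattern from this page: wikiconv-<language>-<year>
    return f"wikiconv-{key}-{year}"
```

For example, wikiconv_download_name("English", 2015) yields "wikiconv-english-2015", which can be passed to convokit.download().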

Chromium Conversations Corpus

A collection of almost 1.5 million conversations and 2.8 million comments posted by developers reviewing proposed code changes in the Chromium project.

  • Download name: chromium-corpus

  • Tags: large size, online, asynchronous, utterance labels, speaker info, timestamps, collaboration, short conversations, work

  • Documentation

Tennis Interviews

Transcripts of tennis singles post-match press conferences at major tournaments from 2007 to 2015 (6,467 press conferences).

  • Download name: tennis-corpus

  • Tags: short conversations, interviews, sports, speaker info

  • Documentation

Winning Arguments Corpus

A metadata-rich subset of conversations made in the r/ChangeMyView subreddit between 1 Jan 2013 and 7 May 2015, with information on the delta (success) of a speaker’s utterance in convincing the poster.

  • Download name: winning-args-corpus

  • Tags: large size, Reddit, asynchronous, online, outcome labels, debate, persuasion, various topics

  • Documentation

Coarse Discourse Corpus

A subset of Reddit conversations that have been manually annotated with discourse act labels.

  • Download name: reddit-coarse-discourse-corpus

  • Tags: medium size, Reddit, online, asynchronous, utterance labels, various topics

  • Documentation

Persuasion For Good Corpus

A collection of online conversations generated by Amazon Mechanical Turk workers, where one participant (the persuader) tries to convince the other (the persuadee) to donate to a charity.

  • Download name: persuasionforgood-corpus

  • Tags: medium size, online, synchronous, speaker info, utterance labels, outcome labels, persuasion, dyadic, medium conversations

  • Documentation

Intelligence Squared Debates Corpus

Transcripts of debates held as part of Intelligence Squared Debates.

  • Download name: iq2-corpus

  • Tags: small size, in person, summaries, media, utterance labels, timestamps, outcome labels, debate, long conversations, various topics, politics

  • Documentation

Friends Corpus

A collection of all the conversations that occurred over 10 seasons of Friends, a popular American TV sitcom that ran in the 1990s.

  • Download name: friends-corpus

  • Tags: medium size, fictional, group, media, utterance labels, sarcasm

  • Documentation

Federal Open Market Committee (FOMC) Corpus

Transcripts of recurring meetings of the Federal Reserve’s Open Market Committee (FOMC), where important aspects of U.S. monetary policy are decided, covering the period 1977-2008.

  • Download name: fomc-corpus

  • Tags: small size, in person, timestamps, institutional, speaker info, utterance labels, politics, financial, long conversations

  • Documentation

NPR Interview 2P Dataset Corpus

This corpus contains conversations between NPR show hosts and their guests.

  • Download name: npr-2p-corpus

  • Tags: large size, in person, dyadic, media, Q&A, interviews, various topics

  • Documentation

DeliData Dataset Corpus

This corpus contains conversations in multi-party problem-solving contexts, containing information about group discussions and team performance.

  • Download name: deli-corpus

  • Tags: group, synchronous, medium size, summaries, outcome labels, problem solving, collaboration

  • Documentation

Switchboard Dialog Act Corpus

A collection of 1,155 five-minute telephone conversations between two participants, annotated with speech act tags.

  • Download name: switchboard-corpus

  • Tags: synchronous, dyadic, medium size, speaker info, summaries, various topics, debate

  • Documentation

Stanford Politeness Corpus

Two collections of requests (from Wikipedia and Stack Exchange respectively) with politeness annotations.

Stanford Politeness (Wikipedia): A collection of requests from Wikipedia Talk pages, annotated with politeness (4,353 utterances).

  • Download name: wikipedia-politeness-corpus

  • Tags: medium size, asynchronous, Wikipedia, utterance labels, online, short conversations, politeness

  • Documentation

Stanford Politeness (Stack Exchange): A collection of requests from Stack Exchange, annotated with politeness (6,603 utterances).

  • Download name: stack-exchange-politeness-corpus

  • Tags: medium size, asynchronous, Stack Exchange, utterance labels, online, short conversations, politeness

  • Documentation

Deception in Diplomacy Conversations

Conversational dataset with intended and perceived deception labels. Over 17,000 messages annotated by the sender for their intended truthfulness and by the receiver for their perceived truthfulness.

  • Download name: diplomacy-corpus

  • Tags: medium size, group, speaker info, utterance labels, negotiation, medium conversations, persuasion, collaboration, deception

  • Documentation

Group Affect and Performance (GAP) Corpus

A conversational dataset comprising group meetings of two to four participants that deliberate in a group decision-making exercise. This dataset contains 28 group meetings with a total of 84 participants.

  • Download name: gap-corpus

  • Tags: institutional, small size, in person, group, speaker info, timestamps, summaries, outcome labels, collaboration

  • Documentation

Wikipedia Articles for Deletion Corpus

A collection of Wikipedia’s Articles for Deletion editor debates that occurred between January 1, 2005 and December 31, 2018. This corpus contains about 3,200,000 contributions by approximately 150,000 Wikipedia editors across almost 400,000 debates.

  • Download name: wiki-articles-for-deletion-corpus

  • Tags: Wikipedia, large size, online, asynchronous, speaker info, utterance labels, outcome labels, timestamps, debate

  • Documentation

CaSiNo Corpus

CaSiNo (short for CampSite Negotiations) is a dataset of 1,030 negotiation dialogues. Two participants take the role of campsite neighbors and negotiate for Food, Water, and Firewood packages, based on their individual preferences and requirements.

  • Download name: casino-corpus

  • Tags: medium size, speaker info, utterance labels, negotiation, collaboration

  • Documentation

SPOLIN Corpus

Selected Pairs of Learnable ImprovisatioN (SPOLIN) is a collection of more than 68,000 “Yes, and” type utterance pairs extracted from the long-form improvisation podcast Spontaneanation by Paul F. Tompkins, the Cornell Movie-Dialogs Corpus, and the SubTle corpus.

  • Download name: spolin-corpus

  • Tags: media, large size, online, synchronous, utterance labels, short conversations, various topics

  • Documentation

CANDOR Corpus

The CANDOR corpus is a dataset of 1,650 conversations that strangers had over video chat, with rich metadata obtained from pre-conversation and post-conversation surveys. The corpus is available by request from the authors (BetterUp CANDOR Corpus), and ConvoKit contains code for converting the transcripts into ConvoKit format, as detailed in the documentation.

  • Tags: synchronous, medium size, speaker info, timestamps, utterance labels

  • Documentation

Fora Corpus

The Fora corpus is a dataset of 262 annotated transcripts of multi-person facilitated dialogues on issues such as education, elections, and public health, conducted primarily through the sharing of personal experience. The corpus is available by request from the authors (https://github.com/schropes/fora-corpus), and ConvoKit contains code for converting the transcripts into ConvoKit format, as detailed below.

  • Tags: small size, speaker info, utterance labels, timestamps, group, in person, various topics

  • Documentation

Unintended Offense Corpus

A collection of unintentionally offensive Tweets and replies in which a Tweet in the exchange was offensive to someone, followed by an indication that the poster meant no offense.

  • Tags: online, asynchronous, outcome labels, utterance labels, timestamps, Twitter/X, medium size, short conversations, various topics, politeness

  • Documentation

Ubuntu Chat Logs

A collection of two-speaker conversations from Ubuntu chat logs, in which one speaker helps the other solve a problem.

  • Tags: online, dyadic, asymmetric, synchronous, outcome labels, utterance labels, speaker info, timestamps, small size, medium conversations, customer support, problem solving, derailment

  • Documentation

Contextual Abuse Corpus

A dataset of annotated Reddit entries labeled into one or more of six primary categories of abuse. Secondary categories, labels annotated in the context of the conversation thread, and rationales are also included as part of the dataset.

  • Tags: online, asynchronous, utterance labels, timestamps, Reddit, medium size, short conversations, various topics

  • Documentation

NewsInterview Corpus

A collection of two-person informational interviews from National Public Radio (NPR) and Cable News Network (CNN), focusing on journalistic interviews between interviewers and sources from 2000 to 2020.

  • Tags: dyadic, asymmetric, synchronous, speaker info, summaries, timestamps, media, medium size, medium conversations, various topics, interviews, Q&A

  • Documentation

Emotional Support Conversation Corpus

This dataset contains approximately 1,300 conversations collected between emotional support seekers and supporters.

  • Tags: online, dyadic, asymmetric, synchronous, outcome labels, utterance labels, speaker info, medium size, medium conversations, various topics, support

  • Documentation

Custom Datasets

You can also use ConvoKit with your own custom datasets by loading them into a Corpus object. See our tutorial on converting custom data.
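
As a rough illustration of what loading custom data can look like, the sketch below builds a Corpus from a list of message records. The record fields (id, author, thread, parent, body) and both function names are hypothetical stand-ins for whatever your own data uses; the sketch assumes ConvoKit is installed and relies on the Speaker, Utterance, and Corpus classes. The tutorial remains the authoritative guide.

```python
def to_utterance_fields(records):
    """Map raw message dicts to the fields an Utterance needs.

    The input keys (id, author, thread, parent, body) are hypothetical;
    rename them to match your own data.
    """
    fields = []
    for rec in records:
        parent = rec.get("parent")
        fields.append({
            "id": str(rec["id"]),
            "speaker_id": str(rec["author"]),
            "conversation_id": str(rec["thread"]),
            # The first message of a conversation replies to nothing.
            "reply_to": None if parent is None else str(parent),
            "text": rec["body"],
        })
    return fields


def build_corpus(records):
    """Build a ConvoKit Corpus from raw records (requires ConvoKit)."""
    # Imported lazily so the mapping step above works without ConvoKit.
    from convokit import Corpus, Speaker, Utterance

    speakers = {}
    utterances = []
    for f in to_utterance_fields(records):
        # Reuse one Speaker object per distinct author.
        if f["speaker_id"] not in speakers:
            speakers[f["speaker_id"]] = Speaker(id=f["speaker_id"])
        utterances.append(Utterance(
            id=f["id"],
            speaker=speakers[f["speaker_id"]],
            conversation_id=f["conversation_id"],
            reply_to=f["reply_to"],
            text=f["text"],
        ))
    return Corpus(utterances=utterances)
```

Keeping the field mapping separate from the Corpus construction makes it easy to check the reply structure of your data before loading it.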