Datasets¶

ConvoKit ships with several datasets ready for use “out-of-the-box”. These datasets can be downloaded using the convokit.download() helper function. Alternatively, you can access them directly here.
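As a minimal sketch of the helper mentioned above, the snippet below passes the path returned by convokit.download() to Corpus. It assumes ConvoKit is installed and that network access is available on first use; load_corpus is a hypothetical wrapper name introduced here for illustration, not part of the library.

```python
def load_corpus(name: str):
    """Download the dataset called `name` (if needed) and load it as a Corpus.

    Requires ConvoKit to be installed and, on first use, network access.
    """
    # Imported inside the function so that defining this helper does not
    # itself require ConvoKit.
    from convokit import Corpus, download

    # download() returns the local path of the downloaded dataset,
    # which Corpus() then reads from disk.
    return Corpus(filename=download(name))
```

For example, load_corpus("movie-corpus") would fetch and load the Cornell Movie-Dialogs Corpus using the download name listed for it on this page.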

Conversations Gone Awry¶

Three related corpora of conversations that derail into antisocial behavior.

CGA-WIKI: Wikipedia talk page conversations that derail into personal attacks as labeled by crowdworkers.

Download name: conversations-gone-awry-corpus
Tags: Wikipedia, derailment, online, asynchronous, outcome labels, summaries, persuasion, medium size, debate, medium conversations, timestamps

CGA-CMV: ChangeMyView discussion threads that derail into rule-violating behavior.

Download name: conversations-gone-awry-cmv-corpus
Tags: Reddit, derailment, online, asynchronous, outcome labels, summaries, persuasion, medium size, debate, medium conversations, timestamps

CGA-CMV-Large: Expanded version of CGA-CMV dataset.

Download name: conversations-gone-awry-cmv-corpus-large
Tags: Reddit, derailment, online, asynchronous, outcome labels, summaries, persuasion, medium size, debate, medium conversations, timestamps

Documentation

Cornell Movie-Dialogs Corpus¶

A large metadata-rich collection of fictional conversations extracted from raw movie scripts.

Download name: movie-corpus
Tags: fictional, speaker info, synchronous, large size, medium conversations
Documentation

Parliament Question Time Corpus¶

Parliamentary question periods from May 1979 to December 2016.

Download name: parliament-corpus
Tags: politics, speaker info, institutional, asymmetric, synchronous, short conversations, large size
Documentation

Supreme Court Corpus¶

A collection of conversations from U.S. Supreme Court oral arguments.

Download name: supreme-corpus
Tags: institutional, asymmetric, law, speaker info, outcome labels, in person, synchronous, long conversations, large size
Documentation

Wikipedia Talk Pages Corpus¶

A medium-size collection of conversations from Wikipedia editors’ talk pages.

Download name: wiki-corpus
Tags: online, asynchronous, Wikipedia, outcome labels, medium size, collaboration, medium conversations, timestamps
Documentation

Reddit Corpus¶

Reddit conversations from over 900k subreddits, arranged by subreddit. A small subset sampled from 100 highly active subreddits is also available.

Download name: subreddit-<name_of_subreddit> or reddit-corpus-small
Tags: large size, Reddit, online, asynchronous, timestamps
Documentation
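The per-subreddit naming pattern above can be sketched as a small helper. The function name subreddit_download_name is hypothetical, and stripping a leading "r/" is a convenience assumed by this sketch; the subreddit's capitalization is left untouched because this page does not say how names are normalized.

```python
def subreddit_download_name(subreddit: str) -> str:
    """Build a download name of the form subreddit-<name_of_subreddit>."""
    name = subreddit.strip()
    # Accept the common "r/" spelling as a convenience (an assumption of this sketch).
    if name.lower().startswith("r/"):
        name = name[2:]
    if not name:
        raise ValueError("subreddit name must not be empty")
    return f"subreddit-{name}"
```

The resulting string can be passed to convokit.download(); the small sampled subset uses the fixed name reddit-corpus-small instead.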

WikiConv Corpus¶

Wikipedia talk page conversations from the distinct English, German, Russian, Chinese, and Greek versions of the site, based on the reconstruction described in this paper. Note that due to the large size of the data, every language but Greek is split up by year. We separately provide block data retrieved directly from the Wikipedia block log, for reproducing the Trajectories of Blocked Community Members paper.

Download name: wikiconv-<language>-<year> for the English, German, Russian, and Chinese datasets, where <language> is the lowercase name of the language; wikiconv-greek for the Greek dataset.
Tags: large size, Wikipedia, online, asynchronous, timestamps, collaboration
Documentation
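The WikiConv naming scheme described above can be sketched as follows. The helper name wikiconv_download_name is hypothetical, and the sketch only assembles the string: which years actually exist for each language is not stated on this page, so no year range is checked.

```python
from typing import Optional

# Languages that this page says are split up by year.
YEARLY_LANGUAGES = {"english", "german", "russian", "chinese"}


def wikiconv_download_name(language: str, year: Optional[int] = None) -> str:
    """Build a WikiConv download name from a language and, if needed, a year."""
    lang = language.strip().lower()
    if lang == "greek":
        # The Greek dataset is not split by year, so any year is ignored.
        return "wikiconv-greek"
    if lang not in YEARLY_LANGUAGES:
        raise ValueError(f"unsupported WikiConv language: {language!r}")
    if year is None:
        raise ValueError(f"a year is required for the {lang} dataset")
    return f"wikiconv-{lang}-{year}"
```

Lowercasing the language inside the helper matches the rule that the language key is the lowercase language name, so callers can pass "English" or "english" interchangeably.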

Chromium Conversations Corpus¶

A collection of almost 1.5 million conversations and 2.8 million comments posted by developers reviewing proposed code changes in the Chromium project.

Download name: chromium-corpus
Tags: large size, online, asynchronous, utterance labels, speaker info, timestamps, collaboration, short conversations, work
Documentation

Tennis Interviews¶

Transcripts of post-match press conferences for tennis singles matches at major tournaments from 2007 to 2015 (6,467 press conferences in total).

Download name: tennis-corpus
Tags: short conversations, interviews, sports, speaker info
Documentation

Winning Arguments Corpus¶

A metadata-rich subset of conversations from the r/ChangeMyView subreddit between 1 Jan 2013 and 7 May 2015, with information on the delta (success) of a speaker’s utterance in convincing the poster.

Download name: winning-args-corpus
Tags: large size, Reddit, asynchronous, online, outcome labels, debate, persuasion, various topics
Documentation

Coarse Discourse Corpus¶

A subset of Reddit conversations that have been manually annotated with discourse act labels.

Download name: reddit-coarse-discourse-corpus
Tags: medium size, Reddit, online, asynchronous, utterance labels, various topics
Documentation

Persuasion For Good Corpus¶

A collection of online conversations generated by Amazon Mechanical Turk workers, where one participant (the persuader) tries to convince the other (the persuadee) to donate to a charity.

Download name: persuasionforgood-corpus
Tags: medium size, online, synchronous, speaker info, utterance labels, outcome labels, persuasion, dyadic, medium conversations
Documentation

Intelligence Squared Debates Corpus¶

Transcripts of debates held as part of Intelligence Squared Debates.

Download name: iq2-corpus
Tags: small size, in person, summaries, media, utterance labels, timestamps, outcome labels, debate, long conversations, various topics, politics
Documentation

Friends Corpus¶

A collection of all the conversations that occurred over 10 seasons of Friends, a popular American TV sitcom that ran in the 1990s.

Download name: friends-corpus
Tags: medium size, fictional, group, media, utterance labels, sarcasm
Documentation

Federal Open Market Committee (FOMC) Corpus¶

Transcripts of recurring meetings of the Federal Reserve’s Open Market Committee (FOMC), where important aspects of U.S. monetary policy are decided, covering the period 1977-2008.

Download name: fomc-corpus
Tags: small size, in person, timestamps, institutional, speaker info, utterance labels, politics, financial, long conversations
Documentation

NPR Interview 2P Dataset Corpus¶

This corpus contains conversations between NPR show hosts and their guests.

Download name: npr-2p-corpus
Tags: large size, in person, dyadic, media, Q&A, interviews, various topics
Documentation

DeliData Dataset Corpus¶

This corpus contains conversations from multi-party problem-solving settings, along with information about group discussions and team performance.

Download name: deli-corpus
Tags: group, synchronous, medium size, summaries, outcome labels, problem solving, collaboration
Documentation

Switchboard Dialog Act Corpus¶

A collection of 1,155 five-minute telephone conversations between two participants, annotated with speech act tags.

Download name: switchboard-corpus
Tags: synchronous, dyadic, medium size, speaker info, summaries, various topics, debate
Documentation

Stanford Politeness Corpus¶

Two collections of requests (from Wikipedia and Stack Exchange respectively) with politeness annotations.

Stanford Politeness (Wikipedia): A collection of requests from Wikipedia Talk pages, annotated with politeness (4,353 utterances).

Download name: wikipedia-politeness-corpus
Tags: medium size, asynchronous, Wikipedia, utterance labels, online, short conversations, politeness
Documentation

Stanford Politeness (Stack Exchange): A collection of requests from Stack Exchange, annotated with politeness (6,603 utterances).

Download name: stack-exchange-politeness-corpus
Tags: medium size, asynchronous, Stack Exchange, utterance labels, online, short conversations, politeness
Documentation

Deception in Diplomacy Conversations¶

Conversational dataset with intended and perceived deception labels. Over 17,000 messages annotated by the sender for their intended truthfulness and by the receiver for their perceived truthfulness.

Download name: diplomacy-corpus
Tags: medium size, group, speaker info, utterance labels, negotiation, medium conversations, persuasion, collaboration, deception
Documentation

Group Affect and Performance (GAP) Corpus¶

A conversational dataset comprising meetings of two to four participants who deliberate in a group decision-making exercise. This dataset contains 28 group meetings with a total of 84 participants.

Download name: gap-corpus
Tags: institutional, small size, in person, group, speaker info, timestamps, summaries, outcome labels, collaboration
Documentation

Wikipedia Articles for Deletion Corpus¶

A collection of Wikipedia’s Articles for Deletion editor debates that occurred between January 1, 2005 and December 31, 2018. This corpus contains about 3,200,000 contributions by approximately 150,000 Wikipedia editors across almost 400,000 debates.

Download name: wiki-articles-for-deletion-corpus
Tags: Wikipedia, large size, online, asynchronous, speaker info, utterance labels, outcome labels, timestamps, debate
Documentation

CaSiNo Corpus¶

CaSiNo (short for CampSite Negotiations) is a dataset of 1,030 negotiation dialogues. Two participants take the role of campsite neighbors and negotiate for Food, Water, and Firewood packages, based on their individual preferences and requirements.

Download name: casino-corpus
Tags: medium size, speaker info, utterance labels, negotiation, collaboration
Documentation

SPOLIN Corpus¶

Selected Pairs of Learnable ImprovisatioN (SPOLIN) is a collection of more than 68,000 “Yes, and” type utterance pairs extracted from the long-form improvisation podcast Spontaneanation by Paul F. Tompkins, the Cornell Movie-Dialogs Corpus, and the SubTle corpus.

Download name: spolin-corpus
Tags: media, large size, online, synchronous, utterance labels, short conversations, various topics
Documentation

CANDOR Corpus¶

The CANDOR corpus is a dataset of 1,650 conversations that strangers had over video chat, with rich metadata obtained from pre-conversation and post-conversation surveys. The corpus is available by request from the authors (BetterUp CANDOR Corpus), and ConvoKit contains code for converting the transcripts into ConvoKit format, as detailed in the documentation.

Tags: synchronous, medium size, speaker info, timestamps, utterance labels
Documentation

Fora Corpus¶

The Fora corpus is a dataset of 262 annotated transcripts of multi-person facilitated dialogues in which participants discuss issues such as education, elections, and public health, primarily by sharing personal experience. The corpus is available by request from the authors (https://github.com/schropes/fora-corpus), and ConvoKit contains code for converting the transcripts into ConvoKit format, as detailed in the documentation.

Tags: small size, speaker info, utterance labels, timestamps, group, in person, various topics
Documentation

Unintended Offense Corpus¶

A collection of Twitter exchanges in which a Tweet unintentionally offended someone and the poster subsequently indicated that no offense was meant. ConvoKit contains code for converting the data into ConvoKit format, as detailed in the documentation.

Tags: online, asynchronous, outcome labels, utterance labels, timestamps, Twitter/X, medium size, short conversations, various topics, politeness
Documentation

Ubuntu Chat Logs¶

A collection of conversations from Ubuntu chat logs, each featuring a pair of speakers in which one helps the other solve their problem.

Download name: ubuntu-chat-logs
Tags: online, dyadic, asymmetric, synchronous, outcome labels, utterance labels, speaker info, timestamps, small size, medium conversations, customer support, problem solving, derailment
Documentation

Contextual Abuse Corpus¶

A dataset of annotated Reddit entries labeled into one or more of six primary categories of abuse. Secondary categories, labels annotated in the context of the conversation thread, and rationales are also included as part of the dataset.

Download name: contextual-abuse
Tags: online, asynchronous, utterance labels, timestamps, Reddit, medium size, short conversations, various topics
Documentation

NewsInterview Corpus¶

A collection of two-person informational interviews from National Public Radio (NPR) and Cable News Network (CNN), focusing on journalistic interviews between interviewers and sources from 2000 to 2020.

Download name: news-interview
Tags: dyadic, asymmetric, synchronous, speaker info, summaries, timestamps, media, medium size, medium conversations, various topics, interviews, Q&A
Documentation

Emotional Support Conversation Corpus¶

This dataset contains approximately 1,300 conversations collected between emotional support seekers and supporters.

Download name: emotional-support
Tags: online, dyadic, asymmetric, synchronous, outcome labels, utterance labels, speaker info, medium size, medium conversations, various topics, support
Documentation

Custom Datasets¶

You can also use ConvoKit with your own custom datasets by loading them into a Corpus object. See our tutorial on converting custom data.
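As a rough sketch of what loading custom data into a Corpus object can look like, the snippet below builds a Corpus from a list of plain dictionaries using ConvoKit's Utterance and Speaker classes. The row keys ("id", "author", "reply_to", "text") and the helper names conversation_root and build_corpus are hypothetical choices made for this illustration. The sketch also assumes that reply links are well formed (every parent exists and there are no cycles) and that ConvoKit is installed.

```python
from typing import Dict, List, Optional


def conversation_root(msg_id: str, parents: Dict[str, Optional[str]]) -> str:
    """Follow reply links upward until reaching a message with no parent."""
    while parents[msg_id] is not None:
        msg_id = parents[msg_id]
    return msg_id


def build_corpus(rows: List[Dict[str, Optional[str]]]):
    """Turn rows with hypothetical keys id/author/reply_to/text into a Corpus."""
    # Imported lazily so conversation_root() is usable without ConvoKit.
    from convokit import Corpus, Speaker, Utterance

    parents = {row["id"]: row["reply_to"] for row in rows}
    speakers = {}
    utterances = []
    for row in rows:
        # Reuse one Speaker object per distinct author.
        if row["author"] not in speakers:
            speakers[row["author"]] = Speaker(id=row["author"])
        utterances.append(
            Utterance(
                id=row["id"],
                speaker=speakers[row["author"]],
                # Each conversation is identified by the id of its first message.
                conversation_id=conversation_root(row["id"], parents),
                reply_to=row["reply_to"],
                text=row["text"],
            )
        )
    return Corpus(utterances=utterances)
```

Computing the conversation id from the reply chain means the input only needs to record each message's direct parent; the tutorial referenced above remains the authoritative guide to the expected data layout.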