Conversations Gone Awry Dataset (Large) - Reddit CMV version (CGA-CMV-Large)

A collection of conversations from the ChangeMyView (CMV) subreddit that derail into personal attacks (19,578 conversations, 116,793 comments), now updated with conversations up to 2022. We recommend using this dataset over the original version.

Summaries of conversation dynamics (SCDs) are available for a subset of the conversations.

Distributed together with: Trouble on the Horizon: Forecasting the Derailment of Online Conversations as they Develop. Jonathan P. Chang and Cristian Danescu-Niculescu-Mizil. EMNLP 2019.

Summaries of conversation dynamics described in: How Did We Get Here? Summarizing Conversation Dynamics. Yilun Hua, Nick Chernogor, Yuzhe Gu, Seoyon Julie Jeong, Miranda Luo, Cristian Danescu-Niculescu-Mizil. NAACL 2024.

Example usage of the corpus and summaries: SCD and Basic Examples

Dataset details

Speaker-level information

Speakers in this dataset are Reddit users; their account names are taken as the user names.

Utterance-level information

Each utterance corresponds to a Reddit comment. For each utterance, we provide:

id: Reddit ID of the comment represented by the utterance
speaker: the speaker who authored the utterance
conversation_id: id of the first utterance in the conversation this utterance belongs to. Note that this differs from how ‘conversation_id’ is treated in ConvoKit’s general Reddit corpora: in those corpora a conversation is considered to start with a Reddit post utterance, whereas in this corpus a conversation is considered to start with a top-level reply to a post.
reply_to: Reddit ID of the utterance to which this utterance replies (None if the utterance represents a top-level comment, i.e., a reply to a post)
timestamp: time of the utterance
text: textual content of the utterance
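
Because reply_to is None for the first comment of a conversation, the reply structure can be walked upward from any comment. The following is a minimal sketch on plain dictionaries rather than ConvoKit objects; the helper name reply_chain and the toy ids are hypothetical and not part of the corpus or the library.

```python
from typing import Dict, List, Optional


def reply_chain(utt_id: str, reply_to: Dict[str, Optional[str]]) -> List[str]:
    """Return the ids on the path from the start of the thread down to utt_id.

    reply_to maps each utterance id to the id it replies to (None for the
    first comment of a conversation, as described above).
    """
    chain: List[str] = []
    seen = set()
    current: Optional[str] = utt_id
    # Follow reply_to links upward; the seen set guards against cycles
    # in malformed input.
    while current is not None and current not in seen:
        seen.add(current)
        chain.append(current)
        current = reply_to.get(current)
    chain.reverse()
    return chain


# Toy ids, invented for illustration only.
toy_reply_to = {"c1": None, "c2": "c1", "c3": "c2", "c4": "c1"}
print(reply_chain("c3", toy_reply_to))  # ['c1', 'c2', 'c3']
```

With a loaded corpus, a mapping of this shape could be built as {utt.id: utt.reply_to for utt in corpus.iter_utterances()}.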

Metadata for each utterance is inherited from the general CMV corpus:

score: score (i.e., the number of upvotes minus the number of downvotes) of the content
top_level_comment: the id of the top level comment (None if the utterance is a post)
retrieved_on: Unix timestamp of when the data was retrieved
gilded: gilded status of the content
gildings: gilding information of the content
stickied: stickied status of the content
permalink: permanent link of the content
author_flair_text: flair of the author

Conversation-level information

Metadata for each conversation include:

pair_id: the id of the conversation that this conversation is paired with
has_removed_comment: whether the final comment in this thread was removed by CMV moderators for violation of Rule 2
split: which split (train, val, or test) this conversation was used in for the experiments described in “Trouble on the Horizon”
summary_meta: metadata related to conversation summaries, a list of dictionaries (one per summary available, possibly empty) with the following keys:
- summary_text: the text of the summary;
- summary_type: whether the summary was written by humans (“human_written_SCD”) or generated automatically using the procedural prompt (“machine_generated_SCD”);
- up_to_utterance_id: the last utterance considered when creating the summary;
- truncated_by: the number of utterances the transcript was truncated by when creating the summary (starting from the end);
- scd_split: which split (train, test, or validation) the summary belonged to in the 2024 “How Did We Get Here? Summarizing Conversation Dynamics” paper.
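
Since summary_meta is a list of dictionaries, picking out one kind of summary is a plain list filter. The sketch below assumes that summary_type holds exactly the strings “human_written_SCD” and “machine_generated_SCD” as listed above; the helper name summaries_of_type and the toy entries are hypothetical, and the toy values (including the scd_split string) are invented for illustration.

```python
from typing import Any, Dict, List


def summaries_of_type(summary_meta: List[Dict[str, Any]], summary_type: str) -> List[str]:
    """Return the text of every summary whose summary_type matches."""
    return [
        entry["summary_text"]
        for entry in summary_meta
        if entry.get("summary_type") == summary_type
    ]


# Toy entries using the documented keys; values are made up.
toy_summary_meta = [
    {
        "summary_text": "Toy human-written summary.",
        "summary_type": "human_written_SCD",
        "up_to_utterance_id": "c3",
        "truncated_by": 1,
        "scd_split": "train",
    },
    {
        "summary_text": "Toy machine-generated summary.",
        "summary_type": "machine_generated_SCD",
        "up_to_utterance_id": "c3",
        "truncated_by": 1,
        "scd_split": "train",
    },
]

print(summaries_of_type(toy_summary_meta, "human_written_SCD"))
# An empty summary_meta list (no summaries available) simply yields [].
print(summaries_of_type([], "human_written_SCD"))
```

With a loaded corpus, the list for one conversation would be read from convo.meta["summary_meta"].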

Usage

To download directly with ConvoKit:

>>> from convokit import Corpus, download
>>> corpus = Corpus(filename=download("conversations-gone-awry-cmv-corpus-large"))

For some quick stats:

>>> corpus.print_summary_stats()
Number of Speakers: 24555
Number of Utterances: 116793
Number of Conversations: 19578
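
Beyond the summary stats, a common first check is how the derailment label is distributed across the experimental splits. The following is a rough sketch that tallies the split and has_removed_comment conversation metadata fields; it works on plain mappings so it can be tried without downloading the corpus. The helper name label_counts and the toy records are hypothetical.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def label_counts(metas: Iterable[Mapping[str, Any]]) -> Counter:
    """Count conversations per (split, has_removed_comment) combination."""
    return Counter((m["split"], bool(m["has_removed_comment"])) for m in metas)


# Toy conversation metadata, invented for illustration only.
toy_metas = [
    {"pair_id": "b", "has_removed_comment": True, "split": "train"},
    {"pair_id": "a", "has_removed_comment": False, "split": "train"},
    {"pair_id": "d", "has_removed_comment": True, "split": "test"},
    {"pair_id": "c", "has_removed_comment": False, "split": "test"},
]

counts = label_counts(toy_metas)
for (split, removed), n in sorted(counts.items()):
    print(split, removed, n)
```

With a loaded corpus, the same function could be called as label_counts(convo.meta for convo in corpus.iter_conversations()).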

Contact

Please email any questions to: cristian@cs.cornell.edu (Cristian Danescu-Niculescu-Mizil)