Contextual Abuse Dataset (CAD) Corpus

This corpus contains 26,550 annotated Reddit entries (1,394 post titles, 1,394 post bodies, and 23,762 comments). Each entry is labeled with one or more of six primary categories: Identity-directed abuse, Affiliation-directed abuse, Person-directed abuse, Counter Speech, Non-hateful Slurs, and Neutral, with additional secondary subcategories such as Derogation, Animosity, Threatening, Dehumanization, and Glorification.

The original dataset was introduced in: Introducing CAD: the Contextual Abuse Dataset. Bertie Vidgen, Dong Nguyen, Helen Margetts, Patricia Rossini, and Rebekah Tromble. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2021.

Dataset details

Speaker-level information

Speakers in this dataset correspond to Reddit users. Each speaker is identified from the meta_author field of the original data. If the author value is missing, marked as NA, or deleted, the speaker ID is set to [deleted].
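The mapping rule above can be sketched as a small helper. This is an illustrative reconstruction, not the actual conversion code; the exact markers used for missing authors in the raw data are assumptions.

```python
def speaker_id(meta_author):
    """Map the raw meta_author field to a speaker ID.

    Missing, NA, or deleted authors collapse to the sentinel "[deleted]"
    (a sketch of the rule described above; the precise missing-value
    markers in the raw data are assumed here).
    """
    if meta_author is None or meta_author in ("NA", "[deleted]"):
        return "[deleted]"
    return meta_author

print(speaker_id("someuser"))  # someuser
print(speaker_id("NA"))        # [deleted]
```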

Utterance-level information

Each utterance corresponds to one Reddit entry (a post title, post body, or comment). For each utterance, we provide:

  • id: unique utterance identifier, taken from info_id

  • speaker: Reddit username of the author

  • conversation_id: identifier for the Reddit thread containing this utterance

  • reply_to: ID of the parent post or comment (info_id.parent), or None if no valid parent exists

  • timestamp: Unix timestamp (in seconds) of when the utterance was created

  • text: cleaned textual content of the utterance, with [linebreak] markers replaced by newlines

Metadata for each utterance includes:

  • annotation_Primary: main abuse category assigned by trained experts — one of Identity-directed abuse, Affiliation-directed abuse, Person-directed abuse, Counter Speech, Non-hateful Slurs, or Neutral

  • annotation_Secondary: abuse subtype, e.g., Derogation, Animosity, Threatening, Dehumanization, Glorification

  • annotation_Context: whether additional context is required to interpret the label (Yes / No / NA)

  • annotation_Target: the specific individual or group targeted, e.g., Women, Immigrants, Political groups

  • annotation_Target_top.level.category: higher-level target category, e.g., Identity, Group, Other

  • annotation_highlighted: text span(s) highlighted by annotators as abusive or offensive content; "NA" if none

  • meta_date: UTC date of utterance creation (YYYY-MM-DD)

  • meta_created_utc: Unix timestamp of utterance creation

  • meta_day: day of utterance creation (YYYY-MM-DD)

  • meta_permalink: Reddit permalink to the original post or comment

  • info_subreddit: name of the subreddit where the utterance was posted

  • info_subreddit_id: Reddit’s internal ID for the subreddit

  • id: original CAD-assigned ID (e.g., cad_1, cad_2)

  • info_id: original identifier for the utterance (with -title or -post suffix)

  • info_id.parent: identifier of the parent utterance

  • info_id.link: identifier of the original submission that started the thread

  • info_thread.id: identifier grouping all utterances in the same Reddit thread

  • info_order: order of the utterance within its thread

  • info_image.saved: whether an image was saved with the utterance (0 = no, 1 = yes)

  • split: the dataset split in the original project — one of train, dev, test, exclude_empty, exclude_bot, exclude_lang, or exclude_image

  • subreddit_seen: whether the subreddit was included in the annotation set (1) or not (0)

  • entry_type: type of the utterance — one of title, post, or comment
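For illustration, the metadata fields above can be consumed like ordinary per-utterance dictionaries, e.g., to count primary labels or filter to abusive entries. A minimal sketch on hypothetical toy records (not values from the actual corpus):

```python
from collections import Counter

# Toy records shaped like the utterance metadata described above
# (hypothetical IDs and labels, not taken from the real data).
utterances = [
    {"id": "cad_1", "annotation_Primary": "Neutral", "entry_type": "title"},
    {"id": "cad_2", "annotation_Primary": "IdentityDirectedAbuse", "entry_type": "comment"},
    {"id": "cad_3", "annotation_Primary": "Neutral", "entry_type": "comment"},
]

# Tally the primary labels, as one might before training a classifier.
label_counts = Counter(u["annotation_Primary"] for u in utterances)
print(label_counts["Neutral"])  # 2

# Keep only abusive entries (anything that is not Neutral or CounterSpeech).
abusive = [u for u in utterances
           if u["annotation_Primary"] not in ("Neutral", "CounterSpeech")]
print([u["id"] for u in abusive])  # ['cad_2']
```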

Conversational-level information

Each Reddit thread (grouped by info_thread.id) is treated as a conversation. Within each thread, reply_to relations establish the comment tree structure.
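The comment tree implied by reply_to can be rebuilt with a simple parent-to-children map. A sketch on hypothetical IDs, assuming the thread roots at the submission, whose reply_to is None:

```python
from collections import defaultdict

# Hypothetical utterances in one thread: (id, reply_to).
thread = [
    ("t1-post", None),   # the submission that roots the thread
    ("c1", "t1-post"),
    ("c2", "t1-post"),
    ("c3", "c1"),
]

# Build a parent -> children adjacency map from the reply_to relations.
children = defaultdict(list)
roots = []
for utt_id, parent in thread:
    if parent is None:
        roots.append(utt_id)
    else:
        children[parent].append(utt_id)

def walk(utt_id, depth=0):
    """Depth-first traversal returning (id, depth) pairs in tree order."""
    order = [(utt_id, depth)]
    for child in children[utt_id]:
        order.extend(walk(child, depth + 1))
    return order

print(walk("t1-post"))
# [('t1-post', 0), ('c1', 1), ('c3', 2), ('c2', 1)]
```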

Usage

To download directly with ConvoKit:

>>> from convokit import Corpus, download
>>> corpus = Corpus(filename=download("contextual-abuse"))

For some quick stats:

>>> corpus.print_summary_stats()
Number of Speakers: 11123
Number of Utterances: 26550
Number of Conversations: 1395

The counts for the primary labels are as follows:

  • Neutral: 21935

  • IdentityDirectedAbuse: 2216

  • AffiliationDirectedAbuse: 1111

  • PersonDirectedAbuse: 951

  • CounterSpeech: 210

  • Slur: 127
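As a sanity check, the per-label counts above sum to the total utterance count reported by the summary stats, and the label distribution can be computed directly:

```python
# Primary-label counts as reported for this corpus.
label_counts = {
    "Neutral": 21935,
    "IdentityDirectedAbuse": 2216,
    "AffiliationDirectedAbuse": 1111,
    "PersonDirectedAbuse": 951,
    "CounterSpeech": 210,
    "Slur": 127,
}

total = sum(label_counts.values())
print(total)  # 26550, matching the utterance count above

# Share of each label as a percentage of all utterances.
shares = {label: 100 * n / total for label, n in label_counts.items()}
print(f"{shares['Neutral']:.1f}%")  # 82.6%
```

The corpus is heavily skewed toward Neutral, which is worth keeping in mind when sampling or evaluating classifiers on it.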

Additional notes

Data License

This dataset is shared under the Creative Commons Attribution 4.0 International License.

Contact

The original Contextual Abuse Dataset was released with the paper Introducing CAD: the Contextual Abuse Dataset (Vidgen et al., NAACL 2021). Corresponding author: Bertie Vidgen (bvidgen@turing.ac.uk).

The dataset was formatted for Convokit by Hao Wan (hw799@cornell.edu). The demo on transformer usage and analysis was provided by Jadon Geathers (jag569@cornell.edu).