Contextual Abuse Dataset (CAD) Corpus

This corpus contains 26,550 annotated Reddit entries (1,394 post titles, 1,394 post bodies, and 23,762 comments). Each entry is labeled with one or more of six primary categories: Identity-directed abuse, Affiliation-directed abuse, Person-directed abuse, Counter Speech, Non-hateful Slurs, and Neutral, with additional secondary subcategories such as Derogation, Animosity, Threatening, Dehumanization, and Glorification.

The original dataset was introduced in: Introducing CAD: the Contextual Abuse Dataset. Bertie Vidgen, Dong Nguyen, Helen Margetts, Patricia Rossini, and Rebekah Tromble. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2021.

Dataset details

Speaker-level information

Speakers in this dataset correspond to Reddit users. Each speaker is identified from the meta_author field of the original data. If the author value is missing, marked as NA, or deleted, the speaker ID is set to [deleted].
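The mapping rule above can be sketched as a small helper. This is an illustrative reconstruction, not the actual conversion code; the exact markers used for missing authors in the raw data are assumptions.

```python
def speaker_id(meta_author):
    """Map the raw meta_author field to a speaker ID.

    Missing, NA, or deleted authors collapse to the sentinel "[deleted]"
    (a sketch of the rule described above; the precise missing-value
    markers in the raw data are assumed here).
    """
    if meta_author is None or meta_author in ("NA", "[deleted]"):
        return "[deleted]"
    return meta_author

print(speaker_id("someuser"))  # someuser
print(speaker_id("NA"))        # [deleted]
```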

Utterance-level information

Each utterance corresponds to one Reddit entry (a post title, post body, or comment). For each utterance, we provide:

  • id: unique utterance identifier, taken from info_id

  • speaker: Reddit username of the author

  • conversation_id: identifier for the Reddit thread containing this utterance

  • reply_to: ID of the parent post or comment (info_id.parent), or None if no valid parent exists

  • timestamp: Unix timestamp (in seconds) of when the utterance was created

  • text: cleaned textual content of the utterance, with [linebreak] markers replaced by newlines

Metadata for each utterance includes:

  • annotation_Primary: main abuse category assigned by trained experts — one of Identity-directed abuse, Affiliation-directed abuse, Person-directed abuse, Counter Speech, Non-hateful Slurs, or Neutral

  • annotation_Secondary: abuse subtype, e.g., Derogation, Animosity, Threatening, Dehumanization, Glorification

  • annotation_Context: whether additional context is required to interpret the label (Yes / No / NA)

  • annotation_Target: the specific individual or group targeted, e.g., Women, Immigrants, Political groups

  • annotation_Target_top.level.category: higher-level target category, e.g., Identity, Group, Other

  • annotation_highlighted: text span(s) highlighted by annotators as abusive or offensive content; "NA" if none

  • meta_date: UTC date of utterance creation (YYYY-MM-DD)

  • meta_created_utc: Unix timestamp of utterance creation

  • meta_day: day of utterance creation (YYYY-MM-DD)

  • meta_permalink: Reddit permalink to the original post or comment

  • info_subreddit: name of the subreddit where the utterance was posted

  • info_subreddit_id: Reddit’s internal ID for the subreddit

  • id: original CAD-assigned ID (e.g., cad_1, cad_2)

  • info_id: original identifier for the utterance (with -title or -post suffix)

  • info_id.parent: identifier of the parent utterance

  • info_id.link: identifier of the original submission that started the thread

  • info_thread.id: identifier grouping all utterances in the same Reddit thread

  • info_order: order of the utterance within its thread

  • info_image.saved: whether an image was saved with the utterance (0 = no, 1 = yes)

  • split: the dataset split in the original project — one of train, dev, test, exclude_empty, exclude_bot, exclude_lang, or exclude_image

  • subreddit_seen: whether the subreddit was included in the annotation set (1) or not (0)

  • entry_type: type of the utterance — one of title, post, or comment
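For illustration, the metadata fields above can be consumed like ordinary per-utterance dictionaries, e.g., to count primary labels or filter to abusive entries. A minimal sketch on hypothetical toy records (not values from the actual corpus):

```python
from collections import Counter

# Toy records shaped like the utterance metadata described above
# (hypothetical IDs and labels, not taken from the real data).
utterances = [
    {"id": "cad_1", "annotation_Primary": "Neutral", "entry_type": "title"},
    {"id": "cad_2", "annotation_Primary": "IdentityDirectedAbuse", "entry_type": "comment"},
    {"id": "cad_3", "annotation_Primary": "Neutral", "entry_type": "comment"},
]

# Tally the primary labels, as one might before training a classifier.
label_counts = Counter(u["annotation_Primary"] for u in utterances)
print(label_counts["Neutral"])  # 2

# Keep only abusive entries (anything that is not Neutral or CounterSpeech).
abusive = [u for u in utterances
           if u["annotation_Primary"] not in ("Neutral", "CounterSpeech")]
print([u["id"] for u in abusive])  # ['cad_2']
```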

Conversational-level information

Each Reddit thread (grouped by info_thread.id) is treated as a conversation. Within each thread, reply_to relations establish the comment tree structure.
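The comment tree implied by reply_to can be rebuilt with a simple parent-to-children map. A sketch on hypothetical IDs, assuming the thread roots at the submission, whose reply_to is None:

```python
from collections import defaultdict

# Hypothetical utterances in one thread: (id, reply_to).
thread = [
    ("t1-post", None),   # the submission that roots the thread
    ("c1", "t1-post"),
    ("c2", "t1-post"),
    ("c3", "c1"),
]

# Build a parent -> children adjacency map from the reply_to relations.
children = defaultdict(list)
roots = []
for utt_id, parent in thread:
    if parent is None:
        roots.append(utt_id)
    else:
        children[parent].append(utt_id)

def walk(utt_id, depth=0):
    """Depth-first traversal returning (id, depth) pairs in tree order."""
    order = [(utt_id, depth)]
    for child in children[utt_id]:
        order.extend(walk(child, depth + 1))
    return order

print(walk("t1-post"))
# [('t1-post', 0), ('c1', 1), ('c3', 2), ('c2', 1)]
```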

Usage

To download directly with ConvoKit:

>>> from convokit import Corpus, download
>>> corpus = Corpus(filename=download("contextual-abuse"))

For some quick stats:

>>> corpus.print_summary_stats()
Number of Speakers: 11123
Number of Utterances: 26550
Number of Conversations: 1395

The counts for the primary labels are as follows:

  • Neutral: 21935

  • IdentityDirectedAbuse: 2216

  • AffiliationDirectedAbuse: 1111

  • PersonDirectedAbuse: 951

  • CounterSpeech: 210

  • Slur: 127
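As a sanity check, the per-label counts above sum to the total utterance count reported by the summary stats, and the label distribution can be computed directly:

```python
# Primary-label counts as reported for this corpus.
label_counts = {
    "Neutral": 21935,
    "IdentityDirectedAbuse": 2216,
    "AffiliationDirectedAbuse": 1111,
    "PersonDirectedAbuse": 951,
    "CounterSpeech": 210,
    "Slur": 127,
}

total = sum(label_counts.values())
print(total)  # 26550, matching the utterance count above

# Share of each label as a percentage of all utterances.
shares = {label: 100 * n / total for label, n in label_counts.items()}
print(f"{shares['Neutral']:.1f}%")  # 82.6%
```

The corpus is heavily skewed toward Neutral, which is worth keeping in mind when sampling or evaluating classifiers on it.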

Additional notes

Data License

This dataset is shared under the Creative Commons Attribution 4.0 International License.

Contact

The original Contextual Abuse Dataset was released with the paper Introducing CAD: the Contextual Abuse Dataset (Vidgen et al., NAACL 2021). Corresponding author: Bertie Vidgen (bvidgen@turing.ac.uk).

The dataset was formatted for Convokit by Hao Wan (hw799@cornell.edu). The demo on transformer usage and analysis was provided by Jadon Geathers (jag569@cornell.edu).