Contextual Abuse Dataset (CAD) Corpus
======================================
This corpus contains 26,550 annotated Reddit entries (1,394 post titles, 1,394 post bodies, and 23,762 comments). Each entry is labeled with one or more of six primary categories: Identity-directed abuse, Affiliation-directed abuse, Person-directed abuse, Counter Speech, Non-hateful Slurs, and Neutral, with additional secondary subcategories such as Derogation, Animosity, Threatening, Dehumanization, and Glorification.
The original dataset can be found here:
`Introducing CAD: the Contextual Abuse Dataset `_.
Bertie Vidgen, Dong Nguyen, Helen Margetts, Patricia Rossini, and Rebekah Tromble.
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2021.
Dataset details
---------------
Speaker-level information
^^^^^^^^^^^^^^^^^^^^^^^^^
Speakers in this dataset correspond to Reddit users. Each speaker is identified from the ``meta_author`` field of the original data. If the author value is missing, marked as NA, or deleted, the speaker ID is set to ``[deleted]``.
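The speaker-ID rule above can be sketched as a small normalization function (an illustrative sketch, not the actual conversion code; the helper name is ours, and the set of "missing" author values is an assumption based on the description above):

```python
def normalize_author(author):
    """Map missing, NA, or deleted Reddit author values to "[deleted]",
    mirroring the speaker-ID rule described above. Any other value is
    kept as the speaker ID unchanged."""
    if author is None or author in ("", "NA", "[deleted]"):
        return "[deleted]"
    return author
```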
Utterance-level information
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Each utterance corresponds to one Reddit entry (a post title, post body, or comment). For each utterance, we provide:
* id: unique utterance identifier, taken from ``info_id``
* speaker: Reddit username of the author
* conversation_id: identifier for the Reddit thread containing this utterance
* reply_to: ID of the parent post or comment (``info_id.parent``), or None if no valid parent exists
* timestamp: Unix timestamp (in seconds) of when the utterance was created
* text: cleaned textual content of the utterance, with ``[linebreak]`` markers replaced by newlines
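The ``[linebreak]`` replacement mentioned for the ``text`` field amounts to a one-line transformation (a sketch of that cleaning step under the stated assumption that the raw data uses the literal marker ``[linebreak]``, not the exact pipeline code):

```python
def clean_text(raw):
    """Replace CAD's literal [linebreak] markers with real newlines."""
    return raw.replace("[linebreak]", "\n")
```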
Metadata for each utterance includes:
* annotation_Primary: main abuse category assigned by trained experts — one of ``Identity-directed abuse``, ``Affiliation-directed abuse``, ``Person-directed abuse``, ``Counter Speech``, ``Non-hateful Slurs``, or ``Neutral``
* annotation_Secondary: abuse subtype, e.g., ``Derogation``, ``Animosity``, ``Threatening``, ``Dehumanization``, ``Glorification``
* annotation_Context: whether additional context is required to interpret the label (``Yes`` / ``No`` / ``NA``)
* annotation_Target: the specific individual or group targeted, e.g., ``Women``, ``Immigrants``, ``Political groups``
* annotation_Target_top.level.category: higher-level target category, e.g., ``Identity``, ``Group``, ``Other``
* annotation_highlighted: text span(s) highlighted by annotators as abusive or offensive content; ``"NA"`` if none
* meta_date: UTC date of utterance creation (YYYY-MM-DD)
* meta_created_utc: Unix timestamp of utterance creation
* meta_day: day of utterance creation (YYYY-MM-DD)
* meta_permalink: Reddit permalink to the original post or comment
* info_subreddit: name of the subreddit where the utterance was posted
* info_subreddit_id: Reddit's internal ID for the subreddit
* id: original CAD-assigned ID (e.g., ``cad_1``, ``cad_2``)
* info_id: original identifier for the utterance (with ``-title`` or ``-post`` suffix)
* info_id.parent: identifier of the parent utterance
* info_id.link: identifier of the original submission that started the thread
* info_thread.id: identifier grouping all utterances in the same Reddit thread
* info_order: order of the utterance within its thread
* info_image.saved: whether an image was saved with the utterance (``0`` = no, ``1`` = yes)
* split: the dataset split in the original project — one of ``train``, ``dev``, ``test``, ``exclude_empty``, ``exclude_bot``, ``exclude_lang``, or ``exclude_image``
* subreddit_seen: whether the subreddit was included in the annotation set (``1``) or not (``0``)
* entry_type: type of the utterance — one of ``title``, ``post``, or ``comment``
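In ConvoKit, these fields live in each utterance's ``meta`` dict, so abusive entries can be selected by their primary label. A minimal sketch (the commented lines assume a loaded corpus; the runnable helper below does the same selection on plain metadata dicts, and the label strings follow the stored values such as ``IdentityDirectedAbuse``):

```python
# With a loaded ConvoKit corpus, the selection would look like:
#   abusive = [u for u in corpus.iter_utterances()
#              if u.meta["annotation_Primary"] != "Neutral"]
# The same selection expressed over plain metadata dicts:
def select_abusive(metas):
    """Keep metadata entries whose primary annotation is any non-Neutral label."""
    return [m for m in metas if m.get("annotation_Primary") != "Neutral"]

sample = [
    {"annotation_Primary": "Neutral"},
    {"annotation_Primary": "IdentityDirectedAbuse"},
    {"annotation_Primary": "Slur"},
]
abusive = select_abusive(sample)
```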
Conversational-level information
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Each Reddit thread (grouped by ``info_thread.id``) is treated as a conversation. Within each thread, ``reply_to`` relations establish the comment tree structure.
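The comment tree implied by ``reply_to`` can be reconstructed by grouping each utterance under its parent. A sketch over plain ``(id, reply_to)`` pairs (in ConvoKit the same structure is exposed through ``Conversation`` objects; the helper and sample IDs here are hypothetical):

```python
from collections import defaultdict

def build_reply_tree(utterances):
    """Group utterance IDs under their reply_to parent; roots
    (reply_to is None) are collected under the None key."""
    children = defaultdict(list)
    for utt_id, parent_id in utterances:
        children[parent_id].append(utt_id)
    return children

# A thread: one root post with two replies, one of which has a reply of its own.
tree = build_reply_tree([("t1", None), ("c1", "t1"), ("c2", "t1"), ("c3", "c1")])
```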
Usage
-----
To download directly with ConvoKit:
>>> from convokit import Corpus, download
>>> corpus = Corpus(filename=download("contextual-abuse"))
For some quick stats:
>>> corpus.print_summary_stats()
Number of Speakers: 11123
Number of Utterances: 26550
Number of Conversations: 1395
The counts for the primary labels are as follows:

* ``Neutral``: 21,935
* ``IdentityDirectedAbuse``: 2,216
* ``AffiliationDirectedAbuse``: 1,111
* ``PersonDirectedAbuse``: 951
* ``CounterSpeech``: 210
* ``Slur``: 127
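These per-label counts can be reproduced by tallying the ``annotation_Primary`` metadata across all utterances. A sketch (with a loaded corpus you would tally ``u.meta["annotation_Primary"] for u in corpus.iter_utterances()``; the small label list here is illustrative):

```python
from collections import Counter

def primary_label_counts(labels):
    """Tally primary annotation labels into a Counter."""
    return Counter(labels)

counts = primary_label_counts(
    ["Neutral", "Neutral", "IdentityDirectedAbuse", "Slur"]
)
```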
Additional notes
----------------
Data License
^^^^^^^^^^^^
This dataset is shared under the `Creative Commons Attribution 4.0 International License <https://creativecommons.org/licenses/by/4.0/>`_.
Contact
^^^^^^^
The original Contextual Abuse Dataset was distributed in the paper `Introducing CAD: the Contextual Abuse Dataset `_ (Vidgen et al., NAACL 2021). Corresponding Author: Bertie Vidgen (bvidgen@turing.ac.uk).
The dataset was formatted for ConvoKit by Hao Wan (hw799@cornell.edu).
The demo on transformer usage and analysis was provided by Jadon Geathers (jag569@cornell.edu).