Chromium Conversations Corpus

A collection of almost 1.5 million conversations and 2.8 million comments posted by developers reviewing proposed code changes in the Chromium project.

Contributed by: Benjamin S. Meyers (bsm9339@rit.edu)

Distributed together with: Benjamin S. Meyers, Nuthan Munaiah, Emily Prud’hommeaux, Andrew Meneely, Cecilia O. Alm, Josephine Wolff, and Pradeep Murukannaiah. A Dataset for Identifying Actionable Feedback in Collaborative Software Development. Proceedings of the 2018 Meeting of the Association for Computational Linguistics (ACL). Melbourne, Australia. http://www.aclweb.org/anthology/P18-2021

A full description of the dataset can be found here.

Dataset details

Speaker-level information

Speaker names have been randomly anonymized to ‘developer_#’, where ‘#’ is a number between 1 and 4842.

Additional metadata includes:

  • user_type: either ‘developer’ (the developer who proposed the code change) or ‘reviewer’ (another developer reviewing the code change)

Utterance-level information

Each utterance corresponds to a comment in the Chromium project.

  • id: index of the utterance

  • speaker: the speaker who authored the utterance

  • conversation_id: id of the first utterance in the conversation this utterance belongs to

  • reply_to: id of the utterance to which this utterance replies (None if the utterance is not a reply)

  • timestamp: time of the utterance

  • text: textual content of the utterance
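As a hedged illustration of how reply_to and conversation_id relate, the sketch below follows reply_to links back to the first utterance of a conversation. It operates on plain dictionaries rather than ConvoKit objects; the ids and comment texts are invented toy data, and reply_chain is a hypothetical helper, not part of ConvoKit.

```python
from typing import Dict, List, Optional


def reply_chain(utterances: Dict[str, dict], utt_id: str) -> List[str]:
    """Return utterance ids from the conversation root down to utt_id."""
    chain: List[str] = []
    seen = set()
    current: Optional[str] = utt_id
    # Follow reply_to until we reach an utterance that is not a reply,
    # an id we have no record for, or a cycle in malformed data.
    while current is not None and current in utterances and current not in seen:
        seen.add(current)
        chain.append(current)
        current = utterances[current]["reply_to"]
    chain.reverse()
    return chain


# Invented toy records using the field names listed above.
toy_utterances = {
    "u1": {"conversation_id": "u1", "reply_to": None, "text": "Could this be const?"},
    "u2": {"conversation_id": "u1", "reply_to": "u1", "text": "Done."},
    "u3": {"conversation_id": "u1", "reply_to": "u2", "text": "Thanks, lgtm."},
}

print(reply_chain(toy_utterances, "u3"))  # ['u1', 'u2', 'u3']
```

In this toy data the first id in the chain equals the conversation_id of every utterance in it, which matches the description of conversation_id as the id of the first utterance.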

Additional metadata includes precomputed linguistic metrics:

  • yngve_score: The maximum Yngve score of sentences in the code review comment

  • frazier_score: The maximum Frazier score of sentences in the code review comment

  • pdensity: The Propositional Density score of the code review comment

  • cdensity: The Content Density score of the code review comment

  • has_doxastic: Binary indicator of presence of a sentence with doxastic uncertainty in the code review comment

  • has_epistemic: Binary indicator of presence of a sentence with epistemic uncertainty in the code review comment

  • has_conditional: Binary indicator of presence of a sentence with conditional uncertainty in the code review comment

  • has_investigative: Binary indicator of presence of a sentence with investigative uncertainty in the code review comment

  • has_uncertainty: Binary indicator of presence of a sentence with any uncertainty in the code review comment

  • min_formality: Minimum of the formality of sentences in the code review comment

  • max_formality: Maximum of the formality of sentences in the code review comment
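To show how these fields might be consumed, here is a minimal sketch that tallies the four uncertainty indicators and computes the formality spread of a comment. It assumes the indicators are stored as booleans or 0/1 integers and that the formality values are numeric; the stored types are not stated here, so check them on the real data. The metadata dictionaries are invented, and uncertainty_counts and formality_spread are hypothetical helpers.

```python
from typing import Dict, Iterable

UNCERTAINTY_KEYS = (
    "has_doxastic",
    "has_epistemic",
    "has_conditional",
    "has_investigative",
)


def uncertainty_counts(metas: Iterable[dict]) -> Dict[str, int]:
    """Count how many comments carry each type of uncertainty indicator."""
    counts = {key: 0 for key in UNCERTAINTY_KEYS}
    for meta in metas:
        for key in UNCERTAINTY_KEYS:
            # Assumption: a truthy value (True or 1) means "present".
            if meta.get(key):
                counts[key] += 1
    return counts


def formality_spread(meta: dict) -> float:
    """Difference between the most and least formal sentence in a comment."""
    return meta["max_formality"] - meta["min_formality"]


# Invented metadata values, for illustration only.
toy_metas = [
    {"has_doxastic": 1, "has_epistemic": 0, "has_conditional": 0,
     "has_investigative": 0, "has_uncertainty": 1,
     "min_formality": 0.25, "max_formality": 0.75},
    {"has_doxastic": 0, "has_epistemic": 0, "has_conditional": 0,
     "has_investigative": 0, "has_uncertainty": 0,
     "min_formality": 0.5, "max_formality": 0.5},
    {"has_doxastic": 1, "has_epistemic": 0, "has_conditional": 1,
     "has_investigative": 0, "has_uncertainty": 1,
     "min_formality": 0.0, "max_formality": 1.0},
]

print(uncertainty_counts(toy_metas))
print([formality_spread(meta) for meta in toy_metas])
```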

Conversation-level information

Each conversation has the following associated metadata:

  • review_id: Unique identifier of a code review in the Chromium project. The URL https://codereview.chromium.org/<review_id> may be used to access the review online

  • patchset_id: Unique identifier of a code review patchset (i.e., collection of changes to the source code) associated with a review

  • patch_id: Unique identifier of a code review patch (i.e., individual change to the source code) associated with a patchset

  • file_path: The path to the file being modified in the patch

  • line_number: The line number in the file at which the comment was posted
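The conversation metadata is enough to point back at the original review. The sketch below builds the review URL from review_id using the pattern given above and formats the commented location as file_path:line_number. The metadata values are invented, and review_url and comment_location are hypothetical helpers.

```python
def review_url(review_id) -> str:
    """Build the online review URL from a review_id."""
    return f"https://codereview.chromium.org/{review_id}"


def comment_location(meta: dict) -> str:
    """Format the commented position as 'file_path:line_number'."""
    return f"{meta['file_path']}:{meta['line_number']}"


# Invented conversation metadata, for illustration only.
toy_meta = {
    "review_id": 1234567,
    "patchset_id": 1,
    "patch_id": 2,
    "file_path": "base/example.cc",
    "line_number": 42,
}

print(review_url(toy_meta["review_id"]))
print(comment_location(toy_meta))
```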

Usage

To download directly with ConvoKit:

>>> from convokit import Corpus, download
>>> corpus = Corpus(filename=download("chromium-corpus"))

For some quick stats:

>>> corpus.print_summary_stats()
Number of Speakers: 4842
Number of Utterances: 2853498
Number of Conversations: 1484843
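Beyond the summary statistics, the speaker metadata can be tallied directly. The following is a rough sketch that counts speakers by user_type; it relies only on ConvoKit's iter_speakers() method and the meta mapping on each speaker. So that it runs without downloading the corpus, it is exercised here on a small stand-in object with invented speakers; with the real data you would pass the corpus loaded above instead. count_user_types is a hypothetical helper.

```python
from collections import Counter
from types import SimpleNamespace


def count_user_types(corpus) -> Counter:
    """Tally the user_type metadata field across all speakers in a corpus."""
    return Counter(
        speaker.meta.get("user_type") for speaker in corpus.iter_speakers()
    )


# Stand-in for a loaded corpus: invented speakers exposing the same
# iter_speakers() / meta interface that the function relies on.
toy_speakers = [
    SimpleNamespace(id="developer_1", meta={"user_type": "developer"}),
    SimpleNamespace(id="developer_2", meta={"user_type": "reviewer"}),
    SimpleNamespace(id="developer_3", meta={"user_type": "reviewer"}),
]
toy_corpus = SimpleNamespace(iter_speakers=lambda: iter(toy_speakers))

print(count_user_types(toy_corpus))
# With the real data: count_user_types(corpus)
```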

Additional note

Data License

Creative Commons Attribution 4.0 International

Contact

Please email any questions to: bsm9339@rit.edu (Benjamin S. Meyers)