Chromium Conversations Corpus¶

A collection of almost 1.5 million conversations and 2.8 million comments posted by developers reviewing proposed code changes in the Chromium project.

Contributed by: Benjamin S. Meyers (bsm9339@rit.edu)

Distributed together with: Benjamin S. Meyers, Nuthan Munaiah, Emily Prud’hommeaux, Andrew Meneely, Cecilia O. Alm, Josephine Wolff, and Pradeep Murukannaiah. A Dataset for Identifying Actionable Feedback in Collaborative Software Development. Proceedings of the 2018 Meeting for the Association for Computational Linguistics (ACL). Melbourne, Australia. http://www.aclweb.org/anthology/P18-2021

A full description of the dataset can be found here.

Dataset details¶

Speaker-level information¶

Speaker names have been anonymized randomly to ‘developer_#’ where ‘#’ is a number between 1 and 4842.

Additional metadata includes:

user_type: either ‘developer’, the developer who proposed the code change, or ‘reviewer’, other developers reviewing the code change

Utterance-level information¶

Each utterance corresponds to a comment in the Chromium project.

id: index of the utterance
speaker: the speaker who authored the utterance
conversation_id: id of the first utterance in the conversation this utterance belongs to
reply_to: id of the utterance to which this utterance replies to (None if the utterance is not a reply)
timestamp: time of the utterance
text: textual content of the utterance

Additional metadata includes some associated pre-calculated linguistic metrics:

yngve_score: The maximum Yngve score of sentences in the code review comment
frazier_score: The maximum Frazier score of sentences in the code review comment
pdensity: The Propositional Density score of the code review comment
cdensity: The Content Density score of the code review comment
has_doxastic: Binary indicator of presence of a sentence with doxastic uncertainty in the code review comment
has_epistemic: Binary indicator of presence of a sentence with epistemic uncertainty in the code review comment
has_conditional: Binary indicator of presence of a sentence with conditional uncertainty in the code review comment
has_investigative: Binary indicator of presence of a sentence with investigative uncertainty in the code review comment
has_uncertainty: Binary indicator of presence of a sentence with any uncertainty in the code review comment
min_formality: Minimum of the formality of sentences in the code review comment
max_formality: Maximum of the formality of sentences in the code review comment

Conversation-level information¶

Each conversation has the associated metadata:

review_id: Unique identifier of a code review in the Chromium project. The URL https://codereview.chromium.org/<review_id> may be used to access the review online
patchset_id: Unique identifier of a code review patchset (i.e., collection of changes to the source code) associated with a review
patch_id: Unique identifier of a code review patch (i.e., individual change to the source code) associated with a patchset
file_path: The path to the file being modified in the patch
line_number: The line number in the file at which the comment was posted

Usage¶

To download directly with ConvoKit:

>>> from convokit import Corpus, download
>>> corpus = Corpus(filename=download("chromium-corpus"))

For some quick stats:

>>> corpus.print_summary_stats()
Number of Speakers: 4842
Number of Utterances: 2853498
Number of Conversations: 1484843

Additional note¶

Data License¶

Creative Commons Attribution 4.0 International

Contact¶

Please email any questions to: bsm9339@rit.edu (Benjamin S. Meyers)