Ubuntu Chat Logs Corpus
=====================================

A collection of 200 goal-oriented conversations from Ubuntu chat logs, where pairs of speakers work together to troubleshoot technical problems. The corpus includes human-annotated conversational friction points as well as friction annotations generated by GPT-4o, GPT-4o-mini, Llama 70B, and Llama 8B.

A full description of the dataset can be found in: Understanding Common Ground Misalignment in Goal-Oriented Dialog: A Case-Study with Ubuntu Chat Logs. Rupak Sarkar, Neha Srikanth, Taylor Hudson, Rachel Rudinger, Claire Bonial, Philip Resnik. arXiv preprint arXiv:2503.12370, 2025.

Dataset details
---------------

Speaker-level information
^^^^^^^^^^^^^^^^^^^^^^^^^

Speakers in this dataset troubleshoot problems together, always in pairs. Usually one person has a problem and the other is helping them. Role A denotes the person seeking assistance; role B denotes the other participant, who is usually the one helping. As speakers can take part in multiple conversations, we track the following metadata:

* role_A_count: number of conversations in which the speaker served in role A
* role_B_count: number of conversations in which the speaker served in role B

Utterance-level information
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Each utterance corresponds to one message in a chat log.
For each utterance, we provide:

* id: unique utterance identifier
* speaker: the speaker who authored the utterance
* conversation_id: unique ID of the conversation this utterance belongs to
* reply_to: index of the utterance to which this utterance is a reply (None if it is not a reply)
* timestamp: sequential index of the utterance within the conversation
* text: textual content of the utterance

Metadata for each utterance include:

* time_elapsed: number of minutes elapsed since the start of the conversation
* gpt_explanation: explanation of the utterance generated by ChatGPT
* conversational_friction: conversational friction scores generated by the original authors
* explanation: human-generated explanation of the utterance

Conversational-level information
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Each conversation represents a single Ubuntu troubleshooting session. Metadata associated with conversations include:

* batch: the batch grouping for the conversation
* duration: total duration of the conversation in minutes
* role_A: speaker ID for the participant serving in role A
* role_B: speaker ID for the participant serving in role B
* ending: type of conversation ending, one of ``natural end``, ``abrupt``, or ``ran out of time``
* conversational_success: outcome of the conversation, one of ``success``, ``some progress``, or ``no progress``

For the human annotations and for each model (``human``, ``gpt4o``, ``gpt4omini``, ``llama70b``, ``llama8b``), the following metadata fields are provided, where ``[model]`` is replaced by one of these names:

* conversational_friction_present_[model]: whether conversational friction was detected anywhere in the conversation by [model]
* friction_count_[model]: number of friction instances detected by [model]
* friction_index_list_[model]: list of utterance indices where friction was detected by [model]
* explanation_list_[model]: list of natural-language explanations for each friction instance generated by [model]

Usage
-----

To download directly with ConvoKit:

>>> from convokit import Corpus, download
>>> corpus = Corpus(filename=download("ubuntu-chat-logs"))

For some quick stats:

>>> corpus.print_summary_stats()
Number of Speakers: 361
Number of Utterances: 7950
Number of Conversations: 200

Additional notes
----------------

Data License
^^^^^^^^^^^^

This dataset is shared under the Creative Commons Attribution 4.0 International License.

Contact
^^^^^^^

ConvoKit-formatted corpus created by Axel Bax (adb333@cornell.edu).

Please email questions about the original dataset to the corresponding author: Rupak Sarkar (rupak@umd.edu).
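
Example: comparing friction annotations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The per-source ``friction_index_list_[model]`` fields described under the conversation-level metadata can be compared against each other. The following is a minimal sketch rather than part of the original dataset tooling. It assumes each ``friction_index_list_[model]`` value is a list of integer utterance indices. The helper name ``friction_overlap`` and the example values are hypothetical and chosen only for illustration.

```python
def friction_overlap(meta, model, reference="human"):
    """Compare friction indices flagged by `model` with those of `reference`.

    `meta` is any mapping holding conversation-level metadata.
    Assumption: the friction_index_list_* values are lists of integers.
    """
    ref = set(meta.get(f"friction_index_list_{reference}") or [])
    hyp = set(meta.get(f"friction_index_list_{model}") or [])
    shared = ref & hyp
    # Report None instead of dividing by zero when a list is empty.
    precision = len(shared) / len(hyp) if hyp else None
    recall = len(shared) / len(ref) if ref else None
    return {"shared": sorted(shared), "precision": precision, "recall": recall}


# Made-up metadata for illustration; these are not values from the corpus.
example_meta = {
    "friction_index_list_human": [3, 7, 12],
    "friction_index_list_gpt4o": [3, 12, 15, 20],
}

result = friction_overlap(example_meta, "gpt4o")
print(result["shared"], result["precision"], result["recall"])
```

With a loaded corpus, the same helper could be applied to ``convo.meta`` for each ``convo`` in ``corpus.iter_conversations()``, assuming the conversation metadata behaves like a mapping with a ``get`` method.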