Ubuntu Chat Logs Corpus¶
A collection of 200 goal-oriented conversations from Ubuntu chat logs, where pairs of speakers work together to troubleshoot technical problems. The corpus includes human-annotated conversational friction points as well as friction annotations generated by GPT-4o, GPT-4o-mini, Llama 70B, and Llama 8B.
A full description of the dataset is given in: Understanding Common Ground Misalignment in Goal-Oriented Dialog: A Case-Study with Ubuntu Chat Logs. Rupak Sarkar, Neha Srikanth, Taylor Hudson, Rachel Rudinger, Claire Bonial, Philip Resnik. arXiv preprint arXiv:2503.12370, 2025.
Dataset details¶
Speaker-level information¶
Speakers in this dataset troubleshoot problems together, always in pairs: usually one person has a problem and the other helps. Role A denotes the person seeking assistance, and role B the person assisting. Because a speaker can take part in multiple conversations, we track the following metadata:
role_A_count: number of conversations in which the speaker served in role A
role_B_count: number of conversations in which the speaker served in role B
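As a rough illustration of how these counts relate to the conversation-level `role_A` and `role_B` fields, the sketch below recomputes them from plain dictionaries. The helper name `count_roles` and the sample speaker IDs are hypothetical and not part of the corpus; with the corpus loaded in ConvoKit, the stored values should be readable from each speaker's `meta` dictionary instead.

```python
from collections import Counter


def count_roles(conversations):
    """Tally how often each speaker appears in role A and in role B.

    `conversations` is an iterable of dicts with `role_A` and `role_B`
    keys, mirroring the conversation-level metadata fields.
    """
    role_a, role_b = Counter(), Counter()
    for meta in conversations:
        role_a[meta["role_A"]] += 1
        role_b[meta["role_B"]] += 1
    speakers = set(role_a) | set(role_b)
    # Counter returns 0 for speakers never seen in a given role.
    return {
        s: {"role_A_count": role_a[s], "role_B_count": role_b[s]}
        for s in speakers
    }


# Hypothetical sample data, not taken from the corpus.
sample = [
    {"role_A": "alice", "role_B": "bob"},
    {"role_A": "carol", "role_B": "bob"},
    {"role_A": "bob", "role_B": "alice"},
]
role_counts = count_roles(sample)
print(role_counts["bob"])  # {'role_A_count': 1, 'role_B_count': 2}
```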
Utterance-level information¶
Each utterance corresponds to one message in a chat log. For each utterance, we provide:
id: unique utterance identifier
speaker: the speaker who authored the utterance
conversation_id: unique ID of the conversation this utterance belongs to
reply_to: index of the utterance to which this utterance is a reply (None if it is not a reply)
timestamp: sequential index of the utterance within the conversation
text: textual content of the utterance
Metadata for each utterance include:
time_elapsed: number of minutes elapsed since the start of the conversation
gpt_explanation: explanation of the utterance generated by ChatGPT
conversational_friction: conversational friction scores generated by the original authors
explanation: human-generated explanation of the utterance
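The `reply_to` field links utterances into reply chains. A minimal sketch of walking such a chain back to its starting message is shown below, using plain dictionaries rather than ConvoKit objects. It assumes that `reply_to` values match the `id` values of other utterances; the helper `reply_chain` and the sample IDs are hypothetical.

```python
def reply_chain(utterances, utt_id):
    """Return utterance ids from `utt_id` back to the first non-reply.

    `utterances` is a list of dicts with `id` and `reply_to` keys.
    Assumption: `reply_to` holds the `id` of the parent utterance,
    or None when the utterance is not a reply.
    """
    by_id = {u["id"]: u for u in utterances}
    chain, seen = [], set()
    current = utt_id
    # Stop at a non-reply, an unknown id, or a cycle.
    while current is not None and current in by_id and current not in seen:
        chain.append(current)
        seen.add(current)
        current = by_id[current]["reply_to"]
    return chain


# Hypothetical sample data, not taken from the corpus.
sample_utts = [
    {"id": "c1_0", "reply_to": None, "text": "my wifi card is not detected"},
    {"id": "c1_1", "reply_to": "c1_0", "text": "what does lspci show?"},
    {"id": "c1_2", "reply_to": "c1_1", "text": "it lists a Broadcom device"},
]
chain = reply_chain(sample_utts, "c1_2")
print(chain)  # ['c1_2', 'c1_1', 'c1_0']
```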
Conversation-level information¶
Each conversation represents a single Ubuntu troubleshooting session. Metadata associated with conversations include:
batch: the batch grouping for the conversation
duration: total duration of the conversation in minutes
role_A: speaker ID for the participant serving in role A
role_B: speaker ID for the participant serving in role B
ending: type of conversation ending, one of natural end, abrupt, or ran out of time
conversational_success: outcome of the conversation, one of success, some progress, or no progress
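One way to get a feel for the `ending` and `conversational_success` fields is to cross-tabulate them. The sketch below does this over plain dictionaries; the helper `outcome_table` and the sample records are hypothetical, and in practice the dictionaries would presumably come from each conversation's `meta` in ConvoKit.

```python
from collections import Counter


def outcome_table(conversations):
    """Count conversations for each (ending, conversational_success) pair."""
    return Counter(
        (meta["ending"], meta["conversational_success"])
        for meta in conversations
    )


# Hypothetical sample data, not taken from the corpus.
sample = [
    {"ending": "natural end", "conversational_success": "success"},
    {"ending": "abrupt", "conversational_success": "no progress"},
    {"ending": "natural end", "conversational_success": "success"},
    {"ending": "ran out of time", "conversational_success": "some progress"},
]
table = outcome_table(sample)
for (ending, success), n in sorted(table.items()):
    print(f"{ending:>16} | {success:<13} | {n}")
```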
For the human annotations and for each model, the following metadata fields are provided, with [model] replaced by one of human, gpt4o, gpt4omini, llama70b, or llama8b:
conversational_friction_present_[model]: whether conversational friction was detected anywhere in the conversation by [model]
friction_count_[model]: number of friction instances detected by [model]
friction_index_list_[model]: list of utterance indices where friction was detected by [model]
explanation_list_[model]: list of natural-language explanations for each friction instance generated by [model]
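Because the human and model annotations share the same field layout, they can be compared directly. The sketch below computes a simple exact-index overlap between the human friction indices and one model's indices for a single conversation. Exact-index matching is an assumption made for illustration and is not necessarily the evaluation used by the original authors; the helper `friction_overlap` and the sample values are hypothetical.

```python
def friction_overlap(meta, model):
    """Compare one model's friction indices against the human ones.

    `meta` is a dict of conversation-level metadata containing
    `friction_index_list_human` and `friction_index_list_<model>`.
    """
    human = set(meta["friction_index_list_human"])
    predicted = set(meta[f"friction_index_list_{model}"])
    hits = human & predicted
    # Guard against empty lists to avoid division by zero.
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(human) if human else 0.0
    return {"hits": sorted(hits), "precision": precision, "recall": recall}


# Hypothetical sample data, not taken from the corpus.
sample_meta = {
    "friction_index_list_human": [3, 7, 12],
    "friction_index_list_gpt4o": [3, 8, 12, 15],
}
scores = friction_overlap(sample_meta, "gpt4o")
print(scores["hits"], scores["precision"])  # [3, 12] 0.5
```

Averaging these per-conversation scores over the corpus would give a coarse picture of how closely each model tracks the human annotations.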
Usage¶
To download directly with ConvoKit:
>>> from convokit import Corpus, download
>>> corpus = Corpus(filename=download("ubuntu-chat-logs"))
For some quick stats:
>>> corpus.print_summary_stats()
Number of Speakers: 361
Number of Utterances: 7950
Number of Conversations: 200
Additional notes¶
Data License¶
This dataset is shared under the Creative Commons Attribution 4.0 International License.
Contact¶
ConvoKit-formatted corpus created by Axel Bax (adb333@cornell.edu).
Please email questions about the original dataset to the corresponding author: Rupak Sarkar (rupak@umd.edu).