Ubuntu Chat Logs Corpus

A collection of 200 goal-oriented conversations from Ubuntu chat logs, where pairs of speakers work together to troubleshoot technical problems. The corpus includes human-annotated conversational friction points as well as friction annotations generated by GPT-4o, GPT-4o-mini, Llama 70B, and Llama 8B.

A full description of the dataset can be found in the following paper: Understanding Common Ground Misalignment in Goal-Oriented Dialog: A Case-Study with Ubuntu Chat Logs. Rupak Sarkar, Neha Srikanth, Taylor Hudson, Rachel Rudinger, Claire Bonial, Philip Resnik. arXiv preprint arXiv:2503.12370, 2025.

Dataset details

Speaker-level information

Speakers in this dataset troubleshoot problems together, and every conversation involves exactly two speakers. Usually one person has a problem and the other is helping them. Role A denotes the person seeking assistance; role B denotes the other participant. Because a speaker can take part in multiple conversations, we track the following metadata for each speaker:

role_A_count: number of conversations in which the speaker served in role A
role_B_count: number of conversations in which the speaker served in role B
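As a minimal sketch of how these two counters might be used, the snippet below finds speakers who have appeared in both roles. It works on a plain mapping from speaker ID to metadata, so it runs without the corpus; the speaker IDs and counts are hypothetical. With a loaded ConvoKit corpus, such a mapping could presumably be built with something like `{s.id: s.meta for s in corpus.iter_speakers()}`, which is an assumption about how you load the data rather than part of this dataset's documentation.

```python
from typing import List, Mapping


def speakers_in_both_roles(speaker_meta: Mapping[str, Mapping[str, int]]) -> List[str]:
    """Return the IDs of speakers who served at least once in each role."""
    return sorted(
        speaker_id
        for speaker_id, meta in speaker_meta.items()
        if meta.get("role_A_count", 0) > 0 and meta.get("role_B_count", 0) > 0
    )


# Hypothetical speaker IDs and counts, for illustration only.
example_speakers = {
    "alice": {"role_A_count": 2, "role_B_count": 0},
    "bob": {"role_A_count": 1, "role_B_count": 3},
    "carol": {"role_A_count": 0, "role_B_count": 1},
}

both = speakers_in_both_roles(example_speakers)
print(both)
```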

Utterance-level information

Each utterance corresponds to one message in a chat log. For each utterance, we provide:

id: unique utterance identifier
speaker: the speaker who authored the utterance
conversation_id: unique ID of the conversation this utterance belongs to
reply_to: index of the utterance to which this utterance is a reply (None if it is not a reply)
timestamp: sequential index of the utterance within the conversation
text: textual content of the utterance

Metadata for each utterance include:

time_elapsed: number of minutes elapsed since the start of the conversation
gpt_explanation: explanation of the utterance generated by ChatGPT
conversational_friction: conversational friction scores generated by the original authors
explanation: human-generated explanation of the utterance
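The utterance fields above are enough to rebuild each conversation in order. The sketch below groups utterance records by `conversation_id` and sorts them by `timestamp`, treating `timestamp` as the sequential index described above. The records are hypothetical plain dictionaries that carry only the fields the function needs; they are not taken from the corpus.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_conversation(utterances: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group utterance records by conversation_id, ordered by timestamp."""
    grouped = defaultdict(list)
    for utt in utterances:
        grouped[utt["conversation_id"]].append(utt)
    for utts in grouped.values():
        # timestamp is assumed to be the utterance's position in its conversation
        utts.sort(key=lambda u: u["timestamp"])
    return dict(grouped)


# Hypothetical records, deliberately out of order.
example_utterances = [
    {"id": "u2", "conversation_id": "c1", "timestamp": 1, "text": "Which release are you on?"},
    {"id": "u1", "conversation_id": "c1", "timestamp": 0, "text": "My wifi stopped working."},
    {"id": "u3", "conversation_id": "c2", "timestamp": 0, "text": "How do I mount a USB drive?"},
]

threads = group_by_conversation(example_utterances)
for conv_id, utts in threads.items():
    print(conv_id, [u["id"] for u in utts])
```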

Conversation-level information

Each conversation represents a single Ubuntu troubleshooting session. Metadata associated with conversations include:

batch: the batch grouping for the conversation
duration: total duration of the conversation in minutes
role_A: speaker ID for the participant serving in role A
role_B: speaker ID for the participant serving in role B
ending: type of conversation ending (one of natural end, abrupt, or ran out of time)
conversational_success: outcome of the conversation (one of success, some progress, or no progress)
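One simple use of the conversation metadata is to cross-tabulate how conversations end against how successful they were. The sketch below does this over a list of plain metadata dictionaries. The example values are hypothetical, and the exact string encoding of `ending` and `conversational_success` in the released corpus is an assumption based on the labels listed above.

```python
from collections import Counter
from typing import Iterable, Mapping


def outcome_by_ending(conversation_meta: Iterable[Mapping[str, str]]) -> Counter:
    """Count conversations for each (ending, conversational_success) pair."""
    return Counter(
        (meta["ending"], meta["conversational_success"])
        for meta in conversation_meta
    )


# Hypothetical metadata records; label spellings are assumed.
example_conversations = [
    {"ending": "natural end", "conversational_success": "success"},
    {"ending": "natural end", "conversational_success": "success"},
    {"ending": "abrupt", "conversational_success": "no progress"},
    {"ending": "ran out of time", "conversational_success": "some progress"},
]

table = outcome_by_ending(example_conversations)
for (ending, outcome), count in sorted(table.items()):
    print(f"{ending:>16} | {outcome:<14} | {count}")
```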

For each annotation source (human, gpt4o, gpt4omini, llama70b, llama8b), the following metadata fields are provided, with [model] replaced by the name of the source:

conversational_friction_present_[model]: whether conversational friction was detected anywhere in the conversation by [model]
friction_count_[model]: number of friction instances detected by [model]
friction_index_list_[model]: list of utterance indices where friction was detected by [model]
explanation_list_[model]: list of natural-language explanations for each friction instance generated by [model]
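Because the per-source fields share a naming pattern, their keys can be built programmatically. The sketch below builds the four field names for a given source and computes a Jaccard overlap between the friction indices of one source and the human annotations. It assumes that `friction_index_list_[model]` holds a list of comparable utterance indices; the helper names and the example metadata are hypothetical, and Jaccard overlap is only one possible agreement measure, not necessarily the one used by the dataset's authors.

```python
from typing import Dict, Mapping

ANNOTATORS = ("human", "gpt4o", "gpt4omini", "llama70b", "llama8b")
FIELD_PREFIXES = (
    "conversational_friction_present",
    "friction_count",
    "friction_index_list",
    "explanation_list",
)


def friction_fields(annotator: str) -> Dict[str, str]:
    """Map each field prefix to its full metadata key for one annotation source."""
    if annotator not in ANNOTATORS:
        raise ValueError(f"unknown annotation source: {annotator}")
    return {prefix: f"{prefix}_{annotator}" for prefix in FIELD_PREFIXES}


def friction_overlap(meta: Mapping[str, list], annotator: str, reference: str = "human") -> float:
    """Jaccard overlap between two sources' friction indices (1.0 if both are empty)."""
    ref = set(meta[friction_fields(reference)["friction_index_list"]])
    other = set(meta[friction_fields(annotator)["friction_index_list"]])
    union = ref | other
    if not union:
        return 1.0
    return len(ref & other) / len(union)


# Hypothetical conversation metadata, for illustration only.
example_meta = {
    "friction_index_list_human": [3, 7, 12],
    "friction_index_list_gpt4o": [3, 12, 15],
    "friction_index_list_llama8b": [],
}

print(friction_fields("gpt4o")["friction_count"])
print(friction_overlap(example_meta, "gpt4o"))
print(friction_overlap(example_meta, "llama8b"))
```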

Usage

To download directly with ConvoKit:

>>> from convokit import Corpus, download
>>> corpus = Corpus(filename=download("ubuntu-chat-logs"))

For some quick stats:

>>> corpus.print_summary_stats()
Number of Speakers: 361
Number of Utterances: 7950
Number of Conversations: 200

Additional notes

Data License

This dataset is shared under the Creative Commons Attribution 4.0 International License.

Contact

ConvoKit-formatted corpus created by Axel Bax (adb333@cornell.edu).

Please email questions about the original dataset to the corresponding author: Rupak Sarkar (rupak@umd.edu).