Ubuntu Chat Logs Corpus

A collection of 200 goal-oriented conversations from Ubuntu chat logs, where pairs of speakers work together to troubleshoot technical problems. The corpus includes human-annotated conversational friction points as well as friction annotations generated by GPT-4o, GPT-4o-mini, Llama 70B, and Llama 8B.

A full description of the dataset can be found here: Understanding Common Ground Misalignment in Goal-Oriented Dialog: A Case-Study with Ubuntu Chat Logs. Rupak Sarkar, Neha Srikanth, Taylor Hudson, Rachel Rudinger, Claire Bonial, Philip Resnik. arXiv preprint arXiv:2503.12370, 2025.

Dataset details

Speaker-level information

Each conversation involves a pair of speakers troubleshooting a problem together: typically one speaker has a problem and the other helps solve it. Role A denotes the speaker seeking assistance, and role B denotes the speaker providing it. Because a speaker can take part in multiple conversations, we track the following metadata for each speaker:

  • role_A_count: number of conversations in which the speaker served in role A

  • role_B_count: number of conversations in which the speaker served in role B
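The two role counts make it easy to separate speakers who only ask for help from those who also help others. The following is a minimal sketch, assuming the counts are stored in each speaker's `meta` mapping under the keys listed above; the helper name `role_profile` is hypothetical and not part of ConvoKit.

```python
from collections import Counter


def role_profile(speakers):
    """Tally speakers by the roles they have played across conversations.

    `speakers` is any iterable of objects exposing a `meta` mapping,
    for example the result of `corpus.iter_speakers()`.
    """
    profile = Counter()
    for speaker in speakers:
        # Assumption: a missing or empty count is treated as zero.
        asked = speaker.meta.get("role_A_count", 0) or 0
        helped = speaker.meta.get("role_B_count", 0) or 0
        if asked and helped:
            profile["both roles"] += 1
        elif asked:
            profile["role A only"] += 1
        elif helped:
            profile["role B only"] += 1
        else:
            profile["neither"] += 1
    return profile
```

With the corpus loaded as shown under Usage, `role_profile(corpus.iter_speakers())` returns the tally.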

Utterance-level information

Each utterance corresponds to one message in a chat log. For each utterance, we provide:

  • id: unique utterance identifier

  • speaker: the speaker who authored the utterance

  • conversation_id: unique ID of the conversation this utterance belongs to

  • reply_to: index of the utterance to which this utterance is a reply (None if it is not a reply)

  • timestamp: sequential index of the utterance within the conversation

  • text: textual content of the utterance
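The `reply_to` field links each utterance to the one it answers, so a thread can be recovered by following those links backwards. The sketch below assumes that `reply_to` values can be looked up in the same mapping as utterance `id` values (as in ConvoKit's usual convention) and that the first utterance of a thread has `reply_to` set to None; `reply_chain` is a hypothetical helper name.

```python
def reply_chain(utterances, utt_id):
    """Return the ids from `utt_id` back to the utterance that started the thread.

    `utterances` maps an utterance id to an object exposing `reply_to`.
    """
    chain = []
    seen = set()
    current = utt_id
    # Stop at the thread root (reply_to is None), at an unknown id,
    # or if a cycle is encountered.
    while current is not None and current in utterances and current not in seen:
        chain.append(current)
        seen.add(current)
        current = utterances[current].reply_to
    return chain
```

A suitable mapping for one conversation can be built with `{u.id: u for u in conversation.iter_utterances()}`.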

Metadata for each utterance include:

  • time_elapsed: number of minutes elapsed since the start of the conversation

  • gpt_explanation: explanation of the utterance generated by ChatGPT

  • conversational_friction: conversational friction scores generated by the original authors

  • explanation: human-generated explanation of the utterance
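To read a conversation alongside its timing, the primary fields and the `time_elapsed` metadata can be combined into a simple transcript. This is a sketch only: it assumes `timestamp` values are sortable within a conversation and that `time_elapsed` is numeric when present; `format_transcript` is a hypothetical helper name.

```python
def format_transcript(utterances):
    """Render utterances as '[  3.0 min] speaker: text' lines in conversation order."""
    # Assumption: `timestamp` is a sortable sequential index.
    ordered = sorted(utterances, key=lambda u: u.timestamp)
    lines = []
    for utt in ordered:
        elapsed = utt.meta.get("time_elapsed")
        # Fall back to a placeholder when no elapsed time is recorded.
        stamp = f"{float(elapsed):5.1f} min" if elapsed is not None else "    ? min"
        lines.append(f"[{stamp}] {utt.speaker.id}: {utt.text}")
    return lines
```

For a loaded conversation, `print("\n".join(format_transcript(conversation.iter_utterances())))` prints the transcript.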

Conversation-level information

Each conversation represents a single Ubuntu troubleshooting session. Metadata associated with conversations include:

  • batch: the batch grouping for the conversation

  • duration: total duration of the conversation in minutes

  • role_A: speaker ID for the participant serving in role A

  • role_B: speaker ID for the participant serving in role B

  • ending: type of conversation ending — one of natural end, abrupt, or ran out of time

  • conversational_success: outcome of the conversation — one of success, some progress, or no progress
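The `ending` and `conversational_success` labels can be cross-tabulated to see, for instance, how often abruptly ended conversations still made progress. A minimal sketch, assuming both labels are stored in each conversation's `meta` mapping under the keys above; `outcome_by_ending` is a hypothetical helper name.

```python
from collections import Counter


def outcome_by_ending(conversations):
    """Count conversations for each (ending, conversational_success) pair.

    `conversations` is any iterable of objects exposing a `meta` mapping,
    for example the result of `corpus.iter_conversations()`.
    """
    counts = Counter()
    for convo in conversations:
        # A missing label is counted under None rather than dropped.
        key = (convo.meta.get("ending"), convo.meta.get("conversational_success"))
        counts[key] += 1
    return counts
```

Calling `outcome_by_ending(corpus.iter_conversations()).most_common()` lists the pairs from most to least frequent.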

For each annotation source (human, gpt4o, gpt4omini, llama70b, llama8b), the following metadata fields are provided, where [model] stands for the name of the source:

  • conversational_friction_present_[model]: whether conversational friction was detected anywhere in the conversation by [model]

  • friction_count_[model]: number of friction instances detected by [model]

  • friction_index_list_[model]: list of utterance indices where friction was detected by [model]

  • explanation_list_[model]: list of natural-language explanations for each friction instance generated by [model]
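Because the field names follow a fixed pattern, they can be generated from the source name, and a model's friction indices can be compared against the human ones. The sketch below assumes the index lists are stored as lists of hashable values and counts only exact index matches; this matching rule and the helper names `friction_fields` and `index_overlap` are illustrative assumptions, not the evaluation used by the original authors.

```python
def friction_fields(source):
    """Build the conversation metadata keys for one annotation source."""
    return {
        "present": f"conversational_friction_present_{source}",
        "count": f"friction_count_{source}",
        "indices": f"friction_index_list_{source}",
        "explanations": f"explanation_list_{source}",
    }


def index_overlap(meta, source, reference="human"):
    """Return (precision, recall) of `source` friction indices against `reference`.

    Either value is None when it is undefined (an empty list on that side).
    """
    predicted = set(meta.get(friction_fields(source)["indices"]) or [])
    gold = set(meta.get(friction_fields(reference)["indices"]) or [])
    hits = len(predicted & gold)  # exact index matches only
    precision = hits / len(predicted) if predicted else None
    recall = hits / len(gold) if gold else None
    return precision, recall
```

For a loaded conversation, `index_overlap(conversation.meta, "gpt4o")` compares the GPT-4o indices with the human ones.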

Usage

To download directly with ConvoKit:

>>> from convokit import Corpus, download
>>> corpus = Corpus(filename=download("ubuntu-chat-logs"))

For some quick stats:

>>> corpus.print_summary_stats()
Number of Speakers: 361
Number of Utterances: 7950
Number of Conversations: 200

Additional notes

Data License

This dataset is shared under the Creative Commons Attribution 4.0 International License.

Contact

ConvoKit-formatted corpus created by Axel Bax (adb333@cornell.edu).

Please email questions about the original dataset to the corresponding author: Rupak Sarkar (rupak@umd.edu).