Ubuntu Chat Logs Corpus
=====================================

A collection of 200 goal-oriented conversations from Ubuntu chat logs, where pairs of speakers work together to troubleshoot technical problems. The corpus includes human-annotated conversational friction points as well as friction annotations generated by GPT-4o, GPT-4o-mini, Llama 70B, and Llama 8B.

A full description of the dataset can be found here:
`Understanding Common Ground Misalignment in Goal-Oriented Dialog: A Case-Study with Ubuntu Chat Logs <https://arxiv.org/abs/2503.12370>`_.
Rupak Sarkar, Neha Srikanth, Taylor Hudson, Rachel Rudinger, Claire Bonial, Philip Resnik.
arXiv preprint arXiv:2503.12370, 2025.

Dataset details
---------------

Speaker-level information
^^^^^^^^^^^^^^^^^^^^^^^^^

Each conversation involves exactly two speakers who troubleshoot a problem together; typically one speaker has the problem and the other helps. Role A denotes the speaker seeking assistance, and role B denotes the other participant. Because a speaker can take part in multiple conversations, we track the following metadata:

* role_A_count: number of conversations in which the speaker served in role A
* role_B_count: number of conversations in which the speaker served in role B
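
As a rough illustration of how these counts relate to the conversation-level ``role_A`` and ``role_B`` fields, the sketch below tallies roles from plain dictionaries. The helper name ``count_roles`` and the toy speaker IDs are hypothetical, and it is an assumption (not confirmed by the dataset description) that the shipped counts were derived this way.

```python
from collections import Counter


def count_roles(conversation_metas):
    """Tally how often each speaker appears in role A and in role B.

    `conversation_metas` is any iterable of dict-like objects exposing the
    conversation-level fields `role_A` and `role_B` (speaker IDs).
    """
    role_a = Counter()
    role_b = Counter()
    for meta in conversation_metas:
        role_a[meta["role_A"]] += 1
        role_b[meta["role_B"]] += 1
    # A Counter returns 0 for speakers that never held a given role.
    speakers = set(role_a) | set(role_b)
    return {
        s: {"role_A_count": role_a[s], "role_B_count": role_b[s]}
        for s in speakers
    }


# Hypothetical toy data; these speaker IDs are made up for illustration.
toy_metas = [
    {"role_A": "alice", "role_B": "bob"},
    {"role_A": "carol", "role_B": "bob"},
    {"role_A": "bob", "role_B": "alice"},
]
role_counts = count_roles(toy_metas)
```

With a loaded corpus, the same helper could presumably be fed ``(convo.meta for convo in corpus.iter_conversations())`` and the result compared against each speaker's stored ``role_A_count`` and ``role_B_count``.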


Utterance-level information
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Each utterance corresponds to one message in a chat log. For each utterance, we provide:

* id: unique utterance identifier
* speaker: the speaker who authored the utterance
* conversation_id: unique ID of the conversation this utterance belongs to
* reply_to: index of the utterance to which this utterance is a reply (None if it is not a reply)
* timestamp: sequential index of the utterance within the conversation
* text: textual content of the utterance

Metadata for each utterance include:

* time_elapsed: number of minutes elapsed since the start of the conversation
* gpt_explanation: explanation of the utterance generated by ChatGPT
* conversational_friction: conversational friction scores generated by the original authors
* explanation: human-generated explanation of the utterance
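
The fields above are enough to rebuild a readable transcript. The following is a minimal sketch that works on plain dictionaries keyed by the field names listed in this section; ``render_transcript`` and the toy messages are hypothetical, and it assumes that ``timestamp`` values are sortable indices and that ``reply_to`` holds an utterance index or ``None``, as described above.

```python
def render_transcript(utterances):
    """Return one formatted line per utterance, ordered by `timestamp`."""
    ordered = sorted(utterances, key=lambda u: u["timestamp"])
    lines = []
    for u in ordered:
        # Mark replies explicitly; non-replies carry reply_to = None.
        reply = "" if u["reply_to"] is None else f" (reply to {u['reply_to']})"
        lines.append(f"[{u['timestamp']}] {u['speaker']}{reply}: {u['text']}")
    return lines


# Hypothetical toy utterances, deliberately listed out of order.
toy_utterances = [
    {"id": "u1", "speaker": "helper", "conversation_id": "c0",
     "reply_to": 0, "timestamp": 1, "text": "Which release are you on?"},
    {"id": "u0", "speaker": "asker", "conversation_id": "c0",
     "reply_to": None, "timestamp": 0, "text": "My wifi stopped working."},
]
transcript = render_transcript(toy_utterances)
```

To use this with ConvoKit objects, each utterance would first need to be converted to a dictionary of its ``id``, ``speaker``, ``conversation_id``, ``reply_to``, ``timestamp``, and ``text`` values.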


Conversational-level information
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Each conversation represents a single Ubuntu troubleshooting session. Metadata associated with conversations include:

* batch: the batch grouping for the conversation
* duration: total duration of the conversation in minutes
* role_A: speaker ID for the participant serving in role A
* role_B: speaker ID for the participant serving in role B
* ending: type of conversation ending; one of ``natural end``, ``abrupt``, or ``ran out of time``
* conversational_success: outcome of the conversation; one of ``success``, ``some progress``, or ``no progress``

For each annotation source, that is, the human annotators (``human``) and each model (``gpt4o``, ``gpt4omini``, ``llama70b``, ``llama8b``), the following metadata fields are provided, with ``[model]`` replaced by the source name:

* conversational_friction_present_[model]: whether conversational friction was detected anywhere in the conversation by [model]
* friction_count_[model]: number of friction instances detected by [model]
* friction_index_list_[model]: list of utterance indices where friction was detected by [model]
* explanation_list_[model]: list of natural-language explanations for each friction instance generated by [model]
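
Because the field names follow a fixed pattern, they can be built programmatically. The sketch below compares a model's ``friction_index_list`` against the human one using exact index matches. The helper ``index_agreement`` is hypothetical, exact matching is an illustrative choice rather than the evaluation used by the dataset authors, and the code assumes the lists are stored as Python lists of integers.

```python
SOURCES = ("human", "gpt4o", "gpt4omini", "llama70b", "llama8b")


def index_agreement(meta, source, reference="human"):
    """Precision and recall of `source` friction indices against `reference`.

    Returns (precision, recall); a value is None when its denominator is
    empty (no predicted or no reference friction points).
    """
    ref = set(meta[f"friction_index_list_{reference}"])
    pred = set(meta[f"friction_index_list_{source}"])
    overlap = ref & pred
    precision = len(overlap) / len(pred) if pred else None
    recall = len(overlap) / len(ref) if ref else None
    return precision, recall


# Hypothetical conversation metadata; the indices are made up.
toy_meta = {
    "friction_index_list_human": [3, 7, 12],
    "friction_index_list_gpt4o": [7, 12, 15, 20],
    "friction_index_list_llama8b": [],
}
precision, recall = index_agreement(toy_meta, "gpt4o")
empty_precision, empty_recall = index_agreement(toy_meta, "llama8b")
```

Returning ``None`` for an empty denominator keeps conversations without any detected friction from being silently scored as zero.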


Usage
-----

To download directly with ConvoKit:

>>> from convokit import Corpus, download
>>> corpus = Corpus(filename=download("ubuntu-chat-logs"))

For some quick stats:

>>> corpus.print_summary_stats()
Number of Speakers: 361
Number of Utterances: 7950
Number of Conversations: 200
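
Beyond the summary statistics, the conversation metadata can be cross-tabulated. The following sketch counts how conversation endings co-occur with outcomes; ``outcome_table`` is a hypothetical helper, and the toy records only reuse the label values documented above rather than real data.

```python
from collections import Counter


def outcome_table(conversation_metas):
    """Count (ending, conversational_success) pairs across conversations."""
    return Counter(
        (meta["ending"], meta["conversational_success"])
        for meta in conversation_metas
    )


# Hypothetical toy metadata using the documented label values.
toy_conversations = [
    {"ending": "natural end", "conversational_success": "success"},
    {"ending": "natural end", "conversational_success": "success"},
    {"ending": "abrupt", "conversational_success": "no progress"},
    {"ending": "ran out of time", "conversational_success": "some progress"},
]
table = outcome_table(toy_conversations)
```

On the real corpus, the input could be ``(convo.meta for convo in corpus.iter_conversations())``, assuming the metadata keys match the names listed in this document.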

Additional notes
----------------

Data License
^^^^^^^^^^^^

This dataset is shared under the `Creative Commons Attribution 4.0 International License <https://creativecommons.org/licenses/by/4.0/>`_.

Contact
^^^^^^^

ConvoKit-formatted corpus created by Axel Bax (adb333@cornell.edu).

Please email questions about the original dataset to the corresponding author: Rupak Sarkar (rupak@umd.edu).