CANDOR Corpus
=============
The CANDOR corpus is a dataset of 1650 conversations between strangers over video chat, with rich metadata obtained from pre-conversation and post-conversation surveys. The corpus is available by request from the authors (`BetterUp CANDOR Corpus `_), and ConvoKit contains code for converting the transcripts into ConvoKit format, as detailed below.
A full description of the dataset can be found here: `Andrew Reece et al., The CANDOR corpus: Insights from a large multimodal dataset of naturalistic conversation. Sci. Adv. 9, eadf3197 (2023). `_
Please cite this paper when using CANDOR in your research.
Usage
-----
Request the CANDOR Corpus (transcripts only) from: `BetterUp CANDOR Corpus `_
Convert the CANDOR Corpus into ConvoKit format using this notebook: `Converting CANDOR Corpus to ConvoKit Format `_
You will need to pick a transcription type when converting the CANDOR corpus to ConvoKit format; this choice determines which Utterance-level metadata are available. See the Utterance-level information section below for more detail.
Dataset details
---------------
All ConvoKit metadata attributes preserve the names used in the original corpus, as detailed here: `BetterUp CANDOR Corpus Data Dictionary `_
Speaker-level information
^^^^^^^^^^^^^^^^^^^^^^^^^
There were 1454 unique participants from a broad range of backgrounds.
Metadata for each speaker include:
* sex: gender of speaker
* politics: political persuasion with which the speaker most identifies (from very conservative to very liberal)
* race: race/ethnicity of speaker
* edu: highest level of school the speaker has completed or received
* employ: current employment situation of speaker
* age: age of speaker
Utterance-level information
^^^^^^^^^^^^^^^^^^^^^^^^^^^
According to the paper, three different algorithms are used to parse speaker turns into utterances: Audiophile, Cliffhanger, and Backbiter. Please refer to the paper for a more detailed description of how the three algorithms are implemented.
- Audiophile: A turn lasts from when one speaker starts talking until the other speaker starts speaking.
- Cliffhanger: A turn is one full sentence said by one speaker, delimited by terminal punctuation marks (periods, question marks, and exclamation points).
- Backbiter: A turn lasts from when one speaker starts talking until the other speaker says a non-backchannel word (example backchannel words: "mhm", "yeah", "exactly", etc.).
You can pick the transcript processing algorithm in the ConvoKit conversion code by changing the TRANSCRIPTION_TYPE variable. Note that the Utterance-level metadata differ depending on which algorithm was used to process the transcripts.
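To make the difference between the three turn definitions concrete, here is a minimal sketch on a toy transcript. It is not the paper's implementation: the segment data, the `BACKCHANNELS` set, and the function names `audiophile_like`, `backbiter_like`, and `cliffhanger_like` are all hypothetical, and the real algorithms work from timestamped transcripts with their own backchannel criteria.

```python
import re

# Toy transcript: (speaker, text) segments in time order. Invented for illustration.
SEGMENTS = [
    ("A", "So I moved here last year. It was a big change."),
    ("B", "mhm"),
    ("A", "But I like it now."),
    ("B", "Where did you move from?"),
]

# Hypothetical backchannel word list; the paper defines its own criteria.
BACKCHANNELS = {"mhm", "yeah", "exactly"}


def audiophile_like(segments):
    """Any change of speaker starts a new turn."""
    turns = []
    for speaker, text in segments:
        if turns and turns[-1][0] == speaker:
            turns[-1] = (speaker, turns[-1][1] + " " + text)
        else:
            turns.append((speaker, text))
    return turns


def is_backchannel(text):
    """True if every word in the segment is a known backchannel word."""
    words = re.findall(r"[a-z']+", text.lower())
    return bool(words) and all(w in BACKCHANNELS for w in words)


def backbiter_like(segments):
    """Backchannel-only segments from the listener do not end the current turn."""
    turns = []
    for speaker, text in segments:
        if turns and turns[-1]["speaker"] == speaker:
            turns[-1]["text"] += " " + text
        elif turns and is_backchannel(text):
            # Record the backchannel on the ongoing turn instead of starting a new one.
            turns[-1]["backchannel"].append(text)
        else:
            turns.append({"speaker": speaker, "text": text, "backchannel": []})
    return turns


def cliffhanger_like(segments):
    """Each sentence, split at terminal punctuation, becomes its own turn."""
    turns = []
    for speaker, text in segments:
        for sentence in re.findall(r"[^.?!]+[.?!]*", text):
            if sentence.strip():
                turns.append((speaker, sentence.strip()))
    return turns


print(len(audiophile_like(SEGMENTS)))   # one turn per speaker change
print(len(backbiter_like(SEGMENTS)))    # "mhm" is folded into speaker A's turn
print(len(cliffhanger_like(SEGMENTS)))  # one turn per sentence
```

On this toy input the three functions produce different numbers of turns from the same speech, which is why the number of utterances in the converted corpus depends on the chosen TRANSCRIPTION_TYPE.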
For each utterance we provide:
* id: Unique identifier for an utterance.
* conversation_id: Utterance id corresponding to the first utterance of the conversation.
* reply_to: Utterance id of the previous utterance in the conversation.
* speaker: Speaker object corresponding to the author of this utterance.
* text: Textual content of the utterance.
Metadata for each utterance include:
* turn_id: The id of the turn in the current conversation.
* speaker: Speaker id of the speaker of this turn.
* start: The time that the turn starts in the conversation (in seconds).
* stop: The time that the turn ends in the conversation (in seconds).
* backchannel: The text of any backchannels that occur during this conversational turn. (For "backbiter" transcription type only)
* backchannel_count: The number of backchannel instances (as defined in the paper) that occur during this conversational turn. A backchannel instance can span multiple tokens. (For "backbiter" transcription type only)
* backchannel_speaker: The user_id of the person backchanneling. (For "backbiter" transcription type only)
* backchannel_start: The start time of the first backchannel during this turn. (For "backbiter" transcription type only)
* backchannel_stop: The end time of the last backchannel during this turn. (For "backbiter" transcription type only)
* interval: The time between the end of the last turn and the start of this turn in seconds. Can be negative if turns overlap.
* delta: The length of the turn (i.e., stop-start) in seconds.
* questions: The number of question marks that appear in the utterance.
* end_question: Indicates if the utterance ends with a question mark.
* overlap: Indicates if interval is negative.
* n_words: The number of words in the utterance.
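The timing and text fields above are related to one another by simple definitions. The following sketch recomputes them for a single turn as a sanity check on those definitions. It is an illustration only: the helper name `derive_turn_fields` is hypothetical, the corpus already ships these values, and counting words by whitespace splitting is an assumption that may not match how `n_words` was originally computed.

```python
def derive_turn_fields(text, start, stop, prev_stop=None):
    """Recompute derived utterance fields from text and timestamps (in seconds).

    prev_stop is the stop time of the previous turn, or None for the first turn.
    """
    interval = None if prev_stop is None else start - prev_stop
    return {
        "delta": stop - start,                           # length of the turn
        "interval": interval,                            # gap since the previous turn
        "overlap": interval is not None and interval < 0,
        "questions": text.count("?"),                    # number of question marks
        "end_question": text.rstrip().endswith("?"),
        "n_words": len(text.split()),                    # assumption: whitespace tokens
    }


fields = derive_turn_fields(
    "Really? Where did you move from?", start=12.5, stop=14.0, prev_stop=13.0
)
print(fields)
```

In this example the turn starts half a second before the previous turn ends, so `interval` is negative and `overlap` is true.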
Conversation-level information
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Conversation metadata contain the survey responses of both participants, organized by survey field name; the value of each field maps speaker ids to that speaker's answer.
For each conversation we provide:
* id: id of the conversation
Metadata for each conversation correspond to the answers the two speakers gave in the surveys before and after that conversation.
Each participant completed one survey per conversation, and since every conversation is a video call between two people, there are two surveys per conversation. The metadata are organized as follows:
convo.meta = {"survey field name" : {speaker_id_x : answer by speaker_id_x, speaker_id_y : answer by speaker_id_y} ... }
* i_like_you: How much did you like your conversation partner?
* convo.meta['i_like_you'] = {speaker_id_x : answer by speaker_id_x, speaker_id_y : answer by speaker_id_y}
* you_like_me: How much do you think your conversation partner liked you?
* i_am_funny: How funny were you in the conversation you just had?
* you_are_funny: How funny was your conversation partner?
* i_am_polite: How polite were you during the conversation?
* you_are_polite: How polite was your conversation partner?
* my_isolation_pre_covid: Prior to the Covid-19 outbreak, how socially isolated did you feel?
* my_isolation_post_covid: SINCE the Covid-19 outbreak, how socially isolated have you felt?
* in_common: How much did you and your partner have in common with one another?
* about 200 other survey fields, detailed in the `BetterUp CANDOR Corpus Data Dictionary `_
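Because every survey field maps speaker ids to answers, a common task is pulling out one speaker's answers or their partner's answer to the same question. The sketch below uses a plain dictionary shaped like `convo.meta`; the speaker ids and answer values are invented, and the helper names `answers_for_speaker` and `partner_answer` are hypothetical.

```python
# Stand-in for convo.meta with two survey fields. Ids and values are made up.
convo_meta = {
    "i_like_you": {"speaker_x": 6, "speaker_y": 4},
    "you_like_me": {"speaker_x": 5, "speaker_y": 5},
}


def answers_for_speaker(meta, speaker_id):
    """Collect one speaker's answers across all survey fields."""
    return {
        field: answers[speaker_id]
        for field, answers in meta.items()
        if speaker_id in answers
    }


def partner_answer(meta, field, speaker_id):
    """Return the other participant's answer to a survey field, or None if absent."""
    others = [answer for sid, answer in meta[field].items() if sid != speaker_id]
    return others[0] if others else None


print(answers_for_speaker(convo_meta, "speaker_x"))
print(partner_answer(convo_meta, "i_like_you", "speaker_x"))
```

With a converted corpus, the same helpers could presumably be applied to the `meta` attribute of a ConvoKit conversation, since it follows the layout described above.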
Statistics about the dataset
------------------------------
* Number of Speakers: 1454
* Number of Utterances: 527869 (if TRANSCRIPTION_TYPE = "cliffhanger")
* Number of Conversations: 1650
Additional note
---------------
Data License
^^^^^^^^^^^^
ConvoKit is not distributing the corpus separately, and thus no additional data license is applicable. The license of the original distribution applies.
Contact
^^^^^^^
Questions about the conversion into ConvoKit format should be directed to Sean Zhang.
Questions about the CANDOR corpus should be directed to the corresponding authors of the original paper.