CANDOR Corpus¶
The CANDOR corpus is a dataset of 1650 conversations between strangers over video chat, with rich metadata obtained from pre-conversation and post-conversation surveys. The corpus is available by request from the authors (BetterUp CANDOR Corpus), and ConvoKit contains code for converting the transcripts into ConvoKit format, as detailed below.
A full description of the dataset can be found here: Andrew Reece et al., The CANDOR corpus: Insights from a large multimodal dataset of naturalistic conversation. Sci. Adv. 9, eadf3197 (2023). Please cite this paper when using CANDOR in your research.
Usage¶
Request the CANDOR Corpus (transcripts only) from: BetterUp CANDOR Corpus
Convert the CANDOR Corpus into ConvoKit format using this notebook: Converting CANDOR Corpus to ConvoKit Format
You will need to pick a transcription type when converting the CANDOR corpus to ConvoKit format; this choice affects the Utterance-level metadata. See the Utterance-level information section below for more detail.
Dataset details¶
All ConvoKit metadata attributes preserve the names used in the original corpus, as detailed here: BetterUp CANDOR Corpus Data Dictionary
Speaker-level information¶
There were 1454 unique participants from a broad range of backgrounds. Metadata for each speaker include:
sex: gender of speaker
politics: political persuasion the speaker most identifies with (from very conservative to very liberal)
race: race/ethnicity of speaker
edu: highest level of school the speaker has completed or received
employ: current employment situation of speaker
age: age of speaker
Utterance-level information¶
According to the paper, the transcripts are processed with one of three algorithms that segment speech into turns (utterances): Audiophile, Cliffhanger, and Backbiter. Please refer to the paper for a more detailed description of how the three algorithms are implemented.
Audiophile: A turn lasts from when one speaker starts talking until the other speaker starts speaking.
Cliffhanger: A turn is one full sentence said by one speaker, based on terminal punctuation marks (periods, question marks, and exclamation points).
Backbiter: A turn lasts from when one speaker starts talking until the other speaker says a non-backchannel word (example backchannel words: “mhm”, “yeah”, “exactly”, etc.).
You can pick the transcript processing algorithm in the ConvoKit conversion code by changing the TRANSCRIPTION_TYPE variable. Note that the Utterance-level metadata differ depending on the algorithm used to process the transcripts.
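To make the difference between the three segmentation styles concrete, here is a toy sketch in plain Python. It is not the paper's implementation: the segment data, the function names, and the `BACKCHANNELS` word list are all invented for illustration, and the real algorithms work on timestamped transcripts with their own backchannel criteria.

```python
import re

# Invented toy segments: (speaker, start_sec, stop_sec, text).
SEGMENTS = [
    ("A", 0.0, 1.5, "Hi there."),
    ("A", 1.6, 3.0, "How are you?"),
    ("B", 3.2, 3.5, "mhm"),
    ("A", 3.6, 5.0, "I just moved here."),
    ("B", 5.2, 6.5, "Oh nice, where from?"),
]

# Hypothetical backchannel list; the paper defines its own criteria.
BACKCHANNELS = {"mhm", "yeah", "exactly"}


def new_turn(speaker, start, stop, text):
    return {"speaker": speaker, "start": start, "stop": stop,
            "text": text, "backchannel": []}


def merge_same_speaker(segments):
    """Audiophile-like: a turn lasts until the other speaker starts."""
    turns = []
    for speaker, start, stop, text in segments:
        if turns and turns[-1]["speaker"] == speaker:
            turns[-1]["stop"] = stop
            turns[-1]["text"] += " " + text
        else:
            turns.append(new_turn(speaker, start, stop, text))
    return turns


def split_sentences(turns):
    """Cliffhanger-like: one utterance per sentence ending in . ? or !"""
    utterances = []
    for turn in turns:
        # Sentences with terminal punctuation, plus any unpunctuated remainder.
        for sentence in re.findall(r"[^.?!]+[.?!]+|[^.?!]+$", turn["text"]):
            utterances.append({"speaker": turn["speaker"], "text": sentence.strip()})
    return utterances


def is_backchannel(text):
    words = re.findall(r"[a-z']+", text.lower())
    return bool(words) and all(word in BACKCHANNELS for word in words)


def fold_backchannels(segments):
    """Backbiter-like: a backchannel-only segment from the other speaker
    does not end the current turn; it is recorded on that turn instead."""
    turns = []
    for speaker, start, stop, text in segments:
        if turns and turns[-1]["speaker"] != speaker and is_backchannel(text):
            turns[-1]["backchannel"].append(text)
        elif turns and turns[-1]["speaker"] == speaker:
            turns[-1]["stop"] = stop
            turns[-1]["text"] += " " + text
        else:
            turns.append(new_turn(speaker, start, stop, text))
    return turns


audiophile_like = merge_same_speaker(SEGMENTS)
cliffhanger_like = split_sentences(audiophile_like)
backbiter_like = fold_backchannels(SEGMENTS)
print(len(audiophile_like), len(cliffhanger_like), len(backbiter_like))
```

On this toy input the same speech yields a different number of utterances under each style, which is why the utterance count and some metadata fields depend on the transcription type you choose.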
For each utterance we provide:
id: Unique identifier for an utterance.
conversation_id: Utterance id corresponding to the first utterance of the conversation.
reply_to: Utterance id of the previous utterance in the conversation.
speaker: Speaker object corresponding to the author of this utterance.
text: Textual content of the utterance.
Metadata for each utterance include:
turn_id: The id of the turn in the current conversation.
speaker: Speaker id of the speaker of this turn.
start: The time that the turn starts in the conversation (in seconds).
stop: The time that the turn ends in the conversation (in seconds).
backchannel: The text of any backchannels that occur during this conversational turn. (For “backbiter” transcription type only)
backchannel_count: The number of backchannel instances (as defined in the paper) that occur during this conversational turn. Backchannel instances can be multiple tokens. (For “backbiter” transcription type only)
backchannel_speaker: The user_id of the person backchanneling. (For “backbiter” transcription type only)
backchannel_start: The start time of the first backchannel during this turn. (For “backbiter” transcription type only)
backchannel_stop: The end time of the last backchannel during this turn. (For “backbiter” transcription type only)
interval: The time between the end of the last turn and the start of this turn in seconds. Can be negative if turns overlap.
delta: The length of the turn (i.e., stop-start) in seconds.
questions: The number of question marks that appear in the utterance.
end_question: Indicates if the utterance ends with a question mark.
overlap: Indicates if interval is negative.
n_words: The number of words in the utterance.
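Several of the fields above are simple functions of start, stop, and text. The sketch below recomputes them for one turn so the definitions are easy to check. The helper name `derive_fields` and the example values are hypothetical, and counting words by whitespace splitting is an assumption that may not match how n_words was computed in the original corpus.

```python
def derive_fields(prev_stop, start, stop, text):
    """Recompute the derived fields of one turn from its timing and text.

    prev_stop is the stop time of the previous turn, or None for the
    first turn of a conversation.
    """
    interval = None if prev_stop is None else start - prev_stop
    return {
        "delta": stop - start,                    # length of the turn
        "interval": interval,                     # gap since the previous turn
        "overlap": interval is not None and interval < 0,
        "questions": text.count("?"),             # number of question marks
        "end_question": text.rstrip().endswith("?"),
        "n_words": len(text.split()),             # assumption: whitespace tokens
    }


# Invented example: this turn starts 0.5 s before the previous one ends.
fields = derive_fields(prev_stop=12.5, start=12.0, stop=15.5,
                       text="Really? Where did you grow up?")
print(fields)
```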
Conversation-level information¶
Conversation metadata contain the survey responses of each participant, keyed by survey field name; each value maps speaker ids to that speaker's answer.
For each conversation we provide:
id: id of the conversation
Metadata for each conversation correspond to the answers the two speakers gave in the surveys before and after that conversation. Each conversation is a video call between 2 people and each participant completed 1 survey, so there are 2 surveys per conversation. The metadata are organized in the following way:
convo.meta = {"survey field name": {speaker_id_x: answer by speaker_id_x, speaker_id_y: answer by speaker_id_y}, …}
i_like_you: How much did you like your conversation partner?
Example: convo.meta['i_like_you'] = {speaker_id_x: answer by speaker_id_x, speaker_id_y: answer by speaker_id_y}
you_like_me: How much do you think your conversation partner liked you?
i_am_funny: How funny were you in the conversation you just had?
you_are_funny: How funny was your conversation partner?
i_am_polite: How polite were you during the conversation?
you_are_polite: How polite was your conversation partner?
my_isolation_pre_covid: Prior to the Covid-19 outbreak, how socially isolated did you feel?
my_isolation_post_covid: SINCE the Covid-19 outbreak, how socially isolated have you felt?
in_common: How much did you and your partner have in common with one another?
About 200 other survey fields are detailed in the BetterUp CANDOR Corpus Data Dictionary
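Because each survey field maps speaker ids to answers, comparing the two participants' answers takes only a few dictionary lookups. The sketch below uses a plain dict as a stand-in for convo.meta. The speaker ids, the numeric scores, and the helper names `partner_of` and `liking_gap` are invented for illustration; the real answer scales are described in the data dictionary.

```python
# Toy stand-in for convo.meta; ids and scores are invented.
meta = {
    "i_like_you": {"speaker_x": 6, "speaker_y": 4},
    "you_like_me": {"speaker_x": 5, "speaker_y": 5},
}


def partner_of(field_answers, speaker_id):
    """Return the other speaker id, assuming exactly two participants."""
    (partner,) = [s for s in field_answers if s != speaker_id]
    return partner


def liking_gap(meta, speaker_id):
    """How much the partner reported liking this speaker, minus how much
    this speaker thought the partner liked them."""
    partner = partner_of(meta["i_like_you"], speaker_id)
    return meta["i_like_you"][partner] - meta["you_like_me"][speaker_id]


for speaker_id in meta["i_like_you"]:
    print(speaker_id, liking_gap(meta, speaker_id))
```

A negative gap means the speaker overestimated how much they were liked; a positive gap means they underestimated it.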
Statistics about the dataset¶
Number of Speakers: 1454
Number of Utterances: 527869 (if TRANSCRIPTION_TYPE = “cliffhanger”)
Number of Conversations: 1650
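After converting your own copy, you may want to check it against the counts above. The sketch below counts components through the `iter_speakers`, `iter_utterances`, and `iter_conversations` iterators of a ConvoKit Corpus. It runs here on a small stand-in object so that it does not need the corpus itself; the helper name `count_components` and the path in the comment are hypothetical.

```python
from types import SimpleNamespace


def count_components(corpus):
    """Count speakers, utterances, and conversations in a corpus-like object."""
    return {
        "speakers": sum(1 for _ in corpus.iter_speakers()),
        "utterances": sum(1 for _ in corpus.iter_utterances()),
        "conversations": sum(1 for _ in corpus.iter_conversations()),
    }


# Stand-in object so the sketch runs without the data. With ConvoKit
# installed and the corpus converted, you would load it instead:
#   from convokit import Corpus
#   corpus = Corpus(filename="path/to/converted-candor")  # hypothetical path
toy_corpus = SimpleNamespace(
    iter_speakers=lambda: iter(["s1", "s2"]),
    iter_utterances=lambda: iter(["u1", "u2", "u3"]),
    iter_conversations=lambda: iter(["c1"]),
)

counts = count_components(toy_corpus)
print(counts)
```

Keep in mind that the utterance count depends on the TRANSCRIPTION_TYPE used during conversion, so only the speaker and conversation counts are expected to match across transcription types.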
Additional note¶
Data License¶
ConvoKit is not distributing the corpus separately, and thus no additional data license is applicable. The license of the original distribution applies.
Contact¶
Questions about the conversion into ConvoKit format should be directed to Sean Zhang <kz88@cornell.edu>
Questions about the CANDOR corpus should be directed to the corresponding authors <andrew.reece@betterup.com (A.R.); guscooney@gmail.com (G.C.)> of the original paper.