SPOLIN Corpus
Selected Pairs of Learnable ImprovisatioN (SPOLIN) is a collection of more than 68,000 “Yes, and” type utterance pairs extracted from the long-form improvisation podcast Spontaneanation by Paul F. Tompkins, the Cornell Movie-Dialogs Corpus, and the SubTle corpus.
Dataset details
Speaker-level information
There is no speaker-level metadata.
Each conversation pair has two speakers, named {conversation_id}_speaker_1 for the first turn and {conversation_id}_speaker_2 for the second turn.
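The naming scheme above can be sketched as a small helper. This is only an illustration: the function name `speaker_ids` and the example ID `train_0` are assumptions and are not part of the corpus or of ConvoKit.

```python
from typing import Tuple


def speaker_ids(conversation_id: str) -> Tuple[str, str]:
    """Build the two speaker IDs for a pair from its conversation ID.

    Hypothetical helper: it only mirrors the naming pattern described
    in the text ({conversation_id}_speaker_1 / _speaker_2).
    """
    first = f"{conversation_id}_speaker_1"   # speaker of the first turn
    second = f"{conversation_id}_speaker_2"  # speaker of the second turn
    return first, second


# "train_0" is an illustrative ID, not necessarily one present in the corpus.
print(speaker_ids("train_0"))
# → ('train_0_speaker_1', 'train_0_speaker_2')
```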
Utterance-level information
Every conversation is labeled with its source (Spontaneanation, the Cornell Movie-Dialogs Corpus, or the SubTle corpus) and with whether it abides by the “Yes, and” principle. The “Yes, and” principle is a rule of thumb in improvisational theater: a participant should accept the reality of what the other participant has said (“Yes”) and expand or refine that reality with additional information (“and”). The response does not need to contain the phrase “Yes, and” explicitly.
Metadata for each utterance:
split: whether the utterance belongs to the original dataset’s train or validation set
label: 1 if the utterance is part of a “yes-and” pair, 0 otherwise
source: whether the utterance comes from Spontaneanation (spont), the Cornell Movie-Dialogs Corpus (cornell), or the SubTle corpus (subtle)
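As a rough sketch of how these fields might be used, the function below tallies utterances by source and label. It assumes each utterance exposes a dict-like `meta` with the keys `source` and `label` as listed above, and that `label` is stored as an integer (the stored type is an assumption). The name `count_by_source_and_label` and the toy utterances are hypothetical stand-ins so the example runs without downloading the corpus.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_source_and_label(utterances):
    """Tally utterances by (source, label) using their metadata."""
    counts = Counter()
    for utt in utterances:
        # Assumes the metadata keys described in the text are present.
        counts[(utt.meta["source"], utt.meta["label"])] += 1
    return counts


# Toy objects standing in for corpus utterances (not real data).
toy_utterances = [
    SimpleNamespace(meta={"split": "train", "label": 1, "source": "spont"}),
    SimpleNamespace(meta={"split": "train", "label": 1, "source": "spont"}),
    SimpleNamespace(meta={"split": "valid", "label": 0, "source": "cornell"}),
    SimpleNamespace(meta={"split": "train", "label": 1, "source": "subtle"}),
]

counts = count_by_source_and_label(toy_utterances)
print(counts[("spont", 1)])    # → 2
print(counts[("cornell", 0)])  # → 1
```

With a loaded corpus, passing `corpus.iter_utterances()` to the same function should work under the assumptions above. Since every pair consists of two utterances, utterance-level tallies would be twice the pair-level counts.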
Conversation-level information
Conversation IDs are in the following format: {split}_{idx}, where:
split: either train or valid, indicating whether the conversation belongs to the training set or the validation set.
idx: an integer that corresponds to the order in which the conversation appears in the original dataset.
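The ID format above can be parsed with a few lines of Python. This is a minimal sketch: `parse_conversation_id` is a hypothetical helper, and the example IDs are illustrative rather than taken from the corpus.

```python
from typing import Tuple


def parse_conversation_id(conversation_id: str) -> Tuple[str, int]:
    """Split an ID of the form {split}_{idx} into its two parts."""
    # Split on the last underscore so the index is always the final field.
    split, _, idx = conversation_id.rpartition("_")
    if split not in ("train", "valid"):
        raise ValueError(f"unexpected split in ID: {conversation_id!r}")
    try:
        return split, int(idx)
    except ValueError:
        raise ValueError(f"non-integer index in ID: {conversation_id!r}")


print(parse_conversation_id("train_42"))  # → ('train', 42)
print(parse_conversation_id("valid_3"))   # → ('valid', 3)
```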
Corpus-level information
The metadata is as follows:
Usage
To download directly with ConvoKit:
>>> from convokit import Corpus, download
>>> corpus = Corpus(filename=download("spolin-corpus"))
For some quick stats:
>>> corpus.print_summary_stats()
Number of Speakers: 225194
Number of Utterances: 225194
Number of Conversations: 112597
Number of yesands / non-yesands:
Spontaneanation: 10,959 / 6,087
Cornell: 16,926 / 18,810
SubTle: 40,303 / 19,512
Total: 68,188 / 44,409
Number of yesands / non-yesands (train split):
Spontaneanation: 10,459 / 5,587
Cornell: 16,426 / 18,310
SubTle: 40,303 / 19,512
Total: 67,188 / 43,409
Number of yesands / non-yesands (validation split):
Spontaneanation: 500 / 500
Cornell: 500 / 500
Total: 1,000 / 1,000
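As a quick consistency check on the figures above, the per-source counts can be summed and compared with the reported totals. The numbers are copied from this page; the dictionary layout and the helper name `totals` are just one way to arrange them.

```python
# (yesands, non-yesands) per source, copied from the statistics above.
counts = {
    "all": {
        "Spontaneanation": (10959, 6087),
        "Cornell": (16926, 18810),
        "SubTle": (40303, 19512),
    },
    "train": {
        "Spontaneanation": (10459, 5587),
        "Cornell": (16426, 18310),
        "SubTle": (40303, 19512),
    },
    "valid": {
        "Spontaneanation": (500, 500),
        "Cornell": (500, 500),
    },
}


def totals(per_source):
    """Sum the yes-and and non-yes-and counts over all sources."""
    yes = sum(y for y, _ in per_source.values())
    no = sum(n for _, n in per_source.values())
    return yes, no


print(totals(counts["all"]))    # → (68188, 44409)
print(totals(counts["train"]))  # → (67188, 43409)
print(totals(counts["valid"]))  # → (1000, 1000)
print(sum(totals(counts["all"])))  # → 112597
```

The last value matches the reported number of conversations (112597), and the train and validation totals add up to the overall totals.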
Additional notes
More details about the SPOLIN project can be found at https://justin-cho.com/spolin
License
This dataset is shared under the Creative Commons Attribution-NonCommercial 4.0 International License.
Publication to cite
Please cite this paper when using it in your research:
Contact
Please email any questions to Hyundong Justin Cho (jcho@isi.edu), Information Sciences Institute, University of Southern California.