SPOLIN Corpus

Selected Pairs of Learnable ImprovisatioN (SPOLIN) is a collection of more than 68,000 “Yes, and” type utterance pairs extracted from the long-form improvisation podcast Spontaneanation by Paul F. Tompkins, the Cornell Movie-Dialogs Corpus, and the SubTle corpus.

Dataset details

Speaker-level information

There is no speaker-level metadata.

Each conversation pair has two speakers named {conversation_id}_speaker_1 for the first turn and {conversation_id}_speaker_2 for the second turn.
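Because the speaker ID encodes the turn position, the two turns of a pair can be told apart without relying on iteration order. The following is a minimal sketch, assuming each utterance object exposes `.speaker.id` and `.text` as ConvoKit utterances do. The helper name `split_pair` is hypothetical, and the demo objects are stand-ins with made-up text rather than real corpus entries.

```python
from types import SimpleNamespace


def split_pair(utterances):
    """Return (prompt_text, response_text) for one two-turn conversation.

    Turn order is read from the speaker ID suffix (_speaker_1 / _speaker_2),
    not from the order in which the utterances are iterated.
    """
    prompt = response = None
    for utt in utterances:
        if utt.speaker.id.endswith("_speaker_1"):
            prompt = utt.text
        elif utt.speaker.id.endswith("_speaker_2"):
            response = utt.text
    if prompt is None or response is None:
        raise ValueError("expected one _speaker_1 and one _speaker_2 utterance")
    return prompt, response


# Stand-in objects (not real corpus data), deliberately listed out of order.
demo = [
    SimpleNamespace(speaker=SimpleNamespace(id="train_0_speaker_2"), text="second turn"),
    SimpleNamespace(speaker=SimpleNamespace(id="train_0_speaker_1"), text="first turn"),
]
print(split_pair(demo))
```

With a loaded corpus, the same helper should accept `convo.iter_utterances()` for each `convo` in `corpus.iter_conversations()`.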

Utterance-level information

Every conversation is labeled with its source (Spontaneanation, the Cornell Movie-Dialogs Corpus, or the SubTle corpus) and whether or not it abides by the “Yes, and” principle. The “Yes, and” principle is a rule of thumb in improvisational theater: a participant should accept the reality of what the other participant has said (“Yes”) and expand or refine that reality with additional information (“and”). The response does not need to contain the phrase “Yes, and” explicitly.

Metadata for each utterance:

  • split: whether it belongs to the original dataset’s train or validation set

  • label: 1 if it is part of a “yes-and” pair or 0 otherwise

  • source: whether it comes from Spontaneanation (spont), Cornell Movie-Dialogs Corpus (cornell), or the SubTle corpus (subtle)
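These three fields are enough to tally utterances by source and label. The sketch below works on any iterable of mappings with `source` and `label` keys; the function name `tally_labels` and the demo records are hypothetical, and the label is assumed to be stored as the integer 1 or 0 as described above.

```python
from collections import Counter


def tally_labels(metas):
    """Count utterances per (source, label) from an iterable of metadata mappings."""
    counts = Counter()
    for meta in metas:
        counts[(meta["source"], meta["label"])] += 1
    return counts


# Made-up metadata records that follow the three fields listed above.
demo_metas = [
    {"split": "train", "label": 1, "source": "spont"},
    {"split": "train", "label": 1, "source": "spont"},
    {"split": "valid", "label": 0, "source": "cornell"},
]
counts = tally_labels(demo_metas)
print(counts[("spont", 1)], counts[("cornell", 0)])
```

On a loaded corpus, `tally_labels(utt.meta for utt in corpus.iter_utterances())` should give per-utterance counts. Since the metadata is attached to each utterance, every pair would presumably be counted twice.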

Conversation-level information

Conversation IDs are in the following format: {split}_{idx}, where:

  • split: either train or valid, indicating whether the conversation belongs to the training set or the validation set.

  • idx: an integer value that corresponds to the order that it appears in the original dataset.
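The ID format can be taken apart with plain string handling. This is a small sketch under the assumption that IDs always follow `{split}_{idx}` exactly as described; the function names are hypothetical and not part of ConvoKit.

```python
def parse_conversation_id(conversation_id):
    """Split a '{split}_{idx}' conversation ID into (split, idx)."""
    split, sep, idx = conversation_id.rpartition("_")
    if not sep or split not in ("train", "valid") or not idx.isdecimal():
        raise ValueError(f"unexpected conversation ID: {conversation_id!r}")
    return split, int(idx)


def speaker_ids(conversation_id):
    """Build the two speaker IDs that belong to a conversation ID."""
    return (f"{conversation_id}_speaker_1", f"{conversation_id}_speaker_2")


print(parse_conversation_id("valid_12"))
print(speaker_ids("valid_12"))
```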

Corpus-level information

The metadata is as follows:

{
    "name": "spolin",
    "brief description": "Selected Pairs of Learnable ImprovisatioN (SPOLIN) is a collection of more than 68,000 \"Yes, and\" type dialogue pairs extracted from the Spontaneanation podcast by Paul F. Tompkins, the Cornell Movie-Dialogs Corpus, and the SubTle corpus.",
    "authors": "Hyundong Justin Cho, Jonathan May",
    "poc_email": "jcho@isi.edu",
    "github_url": "https://github.com/wise-east/spolin",
    "publication_title": "Grounding Conversations with Improvised Dialogues",
    "publication_venue": "ACL2020",
    "publication_url": "https://aclanthology.org/2020.acl-main.218/",
    "license": "Creative Commons Attribution-NonCommercial 4.0 International License"
}

Usage

To download directly with ConvoKit:

>>> from convokit import Corpus, download
>>> corpus = Corpus(filename=download("spolin-corpus"))

For some quick stats:

>>> corpus.print_summary_stats()
Number of Speakers: 225194
Number of Utterances: 225194
Number of Conversations: 112597

Number of yesands / non-yesands:

  • Spontaneanation: 10,959 / 6,087

  • Cornell: 16,926 / 18,810

  • SubTle: 40,303 / 19,512

  • Total: 68,188 / 44,409

Number of yesands / non-yesands (train split):

  • Spontaneanation: 10,459 / 5,587

  • Cornell: 16,426 / 18,310

  • SubTle: 40,303 / 19,512

  • Total: 67,188 / 43,409

Number of yesands / non-yesands (validation split):

  • Spontaneanation: 500 / 500

  • Cornell: 500 / 500

  • Total: 1,000 / 1,000
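As a quick consistency check, the per-split counts above can be summed and compared with the summary statistics. The sketch below only re-adds the numbers listed in this section; the dictionary layout and variable names are illustrative.

```python
# (yes-and, non-yes-and) pair counts copied from the split tables above.
train = {"spont": (10_459, 5_587), "cornell": (16_426, 18_310), "subtle": (40_303, 19_512)}
valid = {"spont": (500, 500), "cornell": (500, 500)}


def totals(table):
    """Sum the yes-and and non-yes-and columns of one split."""
    yes = sum(y for y, _ in table.values())
    no = sum(n for _, n in table.values())
    return yes, no


train_yes, train_no = totals(train)
valid_yes, valid_no = totals(valid)
n_pairs = train_yes + train_no + valid_yes + valid_no

print(train_yes, train_no)    # 67188 43409
print(n_pairs, 2 * n_pairs)   # 112597 225194
```

The pair total matches the reported number of conversations, and twice that total matches the reported number of utterances and speakers, which is consistent with every pair having exactly two turns and two speakers.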

Additional notes

More details about the SPOLIN project can be found at: https://justin-cho.com/spolin

Publication to cite

Please cite this paper when using this corpus in your research:

@inproceedings{cho2020spolin,
title={Grounding Conversations with Improvised Dialogues},
author={Cho, Hyundong and May, Jonathan},
booktitle ={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
year={2020}
}

Contact

Please email any questions to Hyundong Justin Cho (jcho@isi.edu), Information Sciences Institute, University of Southern California.