NPR Interview 2P Dataset Corpus

This corpus contains conversations between NPR show hosts and their guests. It comprises 22,149 conversations, with 428,624 utterances from 22,257 speakers.

This is a ConvoKit-formatted version of the dataset originally distributed with the following paper:

Bodhisattwa Prasad Majumder, Shuyang Li, Jianmo Ni, and Julian McAuley. 2020. Interview: Large-Scale Modeling of Media Dialog with Discourse Patterns and Knowledge Grounding. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 8129–41.

Please cite this paper when using this corpus in your research.

Dataset Details

Speaker-Level Information

In this dataset, each speaker is either a show host or guest. The speaker index is the same as the index given in the original dataset, and the following metadata is also provided:
  • name: the speaker’s name as given in the original dataset

  • type: host or guest, depending on the speaker’s role
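
As a quick check of the speaker metadata described above, the sketch below tallies speakers by their type field. It assumes each speaker object exposes a dict-like meta attribute, as ConvoKit Speaker objects do; the helper name count_speaker_types is hypothetical and not part of ConvoKit.

```python
from collections import Counter


def count_speaker_types(speakers):
    """Tally speakers by the `type` metadata field (host or guest)."""
    # Assumption: every speaker has a dict-like `.meta`; speakers with no
    # `type` entry are counted under "unknown" instead of raising an error.
    counts = Counter(speaker.meta.get("type", "unknown") for speaker in speakers)
    return dict(counts)
```

With the corpus loaded as shown under Usage, this could be called as count_speaker_types(corpus.iter_speakers()).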

Utterance-Level Information

The following information about each utterance is provided:
  • id: the index of the utterance in the dataset

  • speaker: the speaker who said the utterance

  • reply_to: the id of the utterance which this utterance replies to, or None if none exists.

  • timestamp: null for the entirety of this corpus

  • text: the text of each utterance

  • episode: the id of the episode this utterance appears in

  • order: the index of this utterance within the episode
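
The episode and order fields make it possible to rebuild an episode's transcript in broadcast order. The sketch below assumes both fields are stored in each utterance's meta mapping and that order can be converted to an integer; neither is confirmed here, so verify against a loaded utterance before relying on it. The helper name episode_transcript is hypothetical.

```python
def episode_transcript(utterances, episode_id):
    """Return (speaker id, text) pairs for one episode, sorted by `order`."""
    # Assumption: `episode` and `order` are stored in each utterance's `.meta`.
    in_episode = [u for u in utterances if u.meta.get("episode") == episode_id]
    # Assumption: `order` is an int or a numeric string, so int() normalizes it.
    in_episode.sort(key=lambda u: int(u.meta["order"]))
    return [(u.speaker.id, u.text) for u in in_episode]
```

With the corpus loaded, a call might look like episode_transcript(corpus.iter_utterances(), episode_id), where episode_id must have the same type (integer or string) as the stored episode values.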

Conversation-Level Information

Conversations are indexed by the id of the first utterance that appears in the conversation. The following information about each conversation is provided:
  • program: the name of the NPR radio program this episode appears in

  • title: the title of this episode

  • date: the date this episode aired
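
The conversation metadata above can be used to group episodes by show. The following is a minimal sketch, assuming each conversation exposes a dict-like meta attribute with program and title keys, as ConvoKit Conversation objects would for the fields listed here; titles_by_program is a hypothetical helper name.

```python
from collections import defaultdict


def titles_by_program(conversations):
    """Map each NPR program name to the list of episode titles it contains."""
    titles = defaultdict(list)
    for convo in conversations:
        # Assumption: `program` and `title` live in the conversation's `.meta`;
        # a missing key is grouped or recorded as None rather than raising.
        titles[convo.meta.get("program")].append(convo.meta.get("title"))
    return dict(titles)
```

With the corpus loaded, this could be called as titles_by_program(corpus.iter_conversations()).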

Usage

To convert the NPR-2P Corpus into ConvoKit format, use this notebook: Converting NPR-2P Corpus to ConvoKit Format.

To download directly with ConvoKit:

>>> from convokit import Corpus, download
>>> corpus = Corpus(filename=download("npr-2p-corpus"))

For some quick stats:

>>> corpus.print_summary_stats()
Number of Speakers: 22267
Number of Utterances: 428624
Number of Conversations: 22149

Additional note

Contact

Please email any questions to Andrea (aww66@cornell.edu), Lucy (lj287@cornell.edu), or Rebecca (rmh327@cornell.edu).

Files

The original dataset can be found on Kaggle

Dataset Access

Cleaning/Conversion Script