NPR Interview 2P Dataset Corpus

This corpus contains conversations between NPR show hosts and their guests. The corpus contains dialog from 22,257 speakers with 428,624 utterances and 22,149 conversations total.

This is a ConvoKit-formatted version of the dataset originally distributed with the following paper:

Bodhisattwa Prasad Majumder, Shuyang Li, Jianmo Ni, and Julian McAuley. 2020. Interview: Large-Scale Modeling of Media Dialog with Discourse Patterns and Knowledge Grounding. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 8129–41.

Please cite this paper when using the NPR-2P corpus in your research.

Dataset Details

Speaker-Level Information

In this dataset, each speaker is either a show host or guest. The speaker index is the same as the index given in the original dataset, and the following metadata is also provided:

name: the speaker’s name as given in the original dataset
type: host or guest, depending on the speaker’s role
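As a rough illustration of how the speaker metadata might be used, the sketch below tallies speakers by role. It assumes a loaded `corpus` object and that the `type` field is reachable through `speaker.meta`; the helper name `count_speaker_types` is hypothetical and not part of ConvoKit.

```python
from collections import Counter


def count_speaker_types(corpus):
    """Tally speakers by their `type` metadata field ("host" or "guest")."""
    # iter_speakers() yields every Speaker object in the corpus.
    # Assumption: each speaker carries the `type` field described above.
    return Counter(speaker.meta["type"] for speaker in corpus.iter_speakers())
```

Calling `count_speaker_types(corpus)` on the downloaded corpus should return a `Counter` keyed by `host` and `guest`.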

Utterance-Level Information

The following information about each utterance is provided:

id: the index of the utterance in the dataset
speaker: the speaker who said the utterance
reply_to: the id of the utterance which this utterance replies to, or None if none exists.
timestamp: null for the entirety of this corpus
text: the text of each utterance
episode: the id of the episode this utterance appears in
order: the index of this utterance within the episode
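The `reply_to` field links each utterance to the one it answers, so a thread can be recovered by following those links back to the first utterance. The following is a minimal sketch of that idea, assuming a loaded `corpus`; the helper name `reply_chain` is hypothetical, and it relies only on `corpus.get_utterance` and the `reply_to` attribute.

```python
def reply_chain(corpus, utterance_id):
    """Return the utterances leading up to `utterance_id`, oldest first."""
    chain = []
    current_id = utterance_id
    # Follow reply_to links until reaching an utterance with no parent (None).
    while current_id is not None:
        utt = corpus.get_utterance(current_id)
        chain.append(utt)
        current_id = utt.reply_to
    # The walk runs newest to oldest, so reverse it for reading order.
    chain.reverse()
    return chain
```

Each element of the returned list is an utterance object, so `[(u.speaker.id, u.text) for u in reply_chain(corpus, some_id)]` would give a readable transcript up to that point, where `some_id` stands for any utterance id in the corpus.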

Conversation-Level Information

Conversations are indexed by the id of the first utterance that appears in the conversation. The following information about each conversation is provided:

program: the name of the NPR radio program this episode appears in
title: the title of this episode
date: the date this episode aired
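The conversation metadata can be used to organize episodes, for example by show. The sketch below groups conversation ids by the `program` field; it assumes a loaded `corpus` and that the fields above are available through `convo.meta`. The helper name `conversations_by_program` is hypothetical.

```python
from collections import defaultdict


def conversations_by_program(corpus):
    """Map each NPR program name to the ids of its conversations."""
    groups = defaultdict(list)
    # iter_conversations() yields every Conversation object in the corpus.
    for convo in corpus.iter_conversations():
        groups[convo.meta["program"]].append(convo.id)
    # Return a plain dict so missing programs raise KeyError on lookup.
    return dict(groups)
```

The resulting dictionary makes it straightforward to count episodes per program or to sample conversations from a single show.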

Usage

To download directly with ConvoKit:

>>> from convokit import Corpus, download
>>> corpus = Corpus(filename=download("npr-2p-corpus"))

For some quick stats:

>>> corpus.print_summary_stats()
Number of Speakers: 22267
Number of Utterances: 428624
Number of Conversations: 22149

Additionally, if you want to process the original NPR-2P data into ConvoKit format, you can use the following script: Converting NPR-2P Corpus to ConvoKit Format

Additional note

Contact

Please email any questions to Andrea (aww66@cornell.edu), Lucy (lj287@cornell.edu), or Rebecca (rmh327@cornell.edu).

Files

The original dataset can be found on Kaggle

Dataset Access

Cleaning/Conversion Script