NPR Interview 2P Dataset Corpus
===============================

This corpus contains conversations between NPR show hosts and their guests. It contains dialog from 22,257 speakers, with 428,624 utterances and 22,149 conversations in total.

This is a ConvoKit-formatted version of the dataset originally distributed with the following paper:

Bodhisattwa Prasad Majumder, Shuyang Li, Jianmo Ni, and Julian McAuley. 2020. `Interview: Large-Scale Modeling of Media Dialog with Discourse Patterns and Knowledge Grounding. `_ In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 8129–41.

Please cite this paper when using the NPR-2P corpus in your research.

Dataset Details
---------------

Speaker-Level Information
^^^^^^^^^^^^^^^^^^^^^^^^^

In this dataset, each speaker is either a show host or a guest. The speaker index is the same as the index given in the original dataset, and the following metadata is also provided:

* name: the speaker’s name as given in the original dataset
* type: host or guest, depending on the speaker’s role

Utterance-Level Information
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following information about each utterance is provided:

* id: the index of the utterance in the dataset
* speaker: the speaker who said the utterance
* reply_to: the id of the utterance to which this utterance replies, or None if none exists
* timestamp: null for the entirety of this corpus
* text: the text of the utterance
* episode: the id of the episode this utterance appears in
* order: the index of this utterance within the episode

Conversational-Level Information
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Conversations are indexed by the id of the first utterance that appears in the conversation.
The following information about each conversation is provided:

* program: the name of the NPR radio program this episode appears in
* title: the title of this episode
* date: the date this episode aired

Usage
-----

To download directly with ConvoKit:

>>> from convokit import Corpus, download
>>> corpus = Corpus(filename=download("npr-2p-corpus"))

For some quick stats:

>>> corpus.print_summary_stats()
Number of Speakers: 22267
Number of Utterances: 428624
Number of Conversations: 22149

Additionally, if you want to process the original NPR-2P data into ConvoKit format, you can use the following script: `Converting NPR-2P Corpus to ConvoKit Format `_

Additional note
---------------

Contact
^^^^^^^

Please email any questions to Andrea (aww66@cornell.edu), Lucy (lj287@cornell.edu), or Rebecca (rmh327@cornell.edu).

Files
^^^^^^^

The original dataset can be found on `Kaggle `_

Dataset `Access `_

Cleaning/Conversion `Script `_
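
Example: Reading the Metadata
-----------------------------

The speaker-level and utterance-level fields described under Dataset Details can be combined to count utterances by role and to rebuild one episode in order. The following is a minimal sketch, not part of the original conversion script. It assumes that ``type`` is stored in each speaker's ``meta`` and that ``episode`` and ``order`` are stored in each utterance's ``meta``. The helper names ``count_utterances_by_role`` and ``episode_transcript`` are hypothetical, and the sample objects are made-up stand-ins so the sketch runs without downloading the corpus.

```python
from collections import Counter
from types import SimpleNamespace


def count_utterances_by_role(utterances):
    """Count utterances per speaker role ("host" or "guest").

    Assumes each utterance exposes speaker.meta["type"].
    """
    return Counter(utt.speaker.meta["type"] for utt in utterances)


def episode_transcript(utterances, episode_id):
    """Return (role, text) pairs for one episode, sorted by "order".

    Assumes "episode" and "order" are stored in each utterance's meta.
    """
    in_episode = [u for u in utterances if u.meta["episode"] == episode_id]
    in_episode.sort(key=lambda u: u.meta["order"])
    return [(u.speaker.meta["type"], u.text) for u in in_episode]


# Made-up stand-in objects (not real corpus data) so the sketch runs offline.
host = SimpleNamespace(meta={"name": "Host A", "type": "host"})
guest = SimpleNamespace(meta={"name": "Guest B", "type": "guest"})
sample = [
    SimpleNamespace(speaker=guest, text="Thanks for having me.",
                    meta={"episode": 1, "order": 1}),
    SimpleNamespace(speaker=host, text="Welcome to the show.",
                    meta={"episode": 1, "order": 0}),
    SimpleNamespace(speaker=host, text="Good morning.",
                    meta={"episode": 2, "order": 0}),
]

role_counts = count_utterances_by_role(sample)
transcript = episode_transcript(sample, 1)
print(role_counts)
print(transcript)
```

With the downloaded corpus, the same helpers should accept ``list(corpus.iter_utterances())`` in place of ``sample``, provided the metadata is laid out as assumed above.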