NewsInterview Corpus¶
A collection of 500 two-person informational interviews from National Public Radio (NPR) and Cable News Network (CNN), containing 16,396 utterances from 860 speakers. The interviews are journalistic exchanges between an interviewer and a source, dated from 2000 to 2020.
A full description of the dataset can be found here: NewsInterview: a Dataset and a Playground to Evaluate LLMs’ Grounding Gap via Informational Interviews. Alexander Spangher, Michael Lu, Sriya Kalyan, Hyundong Justin Cho, Tenghao Huang, Weiyan Shi, Jonathan May. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025.
Dataset details¶
Speaker-level information¶
Speakers in this dataset are identified by unique IDs. Each speaker has the following metadata:
display_name: original speaker name as it appears in the transcript
role: speaker type, one of HOST (interview host/anchor; 76 speakers), GUEST (interview subject/interviewee, default if not specified; 738 speakers), or BYLINE (reporter/correspondent; 46 speakers)
programs: list of programs this speaker appears in
num_interviews: total number of interviews the speaker participated in
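As a rough illustration of how the role field might be used, the sketch below tallies speakers by role. The helper name count_roles is hypothetical and not part of ConvoKit, and treating a missing role as GUEST is an assumption based on the default described above. It accepts any iterable of metadata mappings, such as the meta attribute of ConvoKit speakers.

```python
from collections import Counter


def count_roles(speaker_metas):
    """Tally speakers by role from an iterable of metadata mappings.

    Assumption: a mapping without a "role" key is counted as GUEST,
    mirroring the default described in the speaker metadata list.
    """
    return Counter(meta.get("role", "GUEST") for meta in speaker_metas)
```

With a loaded corpus, this could be called as count_roles(s.meta for s in corpus.iter_speakers()); on the full dataset the tallies would be expected to match the per-role speaker counts listed above.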
Utterance-level information¶
Each utterance corresponds to a single speaking turn in an interview. For each utterance, we provide:
id: unique utterance identifier
speaker: speaker ID reference
conversation_id: ID of the interview this utterance belongs to
reply_to: ID of the previous utterance (for threading)
timestamp: time marker (if available)
text: textual content of the utterance
Metadata for each utterance include:
interview_id: original interview identifier
turn_order: position in the conversation sequence
program: NPR/CNN program name
date: interview broadcast date
url: source URL (when available)
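The reply_to field described above can be used to recover the order of turns in an interview. The following is a minimal sketch, assuming each interview is a single linear chain in which the first utterance has reply_to set to None and every other utterance replies to exactly one earlier turn. The function name thread_order is hypothetical.

```python
def thread_order(reply_to_by_id):
    """Return utterance IDs in conversational order.

    `reply_to_by_id` maps each utterance ID to the ID it replies to,
    or to None for the opening utterance. Assumes a linear chain.
    """
    # Invert the mapping so each parent ID points at the turn that follows it.
    next_turn = {
        parent: utt_id
        for utt_id, parent in reply_to_by_id.items()
        if parent is not None
    }
    roots = [utt_id for utt_id, parent in reply_to_by_id.items() if parent is None]
    if len(roots) != 1:
        raise ValueError("expected exactly one opening utterance")

    # Walk forward from the opening utterance until the chain ends.
    order = [roots[0]]
    while order[-1] in next_turn:
        order.append(next_turn[order[-1]])
    return order
```

For a ConvoKit conversation object convo, the input mapping could be built as {u.id: u.reply_to for u in convo.iter_utterances()}. If the turn_order metadata field is populated, sorting on it is a simpler alternative.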
Conversation-level information¶
Each conversation represents a complete interview. Metadata associated with conversations include:
title: interview title (when available)
summary: interview summary or description
program: source program name (63 unique programs total)
date: broadcast/publication date (ranging from 2000 to 2020)
url: original source URL
info_items: extracted information items from the interview
info_items_dict: structured version of information items
outlines: interview objectives/outline
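To show how the conversation metadata might be put to work, the sketch below groups interview IDs by their program field. The helper name group_by_program is hypothetical, and filing conversations that lack a program under None is an assumption made for illustration.

```python
from collections import defaultdict


def group_by_program(meta_by_conversation_id):
    """Group conversation IDs by the "program" value in their metadata.

    `meta_by_conversation_id` maps a conversation ID to its metadata mapping.
    """
    groups = defaultdict(list)
    for convo_id, meta in meta_by_conversation_id.items():
        # Assumption: a conversation without a program is grouped under None.
        groups[meta.get("program")].append(convo_id)
    return dict(groups)
```

With a loaded corpus, the input could be built as {c.id: c.meta for c in corpus.iter_conversations()}; the number of keys in the result would then be expected to correspond to the number of unique programs.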
Usage¶
To download directly with ConvoKit:
>>> from convokit import Corpus, download
>>> corpus = Corpus(filename=download("news-interview"))
For some quick stats:
>>> corpus.print_summary_stats()
Number of Speakers: 860
Number of Utterances: 16396
Number of Conversations: 500
Additional notes¶
Data License¶
This dataset is shared under the Creative Commons Attribution 4.0 International License.
Dataset Access¶
The original dataset can be accessed from the authors’ GitHub repository at: https://github.com/alex2awesome/news-interview-question-generation
Contact¶
The ConvoKit-formatted corpus was created by Axel Bax (adb333@cornell.edu) from the dataset created by Sarkar et al. Corresponding author: Rupak Sarkar (rupak@umd.edu).