Parliament Question Time Corpus

A collections of questions and answers from parliamentary question periods in the British House of Commons from May 1979 to December 2016 (433,787 utterances), scraped from They Work For You <https://www.theyworkforyou.com/>_.

Distributed together with: Asking Too Much? The Rhetorical Role of Questions in Political Discourse. Justine Zhang, Arthur Spirling, Cristian Danescu-Niculescu-Mizil. EMNLP 2017.

Dataset details

Speaker-level information

The speakers in the dataset are members of the Parliament (MP). For each MP, the dataset further includes the following metadata:

  • name: name of the MP
  • member_start: start date of the MP as the member of the Parliament
  • member_end: end date of the MP as the member of the Parliament (set to year 3020 if the MP was still in Parliament as of Dec 2016)
  • parties: a list of parties that the MP has belonged to in the past
  • first_govt: first government (by Prime Minister) in which the MP was in office
  • first_govt_coarse: first government in which the MP was in office. here, consecutive governments of the same party (e.g., thatcher+major) are grouped together.

Note that some of the metadata information may be missing, especially for MPs active before the Blair government.

Utterance-level information

Each question or answer is viewed as an utterance. For each utterance, we provide:

  • id: index of the utterance
  • speaker: the MP who spoke the utterance
  • conversation_id: id of the first utterance in the conversation this utterance belongs to
  • reply_to: id of the utterance to which this utterance replies to (None if the utterance is not a reply)
  • timestamp: time of the utterance
  • text: textual content of the utterance

Additional metadata include:

  • next_id: id of the utterance replying to this one (None if the utterance has no reply)
  • is_question: whether the utterance is a question
  • is_answer: whether the utterance is an answer to a question
  • pair_idx: index of the question-answer pair
  • is_incumbent: whether the MP is incumbent (i.e., a member of the government party)
  • is_minister: whether the MP is a Minister
  • is_oppn: whether the MP is from the official opposition party
  • party: party affiliation of the MP
  • tenure: the number of years that the MP has been in office at the time of the utterance
  • govt: current government (by Prime Minister) at the time of the utterance
  • govt_coarse: current government (by Prime Minister) at the time of the utterance. here, consecutive governments of the same party (e.g., thatcher+major) are grouped together.
  • pair_has_features: whether the pair to which the utterance belongs has a question that contains at least one q_arc term and an answer that contains at least one arcs term.
  • dept_name: the name of the department to which the answering minister for the question or answer belongs. inferred from the raw HTML (see dept_name_raw for an un-processed version of the same attribute).
  • dept_name_coarse: department name, listed as other for departments other than the 10 containing the most question-answer pairs.

Note that some of the metadata information may be missing, especially for utterances dating back to before the Blair government.

The dataset also comes with the following processed fields, which can be loaded separately via corpus.load_info(‘utterance’,[list of fields]):

  • parsed: SpaCy dependency parse
  • arcs: dependency parse arcs, without nouns
  • q_arcs: dependency parse arcs for questions only, without nouns

The latter two fields are used in the original publication to represent utterances.

Usage

To download directly with ConvoKit:

>>> from convokit import Corpus, download
>>> corpus = Corpus(filename=download("parliament-corpus"))

For some quick stats:

>>> corpus.print_summary_stats()
Number of Speakers: 1978
Number of Utterances: 433787
Number of Conversations: 216894

Additional note

See this example notebook for an example of how to group questions in this dataset according to their rhetorical roles.