Supreme Court Oral Arguments Corpus

A collection of cases from the U.S. Supreme Court, along with transcripts of oral arguments. Contains approximately 1,700,000 utterances over 8,000 oral argument transcripts from 7,700 cases.

The data comes from two sources: transcripts were scraped from the Oyez website, while voting information comes from the Supreme Court Database (SCDB).

Along with the entire corpus, we release another version split up into different years spanning 1955 to 2019, each named “supreme-(year)”. Additional metadata are also included for each case here.

The following examples use this corpus:

Some considerations regarding case and voting information

Each case in the data can have multiple conversations, corresponding to multiple sessions of oral arguments heard. For convenience, we include information for each conversation about how justices voted in the corresponding case, meaning that vote information will be repeated across each conversation corresponding to a case. The case metadata file also lists vote information.
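Because vote information is repeated on every conversation of a case, per-case statistics are easiest to compute after collapsing conversations by case_id. The following is a minimal sketch on made-up conversation metadata (plain dictionaries with hypothetical IDs, not ConvoKit objects); the helper name outcomes_by_case is an assumption for illustration.

```python
# Made-up conversation metadata that reuses the documented field names.
conversations = [
    {"id": "t01", "case_id": "1990_89-1", "win_side": 1},
    {"id": "t02", "case_id": "1990_89-1", "win_side": 1},  # second session, same case
    {"id": "t03", "case_id": "1991_90-7", "win_side": 0},
]

def outcomes_by_case(convos):
    """Keep one win_side value per case, ignoring repeated sessions."""
    by_case = {}
    for convo in convos:
        # setdefault keeps the first value seen for each case_id
        by_case.setdefault(convo["case_id"], convo["win_side"])
    return by_case

case_outcomes = outcomes_by_case(conversations)
petitioner_wins = sum(1 for side in case_outcomes.values() if side == 1)
print(len(conversations), len(case_outcomes), petitioner_wins)
```

Counting over conversations directly would count the first case twice, which is the pitfall this sketch avoids.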

The docket ID was used, along with some heuristics, to match cases in Oyez with those in SCDB. While most cases could be matched this way, a few were done manually (by inspecting case names and decision dates) and a few appear to be missing; please let us know of any mistakes you encounter. The case metadata file contains information about which case IDs, in our data, map to which docket IDs in the SCDB dataset.

SCDB makes finer distinctions about justice votes and case outcomes than whether the petitioner or respondent won. This finer-grained information is listed in the case metadata file; for the vote information included per-conversation in the corpus, we map justice vote and case outcome information to whether the vote/case was in favor of the petitioner or respondent. See the description of the case metadata file below for details.

Usage

To download the entire corpus:

>>> from convokit import Corpus, download
>>> corpus = Corpus(filename=download("supreme-corpus"))

To download a particular year:

>>> from convokit import Corpus, download
>>> corpus = Corpus(filename=download("supreme-2019"))

Dataset details

Speaker-level information

Speakers correspond to justices and lawyers (also referred to as advocates).

For each Speaker, we provide:

  • id: the ID of the Speaker. If provided in Oyez, we use this ID, such that further information about advocates or justices may be found at oyez.org/advocates/<id> or oyez.org/justices/<id>. Otherwise this is inferred (see below).

  • name: the name of the Speaker, as listed in transcripts.

  • type: whether the speaker is a justice J, advocate A or unknown U.

Additional details:

  • When possible, we tried to ensure Speaker information corresponds to information provided in Oyez. Oyez usually provides explicit lists of the speakers involved in each oral argument, especially for more recent cases; earlier ones are missing these explicit lists. Where no such list is available, we tried to follow the Oyez format for converting between names listed in transcripts and IDs (i.e., replacing spaces with underscores and lowercasing).
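The name-to-ID conversion described above (lowercasing and replacing spaces with underscores) can be sketched as follows. This is an illustration only: the function name inferred_speaker_id and the sample name are hypothetical, and the source does not say how punctuation or other special characters are handled, so this sketch leaves them untouched.

```python
def inferred_speaker_id(name: str) -> str:
    """Lowercase a transcript name and replace spaces with underscores."""
    # Punctuation handling is not described in the source, so none is attempted.
    return name.strip().lower().replace(" ", "_")

print(inferred_speaker_id("Jane Q Public"))
```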

Conversation-level information

Conversations correspond to different sessions of oral arguments and re-arguments. Importantly, note that a case can have multiple conversations.

For each Conversation, we provide:

  • id: we use the ID of the corresponding transcript, as provided by Oyez.

  • case_id: the ID of the case (see below).

  • advocates: a dictionary where each entry lists the following information for each lawyer:
    • role: the role that the advocate plays (e.g., “Argued for the petitioner”), as listed by Oyez; “inferred” if no role is listed.

    • side: the side that the advocate is on: 0 for respondent, 1 for petitioner, 2 for amicus curiae (NOTE that we currently do not differentiate between which side the amicus was supporting), 3 for unknown, None for unknown or inaudible speakers (see below, Utterance-level information). If no role is listed in Oyez, this is inferred via some heuristics (documentation forthcoming).

  • votes_side: a dictionary where each entry lists how each justice voted in the case in which the session occurred: 1 for the petitioner and 0 for the respondent. -1 if vote information was not provided or was otherwise unclear.

  • win_side: 1 if the case (in which the session occurred) was decided favorably for the petitioner, 0 if it wasn’t; 2 if the decision was unclear, and -1 if this information was unavailable.

See the description of the case metadata file below for further details on votes_side and win_side.
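As a rough illustration of the votes_side encoding (1 for the petitioner, 0 for the respondent, -1 when unavailable or unclear), the sketch below tallies votes from a made-up metadata dictionary. The justice IDs, the SIDE_LABELS mapping and the tally_votes helper are hypothetical names introduced here, not part of the corpus or of ConvoKit.

```python
from collections import Counter

# Labels for the documented votes_side codes.
SIDE_LABELS = {1: "petitioner", 0: "respondent", -1: "unavailable"}

def tally_votes(votes_side):
    """Count how many justices fall under each votes_side code."""
    counts = Counter(SIDE_LABELS.get(code, "other") for code in votes_side.values())
    return dict(counts)

# Made-up conversation metadata with hypothetical justice IDs.
meta = {
    "votes_side": {"j__a": 1, "j__b": 1, "j__c": 0, "j__d": -1},
    "win_side": 1,
}
tally = tally_votes(meta["votes_side"])
print(tally)
```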

Utterance-level information

For each utterance, we provide:

  • id

  • text. Oyez seems to separate different sentences into different paragraphs to facilitate its audio-to-text matching; we’ve retained this segmentation in the data, where sentences are separated by newlines.

  • speaker. Note that some utterances have “<INAUDIBLE>” speakers, corresponding to turns listed in the Oyez transcripts without any speaker information, where an interjection was audible but the identity of the speaker couldn’t be discerned.

  • conversation_id

  • case_id: the ID of the case in which the oral argument took place.

  • speaker_type: whether the speaker is a justice J, advocate A, or unknown/inaudible U.

  • side: the speaker’s side (see above, Conversation-level information, and note that this is sometimes inferred from the data if not explicitly listed)

  • start_times: the timestamp (as listed in Oyez) of when each sentence in the text starts. There is one entry per sentence, corresponding to newlines in the text.

  • stop_times: the timestamp of when each sentence ends.

  • timestamp: the timestamp of the first sentence in the utterance.

  • reply_to: the ID of the preceding utterance.
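Since text holds one sentence per line and start_times and stop_times hold one entry per sentence, the three fields can be aligned by position. The sketch below works on a made-up utterance stored as a plain dictionary and assumes the timestamps are numbers in seconds, which the description above does not state; sentence_spans is a hypothetical helper name.

```python
def sentence_spans(text, start_times, stop_times):
    """Pair each newline-separated sentence with its start and stop time."""
    sentences = text.split("\n")
    if not (len(sentences) == len(start_times) == len(stop_times)):
        raise ValueError("expected one start and one stop time per sentence")
    return list(zip(sentences, start_times, stop_times))

# Made-up utterance; timestamps are assumed to be seconds.
utterance = {
    "text": "We will hear argument next in this case.\nMr. Smith, you may proceed.",
    "start_times": [0.0, 2.5],
    "stop_times": [2.5, 6.0],
}
spans = sentence_spans(
    utterance["text"], utterance["start_times"], utterance["stop_times"]
)
durations = [stop - start for _, start, stop in spans]
print(durations)
```

The length check is there because a mismatch between the number of sentences and the number of timestamps would otherwise be silently truncated by zip.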

The dataset also comes with the following processed fields, which can be loaded separately via corpus.load_info('utterance', [list of fields]):

  • parsed: dependency parse of each utterance

  • arcs: dependency parse arcs for each utterance

  • tokens: processed tokens of each utterance

Case information

This file is a list of json objects containing some information about each case, pulled from Oyez and SCDB.

  • id: generally formatted as <year of case>_<docket no>

  • year

  • title: the name of the case

  • petitioner: the name of the petitioner

  • respondent: the name of the respondent

  • docket_no: the docket number of the case, as listed in Oyez.

  • scdb_docket_id: the docket ID of the case, as listed in SCDB.

  • citation: the citation of the case from the United States Reports. Note that there appear to be some missing entries and some duplicates.

  • url: the url of the Oyez listing

  • court: the court that saw the case (corresponding to a particular roster of justices)

  • decided_date: the date the case was decided, according to Oyez

  • win_side: whether the petitioning party won; also included in the corpus. See the corresponding listing in SCDB for details. -1 if no information available.

  • win_side_detail: finer-grained label of case outcome. See the corresponding listing in SCDB for details. -1 if no information available.

  • advocates: the advocates participating in the case.

  • adv_sides_inferred: While most Oyez transcripts explicitly list advocates and their roles, some don’t, so we fill this information in via a set of heuristics. This field is True if at least one advocate had information that was filled in this way.

  • votes: a dictionary mapping each justice to whether they voted with the majority or dissented. See the corresponding listing in SCDB for details. -1 if no information available.

  • votes_detail: a dictionary mapping each justice to their vote in the case. See the corresponding listing in SCDB for details. -1 if no information available.

  • votes_side: a dictionary mapping each justice to whether they voted for the petitioning party, derived from the win_side and votes_detail information. -1 if no information available; in particular, note that if the vote was equally divided, we cannot infer which side the justice voted for. Also included in the corpus.

  • transcripts: a list of transcript names, URLs and IDs (corresponding to the IDs of conversations in the corpus).
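The case metadata can be processed with the standard json module. The sketch below assumes the file stores one JSON object per line, which is a guess about the layout rather than something stated above, and it reads from an in-memory string of made-up records instead of the real file. The helper names load_cases and split_case_id are hypothetical.

```python
import json

# Made-up records that reuse a few of the documented field names.
sample_file_text = "\n".join([
    json.dumps({"id": "1990_89-1", "year": 1990, "win_side": 1}),
    json.dumps({"id": "1991_90-7", "year": 1991, "win_side": 0}),
    json.dumps({"id": "1992_91-3", "year": 1992, "win_side": -1}),
])

def load_cases(text):
    """Parse one JSON object per non-empty line (assumed layout)."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def split_case_id(case_id):
    """Split '<year>_<docket no>' at the first underscore."""
    year, _, docket_no = case_id.partition("_")
    return year, docket_no

cases = load_cases(sample_file_text)
# Keep only cases with a clear outcome (win_side of 0 or 1).
decided = [case for case in cases if case["win_side"] in (0, 1)]
petitioner_share = sum(case["win_side"] == 1 for case in decided) / len(decided)
print(split_case_id(cases[0]["id"]), petitioner_share)
```

Because IDs are only "generally" formatted as year and docket number, split_case_id returns an empty docket string for an ID without an underscore rather than raising an error.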

Citation and other versions

This corpus extends a smaller dataset of oral arguments that we previously released together with Echoes of power: Language effects and power differences in social interaction. Cristian Danescu-Niculescu-Mizil, Bo Pang, Lillian Lee and Jon Kleinberg. WWW 2012. Please cite the Echoes of Power paper if you use either version of the corpus. If you use the ConvoKit version, please additionally cite: ConvoKit: A Toolkit for the Analysis of Conversations. Jonathan P. Chang, Caleb Chiam, Liye Fu, Andrew Wang, Justine Zhang, Cristian Danescu-Niculescu-Mizil. Proceedings of SIGDIAL. 2020.

Contact

Please email any questions to: jz727@cornell.edu (Justine Zhang).