Utility Functions

convokit.util.deprecation(prev_name: str, new_name: str, stacklevel: int = 3)

Emits a suppressible deprecation warning indicating that prev_name has been superseded by new_name.
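The `stacklevel` parameter suggests that the warning goes through Python's standard `warnings` module. Under that assumption, a caller could silence it with an ordinary warnings filter. The sketch below uses a hypothetical stand-in, `deprecation_sketch`, rather than the library function; the message wording and the `FutureWarning` category are assumptions, not taken from ConvoKit.

```python
import warnings


def deprecation_sketch(prev_name: str, new_name: str, stacklevel: int = 3) -> None:
    """Hypothetical stand-in for convokit.util.deprecation (not the library code)."""
    # Assumed message wording and warning category.
    warnings.warn(
        f"{prev_name} is deprecated. Use {new_name} instead.",
        category=FutureWarning,
        stacklevel=stacklevel,
    )


def count_warnings(suppress: bool) -> int:
    """Call the stand-in once and report how many warnings were recorded."""
    with warnings.catch_warnings(record=True) as caught:
        # "always" records every warning; "ignore" drops FutureWarning entirely.
        warnings.simplefilter("ignore" if suppress else "always", FutureWarning)
        deprecation_sketch("old_function()", "new_function()")
    return len(caught)
```

With the filter set to "ignore", `count_warnings(True)` records nothing, which is the sense in which such a warning is suppressible.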

convokit.util.download(name: str, verbose: bool = True, data_dir: str = None, use_newest_version: bool = True, use_local: bool = False) → str

Downloads a ConvoKit dataset by name, or reuses a previously saved copy.

  • name

    Which item to download. Currently supported:

    • "wiki-corpus": Wikipedia Talk Page Conversations Corpus

      A medium-size collection of conversations from Wikipedia editors’ talk pages. (see http://www.cs.cornell.edu/~cristian/Echoes_of_power.html)

    • "wikiconv-<year>": Wikipedia Talk Page Conversations Corpus

      Conversations data for the specified year.

    • "supreme-corpus": Supreme Court Dialogs Corpus

      A collection of conversations from the U.S. Supreme Court Oral Arguments. (see http://www.cs.cornell.edu/~cristian/Echoes_of_power.html)

    • "parliament-corpus": UK Parliament Question-Answer Corpus

      Parliamentary question periods from May 1979 to December 2016 (see http://www.cs.cornell.edu/~cristian/Asking_too_much.html)

    • "conversations-gone-awry-corpus": Wiki Personal Attacks Corpus

      Wikipedia talk page conversations that derail into personal attacks as labeled by crowdworkers (see http://www.cs.cornell.edu/~cristian/Conversations_gone_awry.html)

    • "conversations-gone-awry-cmv-corpus"

      Discussion threads on the subreddit ChangeMyView (CMV) that derail into rule-violating behavior (see http://www.cs.cornell.edu/~cristian/Conversations_gone_awry.html)

    • "movie-corpus": Cornell Movie-Dialogs Corpus

      A large metadata-rich collection of fictional conversations extracted from raw movie scripts. (see https://www.cs.cornell.edu/~cristian/Chameleons_in_imagined_conversations.html)

    • "tennis-corpus": Tennis post-match press conference transcripts

      Transcripts of tennis singles post-match press conferences for major tournaments from 2007 to 2015 (see http://www.cs.cornell.edu/~liye/tennis.html)

    • "reddit-corpus-small": Reddit Corpus (sampled)

      A sample from 100 highly-active subreddits

    • "subreddit-<subreddit-name>": Subreddit Corpus

      A corpus made from the given subreddit

    • "friends-corpus": Friends TV show Corpus

      A collection of all the conversations that occurred over 10 seasons of Friends, a popular American TV sitcom that aired from 1994 to 2004.

    • "switchboard-corpus": Switchboard Dialog Act Corpus

      A collection of 1,155 five-minute telephone conversations between two participants, annotated with speech act tags.

    • "persuasionforgood-corpus": Persuasion For Good Corpus

      A collection of online conversations where a persuader tries to convince a persuadee to donate to charity.

    • "iq2-corpus": Intelligence Squared Debates Corpus

      Transcripts of debates held as part of Intelligence Squared Debates.

    • "diplomacy-corpus": Deception in Diplomacy Corpus

      Dataset with intended and perceived deception labels in the negotiation-based game Diplomacy.

    • "reddit-coarse-discourse-corpus": Coarse Discourse Sequence Corpus

      Reddit dataset with utterances containing discourse act labels.

    • "chromium-corpus": Chromium Conversations Corpus

      A collection of almost 1.5 million conversations and 2.8 million comments posted by developers reviewing proposed code changes in the Chromium project.

    • "wikipedia-politeness-corpus": Wikipedia Politeness Corpus

      A corpus of politeness annotations on requests from Wikipedia talk pages.

    • "stack-exchange-politeness-corpus": Stack Exchange Politeness Corpus

      A corpus of politeness annotations on requests from Stack Exchange.

  • verbose – Print checkpoint statements for download

  • data_dir – Output path of downloaded file (default: ~/.convokit)

  • use_newest_version – Re-download if a newer version is found

  • use_local – If True, use the local copy of the corpus if one exists, regardless of whether a newer version is available


Returns the path to the downloaded item.
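Two of the supported names are patterns rather than fixed strings ("subreddit-<subreddit-name>" and "wikiconv-<year>"). The sketch below shows one way to build such names and pass them to `download` using only the parameters from the signature above. The helper names `subreddit_corpus_name`, `wikiconv_corpus_name`, and `fetch_corpus_path` are hypothetical, and the subreddit and year used later are illustrative values only.

```python
from typing import Optional


def subreddit_corpus_name(subreddit: str) -> str:
    """Build a name of the form "subreddit-<subreddit-name>"."""
    return f"subreddit-{subreddit}"


def wikiconv_corpus_name(year: int) -> str:
    """Build a name of the form "wikiconv-<year>"."""
    return f"wikiconv-{year}"


def fetch_corpus_path(name: str, data_dir: Optional[str] = None) -> str:
    """Download the named item (or reuse a saved copy) and return its path."""
    # Imported lazily so the name helpers work without convokit installed.
    from convokit.util import download

    # use_local=True prefers an existing local copy over re-downloading.
    return download(name, verbose=False, data_dir=data_dir, use_local=True)
```

For instance, `fetch_corpus_path(subreddit_corpus_name("Cornell"))` would request the item named "subreddit-Cornell" and return the path where it was saved; that call needs ConvoKit installed and, on first use, network access.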

convokit.util.download_local(name: str, data_dir: str)

Get path to a previously-downloaded local version of the corpus (which may be an older version).


name – Name of the Corpus


Returns the path to the local Corpus as a string.

convokit.util.subreddit_in_grouping(subreddit: str, grouping_key: str) → bool
  • subreddit – subreddit name

  • grouping_key – example: “askreddit~-~blackburn”


Returns True if the subreddit name falls within the grouping range.
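The example key “askreddit~-~blackburn” suggests two bounds joined by `~-~`, with membership decided by alphabetical order. The sketch below illustrates that reading with a hypothetical function, `in_grouping_sketch`; the inclusive bounds and the plain string comparison are assumptions, not a statement of how the library implements the check.

```python
def in_grouping_sketch(subreddit: str, grouping_key: str) -> bool:
    """Hypothetical range check for keys such as "askreddit~-~blackburn"."""
    # Assumed key format: "<first-name>~-~<last-name>".
    lower, upper = grouping_key.split("~-~")
    # Assumed semantics: inclusive lexicographic comparison of the names.
    return lower <= subreddit <= upper
```

Under these assumptions, "aww" sorts between "askreddit" and "blackburn" and so falls inside the range, while "cats" sorts after "blackburn" and falls outside it.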

convokit.util.warn(text: str)

Prepends a red-colored 'WARNING: ' to [text]. This is a printed warning and cannot be suppressed.


text – Warning message


The printed message has the form ‘WARNING: [text]’.
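A red prefix in terminal output is commonly produced with ANSI escape codes. The sketch below shows that technique with hypothetical helpers, `format_warning_sketch` and `warn_sketch`; the specific escape codes are an assumption about how the coloring might be done, not a description of the library's code.

```python
RED = "\033[91m"   # ANSI escape: bright red foreground
RESET = "\033[0m"  # ANSI escape: reset all attributes


def format_warning_sketch(text: str) -> str:
    """Return text with a red 'WARNING: ' prefix (hypothetical helper)."""
    return f"{RED}WARNING: {RESET}{text}"


def warn_sketch(text: str) -> None:
    """Print the message directly, so warnings filters cannot silence it."""
    print(format_warning_sketch(text))
```

Because the message goes straight to `print` rather than through the `warnings` module, filters such as `warnings.simplefilter("ignore")` have no effect on it.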