Wikipedia Talk Pages Corpus

A collection of conversations from Wikipedia editors' talk pages, with metadata.

Distributed together with: Echoes of power: Language effects and power differences in social interaction. Cristian Danescu-Niculescu-Mizil, Lillian Lee, Bo Pang, and Jon Kleinberg. WWW 2012.

Dataset details

Speaker-level information

Speakers in this dataset are Wikipedia editors; their account names are used as the speaker names. Additional information includes:

  • is-admin: whether the speaker is an admin

  • edit-count: total number of edits the speaker has made

Utterance-level information

For each utterance, we provide:

  • id: index of the utterance

  • speaker: the speaker who authored the utterance

  • conversation_id: id of the first utterance in the conversation this utterance belongs to

  • reply_to: id of the utterance to which this utterance replies (None if the utterance is not a reply)

  • timestamp: time of the utterance

  • text: textual content of the utterance
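The conversation_id and reply_to fields listed above are enough to recover the reply structure of each conversation. The following is a minimal sketch of that idea in plain Python. The utterances, speaker names, and id values are hypothetical toy data, not taken from the corpus, and the helper functions are illustrative rather than part of ConvoKit.

```python
from collections import defaultdict

# Hypothetical toy utterances using the field names documented above.
# Real ids and texts in the corpus will differ.
utterances = [
    {"id": "100", "speaker": "Alice", "conversation_id": "100", "reply_to": None, "text": "Should we merge these sections?"},
    {"id": "101", "speaker": "Bob", "conversation_id": "100", "reply_to": "100", "text": "I think so."},
    {"id": "102", "speaker": "Carol", "conversation_id": "100", "reply_to": "100", "text": "I disagree."},
    {"id": "103", "speaker": "Alice", "conversation_id": "100", "reply_to": "102", "text": "Why?"},
    {"id": "200", "speaker": "Bob", "conversation_id": "200", "reply_to": None, "text": "New topic."},
]


def group_by_conversation(utts):
    """Group utterances by the id of the conversation's first utterance."""
    conversations = defaultdict(list)
    for utt in utts:
        conversations[utt["conversation_id"]].append(utt)
    return dict(conversations)


def build_reply_index(utts):
    """Map each utterance id to the ids of its direct replies."""
    children = defaultdict(list)
    for utt in utts:
        if utt["reply_to"] is not None:
            children[utt["reply_to"]].append(utt["id"])
    return dict(children)


def thread_depth(utt_id, by_id):
    """Count how many reply_to links separate an utterance from its root."""
    depth = 0
    parent = by_id[utt_id]["reply_to"]
    while parent is not None:
        depth += 1
        parent = by_id[parent]["reply_to"]
    return depth


conversations = group_by_conversation(utterances)
replies = build_reply_index(utterances)
by_id = {u["id"]: u for u in utterances}

# Print each conversation with replies indented by their depth.
for conv_id, utts in conversations.items():
    print(f"conversation {conv_id}:")
    for utt in utts:
        indent = "  " * (thread_depth(utt["id"], by_id) + 1)
        print(f"{indent}{utt['speaker']}: {utt['text']}")
```

In this toy data, the first utterance of each conversation has reply_to set to None and an id equal to its conversation_id, matching the field descriptions above.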

Metadata for each utterance includes:

  • is-admin: whether the utterance is from an admin

The dataset also comes with the following processed fields, which can be loaded separately via corpus.load_info('utterance', [list of fields]):

  • parsed: SpaCy dependency parse

  • arcs_censored: dependency parse arcs, without nouns

Usage

To download directly with ConvoKit:

>>> from convokit import Corpus, download
>>> corpus = Corpus(filename=download("wiki-corpus"))

For some quick stats:

>>> corpus.print_summary_stats()
Number of Speakers: 38462
Number of Utterances: 391294
Number of Conversations: 125292
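Beyond the summary statistics, the is-admin utterance metadata can be used for simple aggregate counts. The sketch below computes the share of utterances written by admins. It assumes that is-admin is stored as a truthy or falsy value, which is not confirmed here, and it relies only on iter_utterances() and the meta mapping of each utterance. So that the example runs without downloading the data, it uses a hypothetical stand-in object (toy_corpus) that mimics just those two attributes.

```python
from types import SimpleNamespace


def admin_utterance_share(corpus):
    """Return the fraction of utterances whose 'is-admin' metadata is truthy."""
    total = 0
    admin = 0
    for utt in corpus.iter_utterances():
        total += 1
        # Assumption: 'is-admin' holds a truthy value for admin utterances.
        if utt.meta["is-admin"]:
            admin += 1
    return admin / total if total else 0.0


def _fake_utt(is_admin):
    """Hypothetical stand-in exposing only the .meta mapping used above."""
    return SimpleNamespace(meta={"is-admin": is_admin})


# Hypothetical stand-in for a loaded corpus; not real data.
toy_corpus = SimpleNamespace(
    iter_utterances=lambda: iter(
        [_fake_utt(True), _fake_utt(False), _fake_utt(False), _fake_utt(True)]
    )
)

share = admin_utterance_share(toy_corpus)
print(f"admin share: {share:.2f}")
```

With the real data, the corpus object loaded above would be passed in place of toy_corpus.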

Additional notes

Data License

This dataset is governed by the CC BY-SA license v4.0.

Contact

Please email any questions to: cristian@cs.cornell.edu (Cristian Danescu-Niculescu-Mizil)