Wikipedia Talk Pages Corpus

A collection of conversations from Wikipedia editors' talk pages, with metadata.

Distributed together with: Echoes of power: Language effects and power differences in social interaction. Cristian Danescu-Niculescu-Mizil, Lillian Lee, Bo Pang, and Jon Kleinberg. WWW 2012.

Dataset details

Speakers in this dataset are Wikipedia editors; their account names are used as the speaker names. Additional information includes:

For each utterance, we provide:

id: index of the utterance
speaker: the speaker who authored the utterance
conversation_id: id of the first utterance in the conversation this utterance belongs to
reply_to: id of the utterance to which this utterance replies (None if the utterance is not a reply)
timestamp: time of the utterance
text: textual content of the utterance
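
The fields above are enough to reconstruct conversation threads. The following is a minimal sketch in plain Python, not ConvoKit's own implementation: the records, ids, speaker names, and integer timestamps are invented for illustration, and the helper names group_by_conversation and reply_chain are hypothetical. It shows how conversation_id groups utterances and how reply_to can be followed back to the first utterance of a conversation.

```python
from collections import defaultdict

# Hypothetical records mirroring the utterance fields listed above.
# All ids, speakers, timestamps, and text are invented for illustration.
utterances = [
    {"id": "u1", "speaker": "EditorA", "conversation_id": "u1",
     "reply_to": None, "timestamp": 100, "text": "Should we merge these articles?"},
    {"id": "u2", "speaker": "EditorB", "conversation_id": "u1",
     "reply_to": "u1", "timestamp": 160, "text": "I think so."},
    {"id": "u3", "speaker": "EditorA", "conversation_id": "u1",
     "reply_to": "u2", "timestamp": 220, "text": "Done."},
    {"id": "u4", "speaker": "EditorC", "conversation_id": "u4",
     "reply_to": None, "timestamp": 300, "text": "New topic: sourcing."},
]


def group_by_conversation(utts):
    """Map each conversation_id to its utterance ids, ordered by timestamp."""
    conversations = defaultdict(list)
    for utt in sorted(utts, key=lambda u: u["timestamp"]):
        conversations[utt["conversation_id"]].append(utt["id"])
    return dict(conversations)


def reply_chain(utts, utt_id):
    """Follow reply_to links from utt_id back to the conversation's first utterance."""
    by_id = {u["id"]: u for u in utts}
    chain = []
    current = utt_id
    while current is not None:
        chain.append(current)
        current = by_id[current]["reply_to"]
    return chain[::-1]  # oldest utterance first


conversations = group_by_conversation(utterances)
chain = reply_chain(utterances, "u3")
print(conversations)
print(chain)
```

Note that the conversation_id of each record equals the id of the first utterance in its thread, matching the field description above.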

Metadata for each utterance includes:

The dataset also comes with the following processed fields, which can be loaded separately via corpus.load_info('utterance', [list of fields]):

To download directly with ConvoKit:

>>> from convokit import Corpus, download
>>> corpus = Corpus(filename=download("wiki-corpus"))

For some quick stats:

>>> corpus.print_summary_stats()
Number of Speakers: 38462
Number of Utterances: 391294
Number of Conversations: 125292

This dataset is governed by the CC BY-SA license v4.0.

Please email any questions to: cristian@cs.cornell.edu (Cristian Danescu-Niculescu-Mizil)