This toolkit contains tools to extract conversational features and analyze social phenomena in conversations. Several large conversational datasets are included together with scripts exemplifying the use of the toolkit on these datasets.
The toolkit currently implements features for:
Linguistic coordination, a measure of linguistic influence (and relative power) between individuals or groups based on their use of function words (see the Echoes of Power paper). Example script exploring the balance of power in the US Supreme Court.
Politeness strategies, a set of lexical and parse-based features correlating with politeness and impoliteness (see the A computational approach to politeness paper). Example script for understanding the (mis)use of politeness strategies in conversations gone awry on Wikipedia.
Question typology, an unsupervised method for extracting surface motifs that recur in questions, and for grouping them according to their latent rhetorical role (see the Asking too much paper). Example scripts for extracting common question types in the UK parliament, on Wikipedia edit pages, and in sports interviews.
Conversational prompts, an unsupervised method for extracting types of conversational prompts (see the Conversations gone awry paper). Example script for understanding the use of conversational prompts in conversations gone awry on Wikipedia.
Hypergraph conversation representation (beta), a method for extracting structural features of conversations through a hypergraph representation (see the Patterns of Participant Interactions paper). Example script demonstrating hypergraph creation, feature extraction, visualization, and interpretation.
Coming soon: basic message and turn features (currently available here), and constructive conversations.
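To make the coordination idea above concrete: for one marker category, the measure is the probability that a reply uses a marker given that the message it answers used it, minus the baseline probability that a reply uses it. The sketch below is a simplified, self-contained illustration of that difference-of-probabilities idea; the function name and the (reply, parent) input layout are assumptions for illustration, not the toolkit's API or its actual implementation.

```python
# Simplified sketch of linguistic coordination for ONE marker category.
# Illustrative only -- NOT ConvoKit's implementation or API.

def coordination(exchanges, markers):
    """exchanges: list of (reply_text, parent_text) pairs.
    markers: set of function words forming one marker category.
    Returns P(reply uses a marker | parent used one) - P(reply uses a marker)."""
    def uses(text):
        return any(tok in markers for tok in text.lower().split())

    # Baseline: how often replies use the marker at all.
    reply_uses = [uses(reply) for reply, _ in exchanges]
    baseline = sum(reply_uses) / len(exchanges)

    # Conditional: how often replies use it when the parent message did.
    triggered = [r for r, (_, parent) in zip(reply_uses, exchanges)
                 if uses(parent)]
    conditional = sum(triggered) / len(triggered)
    return conditional - baseline

exchanges = [
    ("the cat", "the dog"),   # parent uses "the", reply echoes it
    ("the fish", "the dog"),  # parent uses "the", reply echoes it
    ("a cat", "a dog"),
    ("a bird", "a dog"),
]
print(coordination(exchanges, {"the"}))  # 1.0 - 0.5 = 0.5
```

A positive value suggests the replier coordinates toward the other speaker's use of that marker category; the toolkit computes this across many marker categories and speaker pairs.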
These datasets are included for ready use with the toolkit:
Conversations Gone Awry Corpus: a collection of conversations from Wikipedia talk pages that derail into personal attacks (1,270 conversations, 6,963 comments)
Tennis Corpus: transcripts of tennis singles post-match press conferences for major tournaments from 2007 to 2015 (6,467 post-match press conferences)
Wikipedia Talk Pages Corpus: a collection of conversations from Wikipedia editors' talk pages
Supreme Court Corpus: a collection of conversations from U.S. Supreme Court oral arguments
Parliament Corpus: parliamentary question periods from May 1979 to December 2016 (216,894 question-answer pairs)
Reddit Conversations Corpus (beta): 99,145 Reddit conversations sampled from 100 subreddits
These datasets can be downloaded using the convokit.download() helper function. Alternatively, you can access them directly here.
To use the toolkit with your own dataset, the data needs to be in a standard JSON format.
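As a rough illustration of preparing such a file, a conversation can be serialized as a list of utterance records with Python's standard json module. The field names below (id, user, root, reply-to, timestamp, text) are assumptions chosen for illustration; consult the toolkit documentation for the exact schema that convokit.Corpus expects.

```python
import json

# Hypothetical utterance records for a two-message conversation.
# Field names are illustrative assumptions, not the toolkit's verified schema.
utterances = [
    {"id": "u1", "user": "alice", "root": "u1", "reply-to": None,
     "timestamp": 1500000000, "text": "Could you take a look at this edit?"},
    {"id": "u2", "user": "bob", "root": "u1", "reply-to": "u1",
     "timestamp": 1500000060, "text": "Sure, thanks for flagging it."},
]

with open("my-corpus.json", "w") as f:
    json.dump(utterances, f, indent=2)

# Round-trip check: the file parses back into the same records.
with open("my-corpus.json") as f:
    loaded = json.load(f)
print(loaded == utterances)  # True
```

The reply-to links are what let a flat list of utterances be reassembled into conversation trees.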
This toolkit requires Python 3.
pip3 install convokit
python3 -m spacy download en
Alternatively, visit our GitHub page to install from source.
See the example IPython notebooks linked above to familiarize yourself with the different modules of the toolkit. The basic process is:
import convokit into your Python 3 project.
corpus = convokit.Corpus(filename=...); use your own corpus or one of the corpora provided with the toolkit.
ps = convokit.PolitenessStrategies(corpus) extracts the politeness strategies used in all the conversations.
Documentation is hosted here.
The documentation is built with Sphinx (pip3 install sphinx). To build it yourself, navigate to doc/ and run make html.
Andrew Wang (email@example.com) wrote the Coordination code and the respective example script, wrote the helper functions, and designed the structure of the toolkit.
Ishaan Jhaveri (firstname.lastname@example.org) refactored the Question Typology code and wrote the respective example scripts.
Jonathan Chang (email@example.com) wrote the example script for Conversations Gone Awry.