Data Format¶
ConvoKit expects and saves each corpus with the following basic structure, which closely mirrors the intuitions behind the design of Corpus (see Core Concepts).
corpus_directory
|-- utterances.jsonl
|-- speakers.json
|-- conversations.json
|-- corpus.json
|-- index.json
This corpus can be loaded with:
corpus = Corpus(filename="corpus_directory")
Note that end users do not need to create these files manually. ConvoKit provides the functionality to dump a Corpus object, saving it in the required format.
At a high level, a custom dataset can be converted to a list of utterances (custom_utterance_list) and saved in ConvoKit format for reuse by:
>>> corpus = Corpus(utterances = custom_utterance_list)
>>> corpus.dump("custom_dataset", base_path="./") # dump to local directory
A more detailed example of how the Cornell Movie–Dialogs Corpus may be converted from its original release form to ConvoKit format can be found here.
Details of component files¶
utterances.jsonl¶
Each utterance is stored on its own line and represented as a json object, with six mandatory fields:
id: index of the utterance
speaker: the speaker who authored the utterance
conversation_id: id of the first utterance in the conversation this utterance belongs to
reply_to: index of the utterance to which this utterance replies (None if the utterance is not a reply)
timestamp: time of the utterance
text: textual content of the utterance
Optionally, additional information can be included, depending on the characteristics of the dataset and its intended use, as:
meta: dictionary of utterance metadata
utterances.jsonl contains a list of such utterances. An example utterance is shown below, drawn from the Supreme Court corpus:
{'id': '200', 'speaker': 'mr. srinivasan', 'conversation_id': '145', 'reply_to': '199', 'timestamp': None, 'text': 'It -- it does.', 'meta': {'case': '02-1472', 'side': 'respondent'}}
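Since each line is an independent JSON object, the file can be produced and checked with standard tooling. Below is a minimal sketch (plain Python with the stdlib json module, no ConvoKit required; the single utterance reuses the example above, and the filename is illustrative) that writes one utterance per line and verifies the six mandatory fields on read-back:

```python
import json

MANDATORY = ("id", "speaker", "conversation_id", "reply_to", "timestamp", "text")

# One utterance per line, each serialized as a JSON object.
utterances = [
    {"id": "200", "speaker": "mr. srinivasan", "conversation_id": "145",
     "reply_to": "199", "timestamp": None, "text": "It -- it does.",
     "meta": {"case": "02-1472", "side": "respondent"}},
]

with open("utterances.jsonl", "w") as f:
    for utt in utterances:
        f.write(json.dumps(utt) + "\n")

# Read back and confirm every line carries the six mandatory fields.
with open("utterances.jsonl") as f:
    loaded = [json.loads(line) for line in f]
assert all(all(key in utt for key in MANDATORY) for utt in loaded)
```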
speakers.json¶
Speakers are identified by speaker names. speakers.json keeps a dictionary whose keys are speaker names and whose values are the metadata associated with each speaker. Providing speaker metadata is optional.
An example speaker-metadata pair is shown below, again, drawn from the Supreme Court corpus:
'mr. srinivasan': {'is-justice': False, 'gender': 'male'}
conversations.json¶
Similarly, conversations.json keeps a dictionary where keys are conversation ids and values are conversation-level metadata (i.e., additional information that stays invariant throughout the conversation).
An example conversation index-metadata pair is shown below, adapted from the conversations gone awry corpus:
"236755381.13326.13326": {"page_title": "user talk: Entropy", "conversation_has_personal_attack": true}
Providing conversation-level metadata is optional; if no information is available, the file can simply contain an empty dictionary.
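Both files are plain JSON dictionaries keyed by speaker name or conversation id, so they can be generated directly. A minimal sketch using the stdlib json module (the metadata values are illustrative, taken from the examples above):

```python
import json

# Speaker metadata, keyed by speaker name (providing metadata is optional).
speakers = {"mr. srinivasan": {"is-justice": False, "gender": "male"}}

# Conversation metadata, keyed by conversation id; an empty dict ({}) is
# also valid when no conversation-level metadata is available.
conversations = {"145": {"case": "02-1472"}}

with open("speakers.json", "w") as f:
    json.dump(speakers, f)
with open("conversations.json", "w") as f:
    json.dump(conversations, f)
```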
corpus.json¶
Metadata of the corpus is saved in corpus.json, as a dictionary where keys are names of the metadata, and values are the actual content of such metadata.
The contents of the corpus.json file for the Reddit corpus (small) are as follows:
{"subreddit": "reddit-corpus-small", "num_posts": 8286, "num_comments": 288846, "num_speakers": 119889}
index.json¶
To allow users the option of previewing the information available in a corpus without loading it entirely, ConvoKit requires an index.json file that describes all available metadata and their expected types.
There are five mandatory fields:
utterances-index: information about utterance-level metadata
speakers-index: information about speaker-level metadata
conversations-index: information about conversation-level metadata
overall-index: information about corpus-level metadata
version: version number of the corpus
As an example, the corpus-level metadata for the Reddit corpus (small) is shown below:
"overall-index": {"subreddit": "<class 'str'>", "num_posts": "<class 'int'>", "num_comments": "<class 'int'>", "num_speakers": "<class 'int'>"}
While not necessary, users experienced with handling json files can choose to convert their custom datasets directly, following the data format specifications above.