Corpus¶
-
class
convokit.model.corpus.
Corpus
(filename: Optional[str] = None, utterances: Optional[List[convokit.model.utterance.Utterance]] = None, db_collection_prefix: Optional[str] = None, db_host: Optional[str] = None, preload_vectors: List[str] = None, utterance_start_index: int = None, utterance_end_index: int = None, merge_lines: bool = False, exclude_utterance_meta: Optional[List[str]] = None, exclude_conversation_meta: Optional[List[str]] = None, exclude_speaker_meta: Optional[List[str]] = None, exclude_overall_meta: Optional[List[str]] = None, disable_type_check=True, backend: Optional[str] = None, backend_mapper: Optional[convokit.model.backendMapper.BackendMapper] = None)¶ Represents a dataset, which can be loaded from a folder or constructed from a list of utterances.
- Parameters
filename – Path to a folder containing a Corpus or to an utterances.jsonl / utterances.json file to load
utterances – list of utterances to initialize Corpus from
db_collection_prefix – if a db backend is used, this determines how the database will be named. If not specified, a random name will be used.
db_host – if specified, and a db backend is used, connect to the database at this URL. If not specified, will default to the db_host in the ConvoKit global configuration file.
preload_vectors – list of names of vectors to be preloaded from directory; by default, no vectors are preloaded, but they can be loaded at any time after corpus initialization (i.e. vectors are lazy-loaded).
utterance_start_index – if loading from directory and the corpus folder contains utterances.jsonl, specify the line number (zero-indexed) to begin parsing utterances from
utterance_end_index – if loading from directory and the corpus folder contains utterances.jsonl, specify the line number (zero-indexed) of the last utterance to be parsed.
merge_lines – whether to merge adjacent lines from same speaker if multiple consecutive utterances belong to the same conversation.
exclude_utterance_meta – utterance metadata to be ignored
exclude_conversation_meta – conversation metadata to be ignored
exclude_speaker_meta – speaker metadata to be ignored
exclude_overall_meta – overall metadata to be ignored
disable_type_check – whether to skip type checking when loading the Corpus from a directory. Type checking ensures that the ConvoKitIndex is initialized correctly. However, it may be unnecessary if the index.json is already accurate, and disabling it allows for a faster corpus load. This parameter is set to True by default, i.e. type checking is not carried out.
backend – the backend type, either “mem” or “db”; defaults to “mem”.
backend_mapper – (advanced usage only) if provided, use this as the BackendMapper instance instead of initializing a new one.
- Variables
meta_index – index of Corpus metadata
vectors – the vectors stored in the Corpus
corpus_dirpath – path to the directory the corpus was loaded from
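The utterance_start_index and utterance_end_index parameters describe a zero-indexed line range of utterances.jsonl, with the end index naming the last utterance parsed. A minimal standard-library sketch of that slicing follows. It is not ConvoKit's loader: `read_jsonl_slice` is a hypothetical helper, and the inclusive end index is an assumption read off the parameter descriptions above.

```python
import io
import json
from itertools import islice


def read_jsonl_slice(lines, start_index=None, end_index=None):
    """Parse JSON lines from start_index through end_index (zero-indexed, inclusive).

    Hypothetical helper for illustration; not part of ConvoKit.
    """
    start = 0 if start_index is None else start_index
    # islice's stop is exclusive, so add 1 to keep the last requested line.
    stop = None if end_index is None else end_index + 1
    return [json.loads(line) for line in islice(lines, start, stop)]


# Five fake utterance records standing in for utterances.jsonl.
fake_file = io.StringIO("\n".join(json.dumps({"id": f"u{i}"}) for i in range(5)))
subset = read_jsonl_slice(fake_file, start_index=1, end_index=3)
print([u["id"] for u in subset])
```

Leaving both indices as None parses every line, matching the default of loading the whole file.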
-
add_utterances
(utterances=typing.List[convokit.model.utterance.Utterance], warnings: bool = False, with_checks=True)¶ Add utterances to the Corpus.
If the Corpus has utterances that share an id with an utterance in the input utterance list, optional warnings will be printed:
- if the utterances with the same id do not share the same data (the added utterance is ignored)
- if the added utterances’ metadata have the same key but different values (the added utterance’s metadata will overwrite)
- Parameters
utterances – Utterances to be added to the Corpus
warnings – set to True for warnings to be printed
with_checks – set to True if checks on utterance and metadata overlaps are desired. Set to False if newly added utterances are guaranteed to be new and share the same set of metadata keys.
- Returns
a Corpus with the utterances from this Corpus and the input utterances combined
-
append_vector_matrix
(matrix: convokit.model.convoKitMatrix.ConvoKitMatrix)¶ Adds an already constructed ConvoKitMatrix to the Corpus.
- Parameters
matrix – a ConvoKitMatrix object
- Returns
None
-
delete_metadata
(obj_type: str, attribute: str)¶ Delete a specified metadata attribute from all Corpus components of the specified object type.
Note that cancelling this method before it runs to completion may lead to errors in the Corpus.
- Parameters
obj_type – ‘utterance’, ‘conversation’, ‘speaker’
attribute – name of metadata attribute
- Returns
None
-
delete_vector_matrix
(name)¶ Deletes the vector matrix stored under name.
- Parameters
name – name of the vector matrix
- Returns
None
-
directed_pairwise_exchanges
(selector: Optional[Callable[[convokit.model.speaker.Speaker, convokit.model.speaker.Speaker], bool]] = <function Corpus.<lambda>>, speaker_ids_only: bool = False) → Dict[Tuple, List[convokit.model.utterance.Utterance]]¶ Get all directed pairwise exchanges in the dataset.
- Parameters
selector – optional function that takes in a speaking speaker and a replied-to speaker and returns True to include the pair in the result, or False otherwise.
speaker_ids_only (bool) – if True, index conversations by speaker ids rather than Speaker objects.
- Returns
Dictionary mapping (speaker, target) tuples to a list of utterances given by the speaker in reply to the target.
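The returned mapping can be pictured with a small standard-library sketch. This is an illustration under stated assumptions, not ConvoKit's implementation: `Utt` is a hypothetical stand-in for an utterance with `speaker` and `reply_to` fields, and speakers are plain strings.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utt:
    """Hypothetical stand-in for an utterance; not ConvoKit's Utterance class."""
    id: str
    speaker: str
    reply_to: Optional[str]


def pairwise_exchanges(utterances, selector=lambda speaker, target: True):
    """Map (speaker, target) to the utterances speaker wrote in reply to target."""
    by_id = {u.id: u for u in utterances}
    exchanges = defaultdict(list)
    for utt in utterances:
        parent = by_id.get(utt.reply_to)
        if parent is None:
            continue  # conversation roots and replies to unknown ids have no target
        if selector(utt.speaker, parent.speaker):
            exchanges[(utt.speaker, parent.speaker)].append(utt)
    return dict(exchanges)


utts = [
    Utt("u0", "alice", None),
    Utt("u1", "bob", "u0"),
    Utt("u2", "alice", "u1"),
    Utt("u3", "bob", "u0"),
]
result = pairwise_exchanges(utts)
print({pair: [u.id for u in us] for pair, us in result.items()})
```

The pairs are directed: ("bob", "alice") and ("alice", "bob") are separate keys, and the keys alone correspond to the set of speaking pairs.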
-
dump
(name: str, base_path: Optional[str] = None, exclude_vectors: List[str] = None, force_version: int = None, overwrite_existing_corpus: bool = False, fields_to_skip=None) → None¶ Dumps the corpus and its metadata to disk. Optionally, set force_version to a desired integer version number, otherwise the version number is automatically incremented.
- Parameters
name – name of corpus
base_path – base directory to save corpus in (None to save to a default directory)
exclude_vectors – list of names of vector matrices to exclude from the dumping step. By default, all vector matrices that belong to the Corpus (whether loaded or not) are dumped.
force_version – version number to set for the dumped corpus
overwrite_existing_corpus – if True, save to the path you loaded the corpus from, overriding the original corpus.
fields_to_skip – a dictionary of {object type: list of metadata attributes to omit when writing to disk}. object types can be one of “speaker”, “utterance”, “conversation”, “corpus”.
-
dump_info
(obj_type, fields, dir_name=None)¶ Writes attributes of objects in a corpus to disk. This function, along with load_info, supports cases where a particular attribute is to be stored separately from the other corpus files, for organization or efficiency. These attributes will not be read when the corpus is initialized; rather, they can be loaded on demand using load_info.
For each attribute with name <NAME>, will write to a file called info.<NAME>.jsonl, where rows are json-serialized dictionaries structured as {“id”: id of object, “value”: value of attribute}.
- Parameters
obj_type – type of object the attribute is associated with. can be one of “utterance”, “speaker”, “conversation”.
fields – a list of names of attributes to write to disk.
dir_name – the directory to write attributes to. By default, or if set to None, will write to the directory that the Corpus was loaded from.
- Returns
None
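The info.<NAME>.jsonl layout described above can be reproduced with the standard library alone. The sketch below is a hypothetical writer, not ConvoKit code; `write_info_file` and the `tokens` attribute are made-up names used only to show the row format.

```python
import json
import os
import tempfile


def write_info_file(dir_name, field, values_by_id):
    """Write one attribute to info.<field>.jsonl, one {"id", "value"} row per object."""
    path = os.path.join(dir_name, f"info.{field}.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for obj_id, value in values_by_id.items():
            f.write(json.dumps({"id": obj_id, "value": value}) + "\n")
    return path


with tempfile.TemporaryDirectory() as tmp:
    path = write_info_file(tmp, "tokens", {"u0": ["hi", "there"], "u1": ["bye"]})
    with open(path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f]
    file_name = os.path.basename(path)

print(file_name, rows)
```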
-
filter_conversations_by
(selector: Callable[[convokit.model.conversation.Conversation], bool])¶ Mutate the corpus by filtering for a subset of Conversations within the Corpus.
- Parameters
selector – function for selecting which Conversations to keep
- Returns
the mutated Corpus
-
static
filter_utterances
(source_corpus: convokit.model.corpus.Corpus, selector: Callable[[convokit.model.utterance.Utterance], bool])¶ Returns a new corpus that includes only a subset of Utterances from the source Corpus. This filtering provides no guarantees with regard to maintaining conversational integrity and should be used with care.
Vectors are not preserved. The source corpus will be invalidated and will no longer be usable.
- Parameters
source_corpus – the Corpus to subset from
selector – function for selecting which Utterances to keep
- Returns
a new Corpus with a subset of the Utterances
-
static
from_pandas
(utterances_df: pandas.DataFrame, speakers_df: Optional[pandas.DataFrame] = None, conversations_df: Optional[pandas.DataFrame] = None) → convokit.model.corpus.Corpus¶ Generates a Corpus from utterances, speakers, and conversations dataframes. For each dataframe, if the ‘id’ column is absent, the dataframe index will be used as the id. Metadata should be denoted with a ‘meta.<key>’ column in the dataframe. For example, if an utterance is to have a metadata key ‘score’, then the ‘meta.score’ column must be present in dataframe.
speakers_df and conversations_df are optional, as their IDs can be inferred from utterances_df, and so their main purpose is to hold speaker / conversation metadata. They should only be included if there exists metadata for the speakers / conversations respectively.
Metadata values that are not basic Python data structures (i.e. lists, dicts, tuples) may be included in the dataframes but may lead to unexpected behavior, depending on how pandas serializes / deserializes those values. Note that as metadata can be added to the Corpus after it is constructed, there is no need to include all metadata keys in the dataframe if it would be inconvenient.
- Parameters
utterances_df – utterances data in a pandas DataFrame, all primary data fields expected, with metadata optional
speakers_df – (optional) speakers data in a pandas DataFrame
conversations_df – (optional) conversations data in a pandas DataFrame
- Returns
Corpus constructed from the dataframe(s)
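The column conventions above (‘meta.<key>’ columns hold metadata, and the index stands in for a missing ‘id’ column) can be illustrated on a single row without pandas. This is a rough sketch: `split_row` is a hypothetical helper, not part of ConvoKit, and a plain dict stands in for a DataFrame row.

```python
def split_row(row, index_value):
    """Split one flat row into primary fields and a metadata dict.

    Columns named 'meta.<key>' become metadata entries; if 'id' is absent,
    the row's index value is used as the id. Hypothetical helper, not ConvoKit code.
    """
    prefix = "meta."
    meta = {k[len(prefix):]: v for k, v in row.items() if k.startswith(prefix)}
    fields = {k: v for k, v in row.items() if not k.startswith(prefix)}
    fields.setdefault("id", index_value)
    return fields, meta


row = {"speaker": "alice", "text": "hello", "meta.score": 3}
fields, meta = split_row(row, index_value="u0")
print(fields, meta)
```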
-
get_attribute_table
(obj_type, attrs)¶ returns a DataFrame, indexed by the IDs of objects of obj_type, containing attributes of these objects.
- Parameters
obj_type – the type of object to get attributes for. can be ‘utterance’, ‘speaker’ or ‘conversation’.
attrs – a list of names of attributes to get.
- Returns
a Pandas DataFrame of attributes.
-
get_conversation
(convo_id: str) → convokit.model.conversation.Conversation¶ Gets Conversation of the specified id from the corpus
- Parameters
convo_id – id of Conversation
- Returns
Conversation
-
get_conversation_ids
(selector: Optional[Callable[[convokit.model.conversation.Conversation], bool]] = <function Corpus.<lambda>>) → List[str]¶ Get a list of ids of Conversations in the Corpus, with an optional selector that filters for Conversations that should be included
- Parameters
selector – a (lambda) function that takes a Conversation and returns True or False (i.e. include / exclude). By default, the selector includes all Conversations in the Corpus.
- Returns
list of Conversation ids
-
get_conversations_dataframe
(selector: Optional[Callable[[convokit.model.conversation.Conversation], bool]] = <function Corpus.<lambda>>, exclude_meta: bool = False)¶ Get a DataFrame of the conversations with fields and metadata attributes, with an optional selector that filters for conversations that should be included. Edits to the DataFrame do not change the corpus in any way.
- Parameters
exclude_meta – whether to exclude metadata
selector – a (lambda) function that takes a Conversation and returns True or False (i.e. include / exclude). By default, the selector includes all Conversations in the Corpus.
- Returns
a pandas DataFrame
-
get_full_attribute_table
(speaker_convo_attrs, speaker_attrs=None, convo_attrs=None, speaker_suffix='__speaker', convo_suffix='__convo')¶ Returns a table where each row lists a (speaker, convo) level aggregate for each attribute in speaker_convo_attrs, along with speaker-level and conversation-level attributes; by default these attributes are suffixed with ‘__speaker’ and ‘__convo’ respectively.
- Parameters
speaker_convo_attrs – list of (speaker, convo) attribute names
speaker_attrs – list of speaker attribute names
convo_attrs – list of conversation attribute names
speaker_suffix – suffix to append to names of speaker-level attributes
convo_suffix – suffix to append to names of conversation-level attributes.
- Returns
DataFrame containing all attributes.
-
get_object
(obj_type: str, oid: str)¶ General Corpus object getter. Gets Speaker / Utterance / Conversation of specified id from the Corpus
- Parameters
obj_type – “speaker”, “utterance”, or “conversation”
oid – object id
- Returns
Corpus object of specified object type with specified object id
-
get_object_ids
(obj_type: str, selector: Callable[[Union[convokit.model.speaker.Speaker, convokit.model.utterance.Utterance, convokit.model.conversation.Conversation]], bool] = <function Corpus.<lambda>>)¶ Get a list of ids of Corpus objects of the specified type in the Corpus, with an optional selector that filters for objects that should be included
- Parameters
obj_type – “speaker”, “utterance”, or “conversation”
selector – a (lambda) function that takes a Corpus object and returns True or False (i.e. include / exclude). By default, the selector includes all objects of the specified type in the Corpus.
- Returns
list of Corpus object ids
-
get_speaker
(speaker_id: str) → convokit.model.speaker.Speaker¶ Gets Speaker of the specified id from the corpus
- Parameters
speaker_id – id of Speaker
- Returns
Speaker
-
get_speaker_convo_attribute_table
(attrs)¶ Returns a table where each row lists a (speaker, convo) level aggregate for each attribute in attrs.
- Parameters
attrs – list of (speaker, convo) attribute names
- Returns
DataFrame containing all (speaker, convo) attributes.
-
get_speaker_convo_info
(speaker_id, convo_id, key=None)¶ Retrieves speaker-conversation attribute key for speaker_id in conversation convo_id.
- Parameters
speaker_id – speaker
convo_id – conversation
key – name of attribute. if None, will return all attributes for that speaker-conversation.
- Returns
attribute value
-
get_speaker_ids
(selector: Optional[Callable[[convokit.model.speaker.Speaker], bool]] = <function Corpus.<lambda>>) → List[str]¶ Get a list of ids of Speakers in the Corpus, with an optional selector that filters for Speakers that should be included
- Parameters
selector – a (lambda) function that takes a Speaker and returns True or False (i.e. include / exclude). By default, the selector includes all Speakers in the Corpus.
- Returns
list of Speaker ids
-
get_speakers_dataframe
(selector: Optional[Callable[[convokit.model.speaker.Speaker], bool]] = <function Corpus.<lambda>>, exclude_meta: bool = False)¶ Get a DataFrame of the Speakers with fields and metadata attributes, with an optional selector that filters Speakers that should be included. Edits to the DataFrame do not change the corpus in any way.
- Parameters
exclude_meta – whether to exclude metadata
selector – a (lambda) function that takes a Speaker and returns True or False (i.e. include / exclude). By default, the selector includes all Speakers in the Corpus.
- Returns
a pandas DataFrame
-
get_utterance
(utt_id: str) → convokit.model.utterance.Utterance¶ Gets Utterance of the specified id from the corpus
- Parameters
utt_id – id of Utterance
- Returns
Utterance
-
get_utterance_ids
(selector: Optional[Callable[[convokit.model.utterance.Utterance], bool]] = <function Corpus.<lambda>>) → List[str]¶ Get a list of ids of Utterances in the Corpus, with an optional selector that filters for Utterances that should be included
- Parameters
selector – a (lambda) function that takes an Utterance and returns True or False (i.e. include / exclude). By default, the selector includes all Utterances in the Corpus.
- Returns
list of Utterance ids
-
get_utterances_dataframe
(selector: Optional[Callable[[convokit.model.utterance.Utterance], bool]] = <function Corpus.<lambda>>, exclude_meta: bool = False)¶ Get a DataFrame of the utterances with fields and metadata attributes, with an optional selector that filters utterances that should be included. Edits to the DataFrame do not change the corpus in any way.
- Parameters
exclude_meta – whether to exclude metadata
selector – a (lambda) function that takes an Utterance and returns True or False (i.e. include / exclude). By default, the selector includes all Utterances in the Corpus.
- Returns
a pandas DataFrame
-
get_vector_matrix
(name)¶ Gets the ConvoKitMatrix stored in the corpus as name. Returns None if no such matrix exists.
- Parameters
name – name of the vector matrix
- Returns
a ConvoKitMatrix object
-
get_vectors
(name, ids: Optional[List[str]] = None, columns: Optional[List[str]] = None, as_dataframe: bool = False)¶ Get the vectors for some corpus component objects.
- Parameters
name – name of the vector matrix
ids – optional list of object ids to get vectors for; all by default
columns – optional list of named columns of the vector to include; all by default
as_dataframe – whether to return the vector as a dataframe (True) or in its raw array form (False). False by default.
- Returns
a vector matrix (either np.ndarray or csr_matrix) or a pandas dataframe
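The ids and columns arguments amount to a row selection and a column selection on the stored matrix. Below is a simplified sketch using plain nested lists. The real matrix is a numpy or scipy array wrapped in a ConvoKitMatrix, so `select_vectors` and its arguments are illustrative assumptions rather than the library's API.

```python
def select_vectors(matrix, row_ids, column_names, ids=None, columns=None):
    """Return the rows for `ids` and the columns for `columns` (all by default).

    Hypothetical helper: `matrix` is a list of rows, `row_ids` names each row,
    and `column_names` names each column.
    """
    row_pos = {obj_id: i for i, obj_id in enumerate(row_ids)}
    col_pos = {name: j for j, name in enumerate(column_names)}
    rows = [row_pos[i] for i in (row_ids if ids is None else ids)]
    cols = [col_pos[c] for c in (column_names if columns is None else columns)]
    return [[matrix[r][c] for c in cols] for r in rows]


matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
row_ids = ["u0", "u1", "u2"]
column_names = ["a", "b", "c"]

picked = select_vectors(matrix, row_ids, column_names, ids=["u2", "u0"], columns=["b"])
print(picked)
```

In this sketch the output rows follow the order of the requested ids, not the storage order.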
-
has_conversation
(convo_id: str) → bool¶ Checks if a Conversation of the specified id exists in the Corpus
- Parameters
convo_id – id of Conversation
- Returns
True if Conversation of specified id is present, False otherwise
-
has_speaker
(speaker_id: str) → bool¶ Checks if a Speaker of the specified id exists in the Corpus
- Parameters
speaker_id – id of Speaker
- Returns
True if Speaker of specified id is present, False otherwise
-
has_utterance
(utt_id: str) → bool¶ Checks if an Utterance of the specified id exists in the Corpus
- Parameters
utt_id – id of Utterance
- Returns
True if Utterance of specified id is present, False otherwise
-
iter_conversations
(selector: Optional[Callable[[convokit.model.conversation.Conversation], bool]] = <function Corpus.<lambda>>) → Generator[convokit.model.conversation.Conversation, None, None]¶ Get conversations in the Corpus, with an optional selector that filters for Conversations that should be included
- Parameters
selector – a (lambda) function that takes a Conversation and returns True or False (i.e. include / exclude). By default, the selector includes all Conversations in the Corpus.
- Returns
a generator of Conversations
-
iter_objs
(obj_type: str, selector: Callable[[Union[convokit.model.speaker.Speaker, convokit.model.utterance.Utterance, convokit.model.conversation.Conversation]], bool] = <function Corpus.<lambda>>)¶ Get Corpus objects of specified type from the Corpus, with an optional selector that filters for Corpus objects that should be included
- Parameters
obj_type – “speaker”, “utterance”, or “conversation”
selector – a (lambda) function that takes a Corpus object and returns True or False (i.e. include / exclude). By default, the selector includes all objects of the specified type in the Corpus.
- Returns
a generator of Corpus objects of the specified type
-
iter_speakers
(selector: Optional[Callable[[convokit.model.speaker.Speaker], bool]] = <function Corpus.<lambda>>) → Generator[convokit.model.speaker.Speaker, None, None]¶ Get Speakers in the Corpus, with an optional selector that filters for Speakers that should be included
- Parameters
selector – a (lambda) function that takes a Speaker and returns True or False (i.e. include / exclude). By default, the selector includes all Speakers in the Corpus.
- Returns
a generator of Speakers
-
iter_utterances
(selector: Optional[Callable[[convokit.model.utterance.Utterance], bool]] = <function Corpus.<lambda>>) → Generator[convokit.model.utterance.Utterance, None, None]¶ Get utterances in the Corpus, with an optional selector that filters for Utterances that should be included.
- Parameters
selector – a (lambda) function that takes an Utterance and returns True or False (i.e. include / exclude). By default, the selector includes all Utterances in the Corpus.
- Returns
a generator of Utterances
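The get_*_ids and iter_* methods share one pattern: a selector that defaults to accepting everything, applied lazily. The sketch below is a generic standard-library illustration of that pattern, with dicts standing in for Corpus objects and `iter_items` as a hypothetical name. It is not ConvoKit's own code.

```python
def iter_items(items, selector=lambda item: True):
    """Yield only the items the selector accepts; the default accepts all."""
    for item in items:
        if selector(item):
            yield item


# Dicts stand in for Utterance objects in this sketch.
utterances = [
    {"id": "u0", "text": "hi"},
    {"id": "u1", "text": "a much longer utterance"},
]

all_ids = [u["id"] for u in iter_items(utterances)]
long_ids = [u["id"] for u in iter_items(utterances, lambda u: len(u["text"].split()) > 2)]
print(all_ids, long_ids)
```

Because the result is a generator, nothing is filtered until it is consumed, which keeps a single pass over a large corpus cheap.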
-
load_info
(obj_type, fields=None, dir_name=None)¶ loads attributes of objects in a corpus from disk. This function, along with dump_info, supports cases where a particular attribute is to be stored separately from the other corpus files, for organization or efficiency. These attributes will not be read when the corpus is initialized; rather, they can be loaded on-demand using this function.
For each attribute with name <NAME>, will read from a file called info.<NAME>.jsonl, and load each attribute value into the respective object’s .meta field.
- Parameters
obj_type – type of object the attribute is associated with. can be one of “utterance”, “speaker”, “conversation”.
fields – a list of names of attributes to load. if empty, will load all attributes stored in the specified directory dir_name.
dir_name – the directory to read attributes from. by default, or if set to None, will read from the directory that the Corpus was loaded from.
- Returns
None
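The loading step can be sketched as reading each info.<NAME>.jsonl file and copying values into the matching object's metadata. The helper below is hypothetical and uses dicts in place of Corpus objects. Skipping ids that are not in the corpus is an assumption of this sketch, not documented behavior.

```python
import json
import os
import tempfile


def load_info_files(objects, dir_name, fields=None):
    """Load info.<NAME>.jsonl files into each object's 'meta' dict.

    `objects` maps object id -> {"meta": {...}}. If `fields` is empty, every
    info.*.jsonl file found in dir_name is loaded. Hypothetical helper.
    """
    if not fields:
        fields = [
            name[len("info."):-len(".jsonl")]
            for name in sorted(os.listdir(dir_name))
            if name.startswith("info.") and name.endswith(".jsonl")
        ]
    for field in fields:
        with open(os.path.join(dir_name, f"info.{field}.jsonl"), encoding="utf-8") as f:
            for line in f:
                row = json.loads(line)
                if row["id"] in objects:  # assumption: unknown ids are skipped
                    objects[row["id"]]["meta"][field] = row["value"]
    return fields


objects = {"u0": {"meta": {}}, "u1": {"meta": {}}}
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "info.n_tokens.jsonl"), "w", encoding="utf-8") as f:
        f.write(json.dumps({"id": "u0", "value": 2}) + "\n")
        f.write(json.dumps({"id": "u1", "value": 1}) + "\n")
    loaded = load_info_files(objects, tmp)

print(loaded, objects)
```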
-
static
merge
(primary: convokit.model.corpus.Corpus, secondary: convokit.model.corpus.Corpus, warnings: bool = True)¶ Merges two corpora (one primary and one secondary), creating a new Corpus with their combined data.
Utterances with the same id must share the same data. In case of conflicts, the primary Corpus will take precedence and the conflicting Utterance from secondary will be ignored. A warning is printed when this happens.
If metadata of the primary Corpus (or its conversations / utterances) shares a key with the metadata of the secondary Corpus, the secondary’s metadata (or its conversations / utterances) values will be used. A warning is printed when this happens.
Will invalidate primary and secondary in the process.
The resulting Corpus will inherit the primary Corpus’s id and version number.
- Parameters
primary – the primary Corpus
secondary – the secondary Corpus
warnings – print warnings when data conflicts are encountered
- Returns
new Corpus constructed from combined lists of utterances
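The two precedence rules point in opposite directions: the primary wins for conflicting utterance data, while the secondary wins for shared metadata keys. The following sketch models them with plain dicts. It is an illustration only, not ConvoKit's merge. `merge_sketch` is a hypothetical name, and warning only when the metadata values actually differ is an assumption.

```python
def merge_sketch(primary, secondary, warnings=True):
    """Combine two {"utterances": {id: data}, "meta": {...}} dicts.

    Mirrors the precedence rules described above; not ConvoKit's implementation.
    """
    messages = []

    # Utterances: on conflicting data for the same id, keep the primary's copy.
    utterances = dict(primary["utterances"])
    for utt_id, data in secondary["utterances"].items():
        if utt_id in utterances and utterances[utt_id] != data:
            messages.append(f"utterance {utt_id}: conflicting data, keeping primary")
            continue
        utterances[utt_id] = data

    # Metadata: on a shared key, the secondary's value is used.
    meta = dict(primary["meta"])
    for key, value in secondary["meta"].items():
        if key in meta and meta[key] != value:
            messages.append(f"meta {key!r}: conflicting values, using secondary")
        meta[key] = value

    if warnings:
        for message in messages:
            print("WARNING:", message)
    return {"utterances": utterances, "meta": meta}


primary = {"utterances": {"u0": "hello", "u1": "hi"}, "meta": {"name": "A", "lang": "en"}}
secondary = {"utterances": {"u1": "HI!", "u2": "bye"}, "meta": {"name": "B"}}
merged = merge_sketch(primary, secondary)
print(merged)
```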
-
organize_speaker_convo_history
(utterance_filter=None)¶ For each speaker, pre-computes a list of all of their utterances, organized by the conversation they participated in. Annotates speaker with the following:
- n_convos: number of conversations
- start_time: time of first utterance, across all conversations
- conversations: a dictionary keyed by conversation id, where entries consist of:
  - idx: the index of the conversation, in terms of the time of the first utterance contributed by that particular speaker (i.e., idx=0 means this is the first conversation the speaker ever participated in)
  - n_utterances: the number of utterances the speaker contributed in the conversation
  - start_time: the timestamp of the speaker’s first utterance in the conversation
  - utterance_ids: a list of ids of utterances contributed by the speaker, ordered by timestamp.
In case timestamps are not provided with utterances, the present behavior is to sort just by utterance id.
- Parameters
utterance_filter – function that returns True for an utterance that counts towards a speaker having participated in that conversation. (e.g., one could filter out conversations where the speaker contributed fewer than k words per utterance)
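The annotation structure listed above can be built in a few lines of plain Python. This sketch assumes every utterance carries a timestamp and represents utterances as dicts. `speaker_convo_history` is a hypothetical function, not ConvoKit's implementation.

```python
from collections import defaultdict


def speaker_convo_history(utterances, utterance_filter=None):
    """Build per-speaker conversation histories from utterance dicts.

    Each utterance is {"id", "speaker", "conversation_id", "timestamp"}.
    """
    if utterance_filter is None:
        utterance_filter = lambda utt: True

    # speaker -> conversation id -> that speaker's utterances in it
    grouped = defaultdict(lambda: defaultdict(list))
    for utt in utterances:
        if utterance_filter(utt):
            grouped[utt["speaker"]][utt["conversation_id"]].append(utt)

    history = {}
    for speaker, convos in grouped.items():
        entries = {}
        for convo_id, utts in convos.items():
            utts.sort(key=lambda u: u["timestamp"])
            entries[convo_id] = {
                "n_utterances": len(utts),
                "start_time": utts[0]["timestamp"],
                "utterance_ids": [u["id"] for u in utts],
            }
        # idx orders conversations by the speaker's first utterance in each.
        ordered = sorted(entries, key=lambda c: entries[c]["start_time"])
        for idx, convo_id in enumerate(ordered):
            entries[convo_id]["idx"] = idx
        history[speaker] = {
            "n_convos": len(entries),
            "start_time": min(e["start_time"] for e in entries.values()),
            "conversations": entries,
        }
    return history


utts = [
    {"id": "u0", "speaker": "alice", "conversation_id": "c1", "timestamp": 20},
    {"id": "u1", "speaker": "alice", "conversation_id": "c0", "timestamp": 5},
    {"id": "u2", "speaker": "alice", "conversation_id": "c1", "timestamp": 10},
    {"id": "u3", "speaker": "bob", "conversation_id": "c1", "timestamp": 15},
]
history = speaker_convo_history(utts)
print(history["alice"])
```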
-
print_summary_stats
() → None¶ Helper function for printing the number of Speakers, Utterances, and Conversations in this Corpus
- Returns
None
-
random_conversation
() → convokit.model.conversation.Conversation¶ Get a random Conversation from the Corpus
- Returns
a random Conversation
-
random_speaker
() → convokit.model.speaker.Speaker¶ Get a random Speaker from the Corpus
- Returns
a random Speaker
-
random_utterance
() → convokit.model.utterance.Utterance¶ Get a random Utterance from the Corpus
- Returns
a random Utterance
-
classmethod
reconnect_to_db
(db_collection_prefix: str, db_host: Optional[str] = None)¶ Factory method for a Corpus instance backed by an already-existing database (e.g., one that was created in a previous run of a Python script or interactive session).
This can be used to reconnect to existing Corpus data that you still want to use without having to reload the data from the source file; this can happen for example if your script crashed in the middle of working with the Corpus and you want to resume where you left off.
-
static
reindex_conversations
(source_corpus: convokit.model.corpus.Corpus, new_convo_roots: List[str], preserve_corpus_meta: bool = True, preserve_convo_meta: bool = True, verbose=True) → convokit.model.corpus.Corpus¶ Generates a new Corpus from source Corpus with specified list of utterance ids to use as conversation ids.
The subtrees denoted by these utterance ids should be distinct and should not overlap, otherwise there may be unexpected behavior.
Vectors are not preserved. The source Corpus will be invalidated and no longer usable.
- Parameters
source_corpus – the Corpus containing the original data to select from
new_convo_roots – List of utterance ids to use as conversation ids
preserve_corpus_meta – set as True to copy original Corpus metadata to new Corpus
preserve_convo_meta – set as True to copy original Conversation metadata to new Conversation metadata (For each new conversation, use the metadata of the conversation that the utterance belonged to.)
verbose – whether to print a warning when
- Returns
new Corpus with reindexed Conversations
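Reindexing can be thought of as walking the reply tree under each new root and labeling every utterance reached with that root's id. The sketch below shows the idea on a plain reply-to mapping. `reindex_sketch` is a hypothetical helper, and dropping utterances outside the selected subtrees is an assumption of the sketch, not a statement about ConvoKit's behavior.

```python
from collections import defaultdict, deque


def reindex_sketch(reply_to, new_convo_roots):
    """Assign each utterance under a new root to a conversation named after that root.

    `reply_to` maps utterance id -> parent utterance id (None for original roots).
    Returns {utterance id: new conversation id}; utterances outside every
    selected subtree are dropped. Hypothetical helper, not ConvoKit code.
    """
    children = defaultdict(list)
    for utt_id, parent in reply_to.items():
        if parent is not None:
            children[parent].append(utt_id)

    new_convo_of = {}
    for root in new_convo_roots:
        queue = deque([root])
        while queue:  # breadth-first walk of the subtree under root
            utt_id = queue.popleft()
            new_convo_of[utt_id] = root
            queue.extend(children[utt_id])
    return new_convo_of


# u0 is the original root with two reply branches: u1 -> u3 and u2 -> u4.
reply_to = {"u0": None, "u1": "u0", "u2": "u0", "u3": "u1", "u4": "u2"}
new_convos = reindex_sketch(reply_to, ["u1", "u2"])
print(new_convos)
```

If the chosen subtrees overlapped, a later root would silently relabel utterances already claimed by an earlier one, which is one way the unexpected behavior mentioned above could arise.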
-
reinitialize_index
()¶ Reinitialize the Corpus Index from scratch.
- Returns
None (sets the .meta_index of Corpus and of the corpus component objects)
-
set_speaker_convo_info
(speaker_id, convo_id, key, value)¶ assigns speaker-conversation attribute key with value to speaker speaker_id in conversation convo_id.
- Parameters
speaker_id – speaker
convo_id – conversation
key – name of attribute
value – value of attribute
- Returns
None
-
set_vector_matrix
(name: str, matrix, ids: List[str] = None, columns: List[str] = None)¶ Adds a vector matrix to the Corpus, where the matrix is an array of vector representations of some set of Corpus components (i.e. Utterances, Conversations, Speakers).
A ConvoKitMatrix object is initialized from the arguments and stored in the Corpus.
- Parameters
name – descriptive name for the matrix
matrix – numpy or scipy array matrix
ids – optional list of Corpus component object ids, where each id corresponds to each row of the matrix
columns – optional list of names for the columns of the matrix
- Returns
None
-
speaking_pairs
(selector: Optional[Callable[[convokit.model.speaker.Speaker, convokit.model.speaker.Speaker], bool]] = <function Corpus.<lambda>>, speaker_ids_only: bool = False) → Set[Tuple[str, str]]¶ Get all directed speaking pairs (a, b) of speakers such that a replies to b at least once in the dataset.
- Parameters
selector – optional function that takes in a Speaker and a replied-to Speaker and returns True to include the pair in the result, or False otherwise.
speaker_ids_only (bool) – if True, return just pairs of speaker ids rather than Speaker objects.
- Returns
Set containing all speaking pairs selected by the selector function, or all speaking pairs in the dataset if no selector function was used.
-
update_speakers_data
() → None¶ Updates the conversation and utterance lists of every Speaker in the Corpus
- Returns
None