Question Typology

Implements unsupervised identification of rhetorical roles in questions, as described in this paper.

Example usage: extracting common question types in UK parliament

class convokit.questionTypology.questionTypology.QuestionTypology(num_clusters: int = 8, question_threshold: int = 100, answer_threshold: int = 100, num_dims: int = 100, verbose: int = 5000, dedup_threshold: int = 0.9, follow_conj: int = True, norm: str = 'l2', num_svds: int = 50, num_dims_to_inspect: int = 5, max_iter_for_k_means: int = 1000, remove_first: bool = False, min_support: int = 5, item_set_size: int = 5, leaves_only_for_assign: bool = True, idf: bool = False, snip: bool = True, leaves_only_for_extract: bool = False, random_seed: int = 0, is_question: Callable[[str], bool] = None, questions_only: bool = True, enforce_formatting: bool = True)

Encapsulates computation of question types from a question-answer corpus. Can be trained and evaluated on separate corpora.

Parameters

num_clusters – the number of question types to be extracted
question_threshold – the minimum number of questions a motif must occur in for it to be considered
answer_threshold – the minimum number of answers a motif must occur in for it to be considered
num_dims – the number of latent dimensions in the sparse matrix
verbose – False or 0 if nothing should be printed; otherwise, the interval (in completed steps) at which progress through any part of the algorithm is printed
dedup_threshold – If two motifs co-occur in a higher proportion of cases than this threshold, they are considered duplicates and one is removed
follow_conj – whether to follow conjunctions and treat subtrees as sentences too.
norm – the normalizer to use in the normalization of the sparse matrix
num_svds – the number of dimensions to preserve in the SVD
num_dims_to_inspect – the number of dimensions to inspect
max_iter_for_k_means – the maximum number of iterations to run the k-means algorithm for
remove_first – whether to remove the first element in the k-means classification set
min_support – the minimum number of times an itemset has to occur for the frequent itemset counter to consider it
item_set_size – the size of the item set
leaves_only_for_assign – whether to assign only sink motifs to clusters
idf – Whether to represent data using inverse document frequency
snip – Whether to increment the number of singular values and vectors to compute by one
leaves_only_for_extract – whether to include only sink motifs in extracted clusters
random_seed – the random seed to provide to the clustering algorithm
is_question – the function that will be used to determine whether an utterance is a question. If nothing is specified, by default the code assumes all sentences that end in ‘?’ are questions
questions_only – whether motif extraction should look only at utterances that are questions (as defined by is_question). Disable this to make the algorithm derive prompt types instead of question types.
enforce_formatting – whether to enforce that utterances must be well-formed sentences in order to count as questions or answers. Well-formedness is defined as starting with an uppercase letter. Enable this for corpora that are known to contain properly formatted utterances (e.g. Parliament corpus)
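
The is_question and enforce_formatting parameters can be illustrated with a minimal sketch. It does not use ConvoKit itself: the helper names (ends_with_question_mark, is_well_formed, make_question_filter) are hypothetical, and stripping surrounding whitespace before checking is an assumption rather than documented behaviour. It only mirrors the two rules stated above: a question ends in '?' by default, and a well-formed utterance starts with an uppercase letter.

```python
from typing import Callable, Optional


def ends_with_question_mark(text: str) -> bool:
    """Default-style rule from the parameter description: text ends in '?'.

    Trailing whitespace is ignored here; that detail is an assumption.
    """
    return text.rstrip().endswith("?")


def is_well_formed(text: str) -> bool:
    """Well-formedness as described for enforce_formatting:
    the utterance starts with an uppercase letter."""
    stripped = text.lstrip()
    return bool(stripped) and stripped[0].isupper()


def make_question_filter(
    is_question: Optional[Callable[[str], bool]] = None,
    enforce_formatting: bool = True,
) -> Callable[[str], bool]:
    """Hypothetical helper combining both checks into one predicate."""
    predicate = is_question if is_question is not None else ends_with_question_mark

    def keep(text: str) -> bool:
        if not predicate(text):
            return False
        # Only apply the uppercase rule when formatting is enforced.
        return (not enforce_formatting) or is_well_formed(text)

    return keep


if __name__ == "__main__":
    strict = make_question_filter()
    print(strict("Will the Minister respond?"))   # well-formed question
    print(strict("will the minister respond?"))   # rejected: lowercase start
```

A custom predicate of the same shape, Callable[[str], bool], is what the is_question parameter accepts; for informal corpora, disabling enforce_formatting avoids discarding lowercase utterances.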

Variables

num_clusters – the number of question types to be extracted
mtx_obj – an object that contains information about the QA matrix from the paper
km – the k-means object that holds the cluster labels
types_to_data – an object that contains information about motifs, fragments and questions in each type
lq – the low dimensional Q matrix
a_u – the low dimensional A matrix

display_answer_fragments_for_type(cluster_num, num_egs=10): Displays num_egs answer fragments whose corresponding question motifs were assigned to cluster cluster_num by the clustering algorithm.

display_motifs_for_type(cluster_num: int, num_egs: int = 10): Displays num_egs motifs that were assigned to cluster cluster_num by the clustering algorithm.

static display_question_answer_pairs_for_type(corpus: convokit.model.corpus.Corpus, type_num: int, num_egs: int = 10): Displays num_egs question-answer pairs in the given corpus that were assigned type type_num by the typing algorithm.

static display_questions_for_type(corpus: convokit.model.corpus.Corpus, type_num: int, num_egs: int = 10): Displays num_egs questions in the given corpus that were assigned type type_num by the typing algorithm.

display_totals(): Displays the total number of questions, motifs, and fragments present in this data, as well as the number of motifs in each cluster and the number of questions of each type.

fit(corpus: convokit.model.corpus.Corpus)

Extracts question-answer pairs from the given corpus and uses them to construct the internal matrix objects (in other words, “trains” the QuestionTypology object on the given corpus).

Parameters: corpus (Corpus) – the Corpus to use for fitting the model

transform(corpus) → convokit.model.corpus.Corpus

Computes the distance to each question type cluster for some (possibly previously unseen) text.

Parameters: corpus (Corpus) – the Corpus to apply the fitted model to
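
The idea of "distance to each question type cluster" can be sketched in plain Python. This is an illustration only, not the library's implementation: the function names, the use of Euclidean distance, and the toy two-dimensional vectors are all assumptions, and the sketch says nothing about how transform stores its results on the returned Corpus.

```python
import math
from typing import List, Sequence


def distances_to_centroids(
    vector: Sequence[float], centroids: Sequence[Sequence[float]]
) -> List[float]:
    """Euclidean distance from one low-dimensional vector to each centroid.

    Euclidean distance is an assumed metric, chosen for illustration.
    """
    return [
        math.sqrt(sum((v - c) ** 2 for v, c in zip(vector, centroid)))
        for centroid in centroids
    ]


def nearest_type(vector: Sequence[float], centroids: Sequence[Sequence[float]]) -> int:
    """Index of the closest centroid, i.e. the assigned type in this sketch."""
    dists = distances_to_centroids(vector, centroids)
    return min(range(len(dists)), key=dists.__getitem__)


if __name__ == "__main__":
    # Toy centroids standing in for num_clusters question types (hypothetical values).
    toy_centroids = [(0.0, 0.0), (3.0, 4.0)]
    question_vec = (0.0, 1.0)
    print(distances_to_centroids(question_vec, toy_centroids))
    print(nearest_type(question_vec, toy_centroids))
```

Keeping the full list of distances, rather than only the nearest index, lets downstream code judge how confidently a question falls into a type.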