Question Typology

Implements unsupervised identification of rhetorical roles in questions, as described in this paper.

Example usage: extracting common question types in UK parliament

class convokit.questionTypology.questionTypology.QuestionTypology(num_clusters: int = 8, question_threshold: int = 100, answer_threshold: int = 100, num_dims: int = 100, verbose: int = 5000, dedup_threshold: int = 0.9, follow_conj: int = True, norm: str = 'l2', num_svds: int = 50, num_dims_to_inspect: int = 5, max_iter_for_k_means: int = 1000, remove_first: bool = False, min_support: int = 5, item_set_size: int = 5, leaves_only_for_assign: bool = True, idf: bool = False, snip: bool = True, leaves_only_for_extract: bool = False, random_seed: int = 0, is_question: Callable[[str], bool] = None, questions_only: bool = True, enforce_formatting: bool = True)

Encapsulates computation of question types from a question-answer corpus. Can be trained and evaluated on separate corpora.

Parameters
  • num_clusters – the number of question types to be extracted

  • question_threshold – the minimum number of questions a motif must occur in for it to be considered

  • answer_threshold – the minimum number of answers a motif must occur in for it to be considered

  • num_dims – the number of latent dimensions in the sparse matrix

  • verbose – False or 0 if nothing should be printed; otherwise, the interval (in completed steps) at which progress is printed for each part of the algorithm

  • dedup_threshold – If two motifs co-occur in a higher proportion of cases than this threshold, they are considered duplicates and one is removed

  • follow_conj – whether to follow conjunctions and treat subtrees as sentences too.

  • norm – the normalizer to use in the normalization of the sparse matrix

  • num_svds – the number of dimensions to preserve in the SVD

  • num_dims_to_inspect – the number of dimensions to inspect

  • max_iter_for_k_means – the maximum number of iterations to run the k-means algorithm for

  • remove_first – Whether to remove the first element in the k means classification set

  • min_support – the minimum number of times an itemset has to show up for the frequent itemset counter to consider it

  • item_set_size – the size of the item set

  • leaves_only_for_assign – whether to assign only sink motifs to clusters

  • idf – Whether to represent data using inverse document frequency

  • snip – Whether to increment the number of singular values and vectors to compute by one

  • leaves_only_for_extract – whether to include only sink motifs in extracted clusters

  • random_seed – the random seed to provide to the clustering algorithm

  • is_question – the function that will be used to determine whether an utterance is a question. If nothing is specified, by default the code assumes all sentences that end in ‘?’ are questions

  • questions_only – whether motif extraction should look only at utterances that are questions (as defined by is_question). Disable this to make the algorithm derive prompt types instead of question types.

  • enforce_formatting – whether to enforce that utterances must be well-formed sentences in order to count as questions or answers. Well-formedness is defined as starting with an uppercase letter. Enable this for corpora that are known to contain properly formatted utterances (e.g. Parliament corpus)
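The interaction between is_question and enforce_formatting can be illustrated with a minimal, standalone sketch. It follows only the behavior documented above (a default of "ends in '?'" and well-formedness as "starts with an uppercase letter"). The helper names ends_with_question_mark, is_well_formed, counts_as_question and wh_question are hypothetical and are not part of ConvoKit, and the whitespace stripping is an assumption; the library's internal checks may differ.

```python
from typing import Callable

# Hypothetical helpers for illustration only; none of these are ConvoKit functions.

def ends_with_question_mark(text: str) -> bool:
    """Mirror of the documented default: text ending in '?' is a question.
    Stripping trailing whitespace first is an assumption of this sketch."""
    return text.rstrip().endswith("?")

def is_well_formed(text: str) -> bool:
    """Documented well-formedness rule: the text starts with an uppercase letter."""
    stripped = text.lstrip()
    return bool(stripped) and stripped[0].isupper()

def counts_as_question(
    text: str,
    is_question: Callable[[str], bool] = ends_with_question_mark,
    enforce_formatting: bool = True,
) -> bool:
    """Apply the formatting check (if enabled) and then the question predicate."""
    if enforce_formatting and not is_well_formed(text):
        return False
    return is_question(text)

WH_WORDS = {"who", "what", "when", "where", "why", "how", "which"}

def wh_question(text: str) -> bool:
    """A custom predicate with the str -> bool shape expected by is_question:
    accept only questions that open with a wh-word."""
    words = text.strip().split()
    return bool(words) and words[0].lower() in WH_WORDS and text.strip().endswith("?")
```

A function with the same shape as wh_question is what the is_question argument expects, since its documented type is Callable[[str], bool].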

Variables
  • num_clusters – the number of question types to be extracted

  • mtx_obj – an object that contains information about the QA matrix from the paper

  • km – the Kmeans object that has the labels

  • types_to_data – an object that contains information about motifs, fragments and questions in each type

  • lq – the low dimensional Q matrix

  • a_u – the low dimensional A matrix

display_answer_fragments_for_type(cluster_num, num_egs=10)

Displays num_egs answer fragments whose corresponding question motifs were assigned to cluster cluster_num by the clustering algorithm

display_motifs_for_type(cluster_num: int, num_egs: int = 10)

Displays num_egs motifs that were assigned to cluster cluster_num by the clustering algorithm

static display_question_answer_pairs_for_type(corpus: convokit.model.corpus.Corpus, type_num: int, num_egs: int = 10)

Displays num_egs question-answer pairs in the given corpus that were assigned type type_num by the typing algorithm.

static display_questions_for_type(corpus: convokit.model.corpus.Corpus, type_num: int, num_egs: int = 10)

Displays num_egs questions in the given corpus that were assigned type type_num by the typing algorithm.

display_totals()

Displays the total number of questions, motifs and fragments present in this data, as well as the number of motifs in each cluster and questions of each type

fit(corpus: convokit.model.corpus.Corpus)

Extract question-answer pairs from the given corpus and use them to construct the internal matrix objects (in other words, “train” the QuestionTypology object on the given corpus)

Parameters

corpus (Corpus) – the Corpus to use for fitting the model

transform(corpus) → convokit.model.corpus.Corpus

Computes the distance to each question type cluster for some (possibly previously unseen) text.

Parameters

corpus (Corpus) – the Corpus to apply the fitted model to
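To make "distance to each question type cluster" concrete, here is a small standalone sketch that does not use ConvoKit. It assumes Euclidean distance between a low-dimensional question vector and each cluster centroid; the documentation above does not state the metric, and the function names distances_to_clusters and closest_cluster are hypothetical. How transform stores its results on the corpus is not shown here.

```python
import math
from typing import List, Sequence

def distances_to_clusters(
    vector: Sequence[float], centroids: Sequence[Sequence[float]]
) -> List[float]:
    """Return the Euclidean distance from `vector` to each centroid.
    Euclidean distance is an assumption of this sketch."""
    return [
        math.sqrt(sum((v - c) ** 2 for v, c in zip(vector, centroid)))
        for centroid in centroids
    ]

def closest_cluster(distances: Sequence[float]) -> int:
    """Index of the smallest distance, i.e. the nearest cluster."""
    return min(range(len(distances)), key=lambda i: distances[i])

# Toy data: two 2-dimensional centroids and one question vector.
toy_centroids = [(0.0, 0.0), (3.0, 4.0)]
toy_question = (3.0, 0.0)

toy_distances = distances_to_clusters(toy_question, toy_centroids)
toy_type = closest_cluster(toy_distances)
```

In this toy case the question vector lies 3.0 from the first centroid and 4.0 from the second, so it would be assigned to the first cluster (index 0).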