CRAFT Model

A backend for Forecaster that implements the CRAFT algorithm from the EMNLP 2019 paper “Trouble on the Horizon: Forecasting the Derailment of Online Conversations as they Develop”.

CRAFT is a neural model based on a pre-train-then-fine-tune paradigm. Because the purpose of this class is to let CRAFT serve as a backend for Forecaster, it uses the author-provided, already-trained CRAFT instance. Training a new CRAFT instance from scratch is considered outside the scope of ConvoKit. Users interested in creating their own custom CRAFT models can instead consult the authors’ official implementation.

IMPORTANT NOTE: This implementation directly uses the author-provided CRAFT model that was used in the paper’s experiments. This model was developed separately from ConvoKit and uses its own tokenization scheme, which differs from ConvoKit’s default. Using ConvoKit’s tokenization could therefore result in tokens that are inconsistent with what the CRAFT model expects, leading to errors. ConvoKit ships with a workaround in the form of a special tokenizer, craft_tokenize, which implements the tokenization scheme used in the CRAFT model. Users of this class should therefore always use craft_tokenize in place of ConvoKit’s default tokenization. See the CRAFT demo notebook for an example of how to do this.
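To make the tokenization concern concrete, the following toy sketch shows how two tokenizers can disagree about the same sentence, so that tokens from one are missing from a vocabulary built with the other. Everything in it is hypothetical: the vocabulary, both tokenizer functions, and the sentence are invented for illustration and are not ConvoKit’s default tokenizer, craft_tokenize, or the real CRAFT vocabulary.

```python
import re

# Toy vocabulary standing in for a model's word-to-index file (invented contents).
TOY_WORD2INDEX = {"UNK": 0, "do": 1, "n't": 2, "revert": 3, "my": 4, "edit": 5, ".": 6}


def whitespace_tokenize(text):
    """A naive scheme: lowercase, then split on whitespace."""
    return text.lower().split()


def toy_model_tokenize(text):
    """Hypothetical scheme the toy vocabulary was built with:
    splits off the clitic "n't" and separates punctuation."""
    text = text.lower().replace("n't", " n't")
    return re.findall(r"n't|\w+|[^\w\s]", text)


def unknown_tokens(tokens, word2index):
    """Return the tokens that the vocabulary does not contain."""
    return [t for t in tokens if t not in word2index]


sentence = "Don't revert my edit."
mismatched = unknown_tokens(whitespace_tokenize(sentence), TOY_WORD2INDEX)
matched = unknown_tokens(toy_model_tokenize(sentence), TOY_WORD2INDEX)
```

With the mismatched tokenizer, two of the four tokens fall outside the toy vocabulary; with the matching one, none do. This is the kind of inconsistency that using craft_tokenize is meant to avoid.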

class convokit.forecaster.CRAFTModel.CRAFTModel(initial_weights: str, vocab_index2word: str = 'auto', vocab_word2index: str = 'auto', decision_threshold: Union[float, str] = 'auto', torch_device: str = 'cpu', config: dict = {'batch_size': 64, 'clip': 50.0, 'dropout': 0.1, 'finetune_epochs': 30, 'learning_rate': 1e-05, 'print_every': 10, 'validation_size': 0.2})

A ConvoKit Forecaster-adherent reimplementation of the CRAFT conversational forecasting model from the paper “Trouble on the Horizon: Forecasting the Derailment of Online Conversations as they Develop” (Chang and Danescu-Niculescu-Mizil, 2019).

Usage note: CRAFT is a neural network model; full end-to-end training of neural networks is considered outside the scope of ConvoKit, so the ConvoKit CRAFTModel must be initialized with existing weights. ConvoKit provides weights for the CGA-WIKI and CGA-CMV corpora. If you just want to run a fully-trained CRAFT model on those corpora (i.e., only transform, no fit), you can use the finetuned weights (craft-wiki-finetuned and craft-cmv-finetuned, respectively). If you want to take a pretrained model and finetune it on your own data (i.e., both fit and transform), you can use the pretrained weights (craft-wiki-pretrained and craft-cmv-pretrained, respectively), which provide trained versions of the underlying utterance and conversation encoder layers but leave the classification layers at their random initializations so that they can be fitted to your data.
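The choice of weights described in the usage note can be summarized as a small lookup. This is only a sketch: the helper pick_initial_weights and its argument names are hypothetical and not part of ConvoKit; only the four weight names come from the text above.

```python
def pick_initial_weights(corpus_family, will_finetune):
    """Return the provided-weights name that matches a workflow.

    corpus_family: "wiki" (CGA-WIKI) or "cmv" (CGA-CMV)
    will_finetune: True if you plan to call fit before transform,
                   False if you only plan to call transform
    """
    if corpus_family not in ("wiki", "cmv"):
        raise ValueError("provided weights exist only for 'wiki' and 'cmv'")
    # Pretrained weights leave the classification layers untrained, so they
    # are the starting point for fitting; finetuned weights are ready to run.
    stage = "pretrained" if will_finetune else "finetuned"
    return f"craft-{corpus_family}-{stage}"
```

For example, running the ready-made model on CGA-WIKI corresponds to pick_initial_weights("wiki", False), while fine-tuning on your own data from the CMV encoders corresponds to pick_initial_weights("cmv", True).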

Parameters
  • initial_weights – Specifies where to find the saved model to be loaded to initialize CRAFT. To use ConvoKit’s provided models, use “craft-wiki-pretrained” for the model pretrained on Wikipedia data, or “craft-wiki-finetuned” for the model already fine-tuned on CGA-WIKI. Replace “wiki” with “cmv” for the Reddit CMV equivalents. Alternatively, if you have a custom model you want to use, you can pass in the full path to the saved PyTorch checkpoint file.

  • vocab_index2word – File containing the mapping from vocabulary index to raw string tokens. If you are using a provided model, you MUST leave this as the default value of “auto” (other values will be ignored and overridden to “auto”). Conversely, if using a custom model, you CANNOT leave this as “auto” and you must provide a full path to the vocabulary file that you made for your custom model.

  • vocab_word2index – File containing the mapping from raw string tokens to vocabulary index. If you are using a provided model, you MUST leave this as the default value of “auto” (other values will be ignored and overridden to “auto”). Conversely, if using a custom model, you CANNOT leave this as “auto” and you must provide a full path to the vocabulary file that you made for your custom model.

  • decision_threshold – Output probability beyond which a forecast should be considered “positive”/“True”. Highly recommended to leave this at “auto”, which will use published values for the provided models, or 0.5 for custom models.

  • torch_device – “cpu” or “cuda” (for GPUs). If you have access to a GPU it is strongly recommended to set this to “cuda”; the default is “cpu” only for compatibility with non-GPU setups.

  • config – Allows overwriting of CRAFT hyperparameters. Strongly recommended to keep this at default unless you know what you’re doing!
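The parameter rules above can be collected into a helper that assembles constructor arguments. This is a minimal sketch under stated assumptions: build_craft_kwargs, PROVIDED_MODELS, and config_overrides are hypothetical names, not ConvoKit API. The default hyperparameters are copied from the constructor signature. Because this page does not say whether a partial config dictionary is merged with the defaults, the sketch always produces a complete one.

```python
# Defaults copied from the CRAFTModel constructor signature shown above.
DEFAULT_CONFIG = {
    "batch_size": 64,
    "clip": 50.0,
    "dropout": 0.1,
    "finetune_epochs": 30,
    "learning_rate": 1e-05,
    "print_every": 10,
    "validation_size": 0.2,
}

# Names of the weights that the documentation says ConvoKit provides.
PROVIDED_MODELS = {
    "craft-wiki-pretrained",
    "craft-wiki-finetuned",
    "craft-cmv-pretrained",
    "craft-cmv-finetuned",
}


def build_craft_kwargs(initial_weights, vocab_index2word="auto",
                       vocab_word2index="auto", torch_device="cpu",
                       config_overrides=None):
    """Assemble keyword arguments following the documented parameter rules."""
    overrides = dict(config_overrides or {})
    if initial_weights in PROVIDED_MODELS:
        # Provided models: vocabulary arguments are overridden to "auto".
        vocab_index2word = "auto"
        vocab_word2index = "auto"
    elif "auto" in (vocab_index2word, vocab_word2index):
        # Custom checkpoints: both vocabulary files must be given explicitly.
        raise ValueError("custom models need explicit vocabulary file paths")
    unknown = sorted(set(overrides) - set(DEFAULT_CONFIG))
    if unknown:
        raise KeyError(f"unknown hyperparameters: {unknown}")
    return {
        "initial_weights": initial_weights,
        "vocab_index2word": vocab_index2word,
        "vocab_word2index": vocab_word2index,
        "decision_threshold": "auto",  # recommended setting per the docs
        "torch_device": torch_device,
        "config": {**DEFAULT_CONFIG, **overrides},
    }
```

The resulting dictionary could then be unpacked into the constructor, for example CRAFTModel(**build_craft_kwargs("craft-wiki-finetuned", torch_device="cuda")), assuming ConvoKit and its CRAFT dependencies are installed.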

fit(contexts, val_contexts=None)

Fine-tune the CRAFT model, and save the best model according to validation performance.

Parameters
  • contexts – an iterator over context tuples, provided by the Forecaster framework

  • val_contexts – an iterator over context tuples to be used only for validation. IMPORTANT: this is marked Optional only for compatibility with the generic Forecaster API; CRAFT actually REQUIRES a validation set so leaving this parameter at None will raise an error!
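Because fit needs a validation set, one common way to obtain one is to hold out a fraction of conversations before fitting. The sketch below is an assumption-laden illustration, not ConvoKit code: split_conversation_ids is a hypothetical helper, the seed is arbitrary, and the 0.2 fraction simply mirrors the validation_size default in config (this page does not state how CRAFT itself uses that key). How the two ID sets are turned into contexts and val_contexts is handled by the Forecaster framework and is not shown here.

```python
import random


def split_conversation_ids(conversation_ids, validation_size=0.2, seed=0):
    """Split conversation IDs into disjoint (train, validation) sets.

    Splitting by conversation rather than by utterance keeps all contexts
    from one conversation on the same side of the split.
    """
    ids = sorted(conversation_ids)      # sort first so the split is reproducible
    if len(ids) < 2:
        raise ValueError("need at least two conversations to hold one out")
    rng = random.Random(seed)           # local RNG; does not touch global state
    rng.shuffle(ids)
    n_val = max(1, round(len(ids) * validation_size))
    n_val = min(n_val, len(ids) - 1)    # always leave something to train on
    return set(ids[n_val:]), set(ids[:n_val])
```

Splitting at the conversation level is a design choice: it avoids evaluating on contexts that share utterances with the training data.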

transform(contexts, forecast_attribute_name, forecast_prob_attribute_name)

Run a fine-tuned CRAFT model on the provided data.

Parameters
  • contexts – context tuples from the Forecaster framework

  • forecast_attribute_name – Forecaster will use this to look up the table column containing your model’s discretized predictions (see output specification below)

  • forecast_prob_attribute_name – Forecaster will use this to look up the table column containing your model’s raw forecast probabilities (see output specification below)

Returns

A Pandas DataFrame with one row for each context, indexed by the ID of that context’s current utterance. It contains two columns: one with raw probabilities, named according to forecast_prob_attribute_name, and one with discretized (binary) forecasts, named according to forecast_attribute_name.
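The shape of this output can be sketched in plain Python. The helper below is hypothetical and is not how CRAFTModel computes its forecasts; it only shows how raw probabilities and a decision threshold could yield the two columns. Whether the real comparison is strict (greater than) or inclusive, and whether the binary column holds integers or booleans, are assumptions made here.

```python
def discretize_forecasts(probs_by_utterance_id, forecast_attribute_name,
                         forecast_prob_attribute_name, decision_threshold=0.5):
    """Build one row per context, keyed by the current utterance's ID."""
    rows = {}
    for utt_id, prob in probs_by_utterance_id.items():
        rows[utt_id] = {
            # raw model output
            forecast_prob_attribute_name: prob,
            # assumed rule: positive only when strictly above the threshold
            forecast_attribute_name: int(prob > decision_threshold),
        }
    return rows


# Invented probabilities for two contexts, for illustration only.
example_rows = discretize_forecasts(
    {"utt_a": 0.9, "utt_b": 0.2}, "forecast", "forecast_prob"
)
```

If pandas is available, pandas.DataFrame.from_dict(example_rows, orient="index") turns this mapping into a table indexed by utterance ID with the two named columns.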