Transformer Forecaster Configuration¶
TransformerForecasterConfig is a dataclass that holds training and runtime settings for transformer-based
ForecasterModel implementations, including Transformer Encoder-based Forecasting Model and Transformer Decoder-based Forecasting Model.
Pass a config when constructing the model; the model stores it on self.config and uses it during fit(),
transform(), checkpoint loading, and inference.
Install LLM dependencies first: pip install convokit[llm].
Usage¶
output_dir is the only required field. Other hyperparameters have defaults:
from convokit.forecaster import TransformerForecasterConfig, TransformerEncoderModel
config = TransformerForecasterConfig(
output_dir="YOUR_SAVING_DIRECTORY",
per_device_batch_size=4,
gradient_accumulation_steps=1,
num_train_epochs=4,
learning_rate=6.7e-6,
random_seed=1,
context_mode="normal",
device="cuda",
)
model = TransformerEncoderModel("bert-base-uncased", config=config)
For inference-only runs (loading an existing checkpoint), you can pass a minimal config with output_dir,
context_mode, and device; training hyperparameters are ignored.
Configuration fields¶
output_dirDirectory for checkpoints, prediction CSVs, and training logs. Created automatically if it does not exist. Encoder models write
test_predictions.csvandval_predictions.csvhere; decoder models writepredictions.csv. Checkpoints are subfolders namedcheckpoint-*.per_device_batch_sizeBatch size per GPU during training. Effective batch size is
per_device_batch_size * gradient_accumulation_steps(times the number of GPUs if using distributed training).gradient_accumulation_stepsAccumulate gradients over this many steps before an optimizer update. Useful when GPU memory limits batch size; decoder models often use a larger value (e.g. 32) to simulate a larger batch.
num_train_epochsNumber of passes over the training data.
learning_rateOptimizer learning rate. Encoder and decoder models ship with different class-level defaults (see below).
random_seedSeed passed to the Hugging Face
Trainer(encoder models). Improves reproducibility across runs.deviceWhere tensors are placed at inference time, e.g.
"cuda","cuda:0", or"cpu".context_modeHow conversational input is built for each context tuple:
"normal"— include prior utterances plus the current utterance (default)."no-context"— use only the current utterance, ablating conversational history.
Must be one of these two values; anything else raises
ValueError.
Model defaults¶
If you omit config, each model class uses its own DEFAULT_CONFIG:
TransformerEncoderModel — per_device_batch_size=4, gradient_accumulation_steps=1,
num_train_epochs=1, learning_rate=6.7e-6.
TransformerDecoderModel — per_device_batch_size=2, gradient_accumulation_steps=32,
num_train_epochs=1, learning_rate=1e-4.
These defaults are starting points; demos and benchmarks often override them (see Transformer Forecaster demo).
See also¶
Forecaster — Forecaster wrapper and decision policies
Transformer Encoder-based Forecasting Model — BERT-style encoder forecasters
Transformer Decoder-based Forecasting Model — LLM decoder forecasters (Llama, Gemma, etc.)