Training

Training a Model

To train a model, you need to provide the following to the training_gpytorch.py script.

  • Training data (CSV or pickle format)

  • Model configuration: a YAML file specifying the architecture and training parameters.

YAML Configuration

data_conf:
  num_inputs: int
  num_outputs: int
  output_index: int | list[int]

transform_conf:
  transform_input:
    transform_data: bool
    transformer_class: str
    transformer_options:
      method: str
      standardize: bool
    columns: list | null
  transform_output:
    transform_data: bool
    transformer_class: str
    transformer_options: dict | null
    columns: list | null

training_conf:
  model:
    model_class: str
  likelihood:
    likelihood_class: str
    likelihood_options:
      option1: str | float
  optimizer:
    optimizer_class: str
    learning_rate: float
  learning_iterations: int
  botorch: bool
  debug: bool

testing_conf:
  test: bool
  test_size: float
  strat_shuffle_split: bool
  kfold: bool
  kfold_bins: int | null

data_conf (Required)

Specifies input/output structure:

  • num_inputs: Number of input columns (first N columns).

  • num_outputs: Number of output columns.

  • output_index: (Optional) Index or list of indices for outputs to train on. Index 0 corresponds to the first output column (i.e., column total_columns - num_inputs).

Example:

output_index: [1, 2]

This selects the second and third output variables.

training_conf (Required)

Specifies model details and training parameters:

  • model_class must match a model defined in gp_models.py

  • likelihood_class is the likelihood used by GPyTorch

  • optimizer_class is typically Adam, but can be any PyTorch optimizer

  • botorch: true enables Bayesian optimization using BoTorch instead of Adam, ignores the other optimizer options

Note

Training is always performed using double precision (float64).

For more on BoTorch, visit: https://botorch.org/

transform_conf (Optional)

Configures preprocessing steps for input and output data. Supports any valid sklearn.preprocessing transformer class.

Transformed and fitted transformers are stored in the final model pickle file.

testing_conf (Optional)

While GP models do not require a validation set, testing options are provided for evaluation and comparison purposes:

Three testing modes are available:

  • Random split

  • Stratified shuffle split

  • K-fold cross-validation

If test: true and test_size is set, a random train/test split is applied using scikit-learn.train_test_split.

Running the Training Script

The training process is initiated via a command-line interface or programmatically through the training_gpytorch.py script.

Command-line Example

python train.py \
    --input data.csv \
    --file-type csv \
    --config config.yaml \
    --output model.pth \
    --directory results/ \
    --test-set test.csv

Command-line Arguments:

  • --input: Path to training dataset

  • --file-type: Format of input data (csv or pickle)

  • --config: Path to YAML configuration file

  • --output: Name of output model file

  • --directory: Output directory (created if it does not exist)

  • --test-set: Optional test set path (incompatible with cross-validation)

Programmatic Example

from training_gpytorch import main

args = {
    "input": "data.csv",
    "file_type": "csv",
    "config": "config.yaml",
    "output": "model.pth",
    "directory": "results/",
    "test_set": None
}

main(args)

Main Workflow

The following is the main workflow used by the training script.

1. Load and validate input data
2. Parse YAML configuration
3. Split into inputs and outputs
4. Apply optional data transformations
5. Train GPR model with GPyTorch
6. Evaluate (optional)
7. Save model and metadata to disk

Note

Data is always converted to float64 tensors. Model training and evaluation are logged and optionally saved.