OpenFold3 Training

Welcome to the training documentation for OpenFold3. This guide covers how to train OpenFold3 on the PDB dataset from scratch or fine-tune from an existing checkpoint.

1. Prerequisites

An OpenFold3 Conda environment. See OpenFold3 Installation for instructions on building it.

2. Download the Dataset

The pre-processed PDB training dataset is hosted on AWS S3. Download it using the AWS CLI:

aws s3 sync s3://openfold3-data/pdb_training_set/ /shared/openfold3/pdb_training_set/ --no-sign-request
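
If you prefer to script the download, the sketch below builds the same aws s3 sync command in Python and can append the AWS CLI's --dryrun flag, which lists what would be transferred without copying anything. The helper names build_sync_command and run_sync are illustrative and not part of OpenFold3; the bucket and destination are the ones shown above.

```python
import shlex
import subprocess

# Bucket path taken from the command above.
S3_SOURCE = "s3://openfold3-data/pdb_training_set/"


def build_sync_command(dest: str, dry_run: bool = False) -> list:
    """Return the argument list for `aws s3 sync` (hypothetical helper)."""
    cmd = ["aws", "s3", "sync", S3_SOURCE, dest, "--no-sign-request"]
    if dry_run:
        # --dryrun prints the planned operations without transferring data.
        cmd.append("--dryrun")
    return cmd


def run_sync(dest: str, dry_run: bool = False) -> None:
    """Run the sync; raises CalledProcessError if the AWS CLI fails."""
    subprocess.run(build_sync_command(dest, dry_run), check=True)


# Print the command for inspection before running it.
print(shlex.join(build_sync_command("/shared/openfold3/pdb_training_set/", dry_run=True)))
```

Calling run_sync("/shared/openfold3/pdb_training_set/") performs the actual download and requires the AWS CLI to be on your PATH.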

3. Prepare the Training Config

The training configuration is stored in a YAML file that controls all aspects of training: model settings, dataset configuration, distributed training parameters, logging, and checkpointing.

Note: Make sure you update the paths to match your file locations. The examples below assume a /shared/openfold3 directory that’s accessible from all your training nodes.

Complete example YAML configurations for all stages of training are available in examples/training_yamls/.

3.1 Basic Training Config

Here’s a minimal configuration for single-GPU training:

experiment_settings:
  mode: train
  output_dir: ./test_train_output 
  restart_checkpoint_path: last

data_module_args:
  batch_size: 1
  num_workers: 1
  epoch_len: 500  # Checkpoint every 500 steps (epoch_len = effective batch size * number of steps)

logging_config:
  log_lr: false
  wandb_config: null

pl_trainer_args:
  devices: 1
  num_nodes: 1
  precision: bf16-mixed
  max_epochs: -1
  log_every_n_steps: 50

checkpoint_config:
  every_n_epochs: 1
  auto_insert_metric_name: false
  save_last: true
  save_top_k: -1

model_update:
  presets: 
    - train
  custom:
    architecture:
      shared:
        use_confidence_emb_prob: 0.8
        diffusion:
          use_conditioning_prob: 0.8

dataset_configs:
  train:
    weighted-pdb:
      dataset_class: WeightedPDBDataset
      weight: 1.0
      config:
        debug_mode: true
        template:
          n_templates: 4
          take_top_k: false
        crop:
          token_crop:
            enabled: true
            token_budget: 384
            crop_weights:
              contiguous: 0.2
              spatial: 0.4
              spatial_interface: 0.4
          chain_crop:
            enabled: true

  validation:
    val-weighted-pdb:
      dataset_class: ValidationPDBDataset
      config:
        debug_mode: true
        msa:
          subsample_main: false
        template:
          n_templates: 4
          take_top_k: true
        crop:
          token_crop:
            enabled: false

dataset_paths:
  weighted-pdb:
    alignments_directory: none
    alignment_db_directory: none 
    alignment_array_directory: /shared/openfold3/pdb_training_set/alignment_arrays
    dataset_cache_file: /shared/openfold3/pdb_training_set/dataset_caches/training_cache_with_templates.json
    target_structures_directory: /shared/openfold3/pdb_training_set/preprocessed_pdb_data/standard/structure_files
    target_structure_file_format: npz
    reference_molecule_directory: /shared/openfold3/pdb_training_set/preprocessed_pdb_data/standard/reference_mols
    template_cache_directory: /shared/openfold3/pdb_training_set/templates/train_template_cache
    template_structure_array_directory: /shared/openfold3/pdb_training_set/templates/template_structure_arrays
    template_structures_directory: none
    template_file_format: npz
    ccd_file: null

  val-weighted-pdb:
    alignments_directory: none
    alignment_db_directory: none 
    alignment_array_directory: /shared/openfold3/pdb_training_set/alignment_arrays
    dataset_cache_file: /shared/openfold3/pdb_training_set/dataset_caches/validation_cache_with_templates.json
    target_structures_directory: /shared/openfold3/pdb_training_set/preprocessed_pdb_data/standard/structure_files
    target_structure_file_format: npz
    reference_molecule_directory: /shared/openfold3/pdb_training_set/preprocessed_pdb_data/standard/reference_mols
    template_cache_directory: /shared/openfold3/pdb_training_set/templates/val_template_cache
    template_structure_array_directory: /shared/openfold3/pdb_training_set/templates/template_structure_arrays
    template_structures_directory: none
    template_file_format: npz
    ccd_file: null
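
Because a mistyped path usually only surfaces once data loading starts, it can help to check the dataset_paths section before launching. The sketch below is a minimal example of such a check, assuming the config has already been loaded into a Python dict (for example with PyYAML's yaml.safe_load, which is an assumption and not shown here). The function find_missing_paths is hypothetical and not part of OpenFold3; it treats none, null, and empty values as "not set", as the example config appears to.

```python
from pathlib import Path


def is_unset(value) -> bool:
    """True for values the example config uses to mean 'no path given'."""
    return value is None or str(value).strip().lower() in {"", "none", "null"}


def find_missing_paths(dataset_paths: dict) -> list:
    """Return 'dataset.key: path' entries whose path does not exist on disk."""
    missing = []
    for dataset_name, entries in dataset_paths.items():
        for key, value in entries.items():
            # Only *_directory and *_file keys hold paths;
            # *_file_format keys hold format names such as 'npz'.
            if not key.endswith(("_directory", "_file")):
                continue
            if is_unset(value):
                continue
            if not Path(value).exists():
                missing.append(f"{dataset_name}.{key}: {value}")
    return missing


# Example with two of the keys from the config above.
example = {
    "weighted-pdb": {
        "alignments_directory": "none",
        "alignment_array_directory": "/shared/openfold3/pdb_training_set/alignment_arrays",
        "template_file_format": "npz",
    }
}
for entry in find_missing_paths(example):
    print("missing:", entry)
```

Run it on every training node if /shared/openfold3 is a network mount, since a path can exist on one node and not on another.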

The example configurations in examples/training_yamls/ cover each stage of training:

initial_training.yml: Standard initial training configuration
finetune_1.yml: Fine-tuning stage 1 configuration
finetune_2.yml: Fine-tuning stage 2 configuration
finetune_3.yml: Fine-tuning stage 3 configuration (used in OF3p)

4. Launch Training

4.1 Single-GPU Training

For testing or debugging:

run_openfold train --runner-yaml training.yaml --seed 42

4.2 Multi-GPU Training

To train on multiple GPUs within a single node, configure your YAML:

pl_trainer_args:
  devices: 8      # GPUs per node
  num_nodes: 1

For multi-node distributed training, update your config as follows:

pl_trainer_args:
  devices: 8       # GPUs per node
  num_nodes: 32    # Total number of nodes

Then launch training with:

run_openfold train --runner-yaml training.yaml --seed 42
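
Scaling out changes how many samples are consumed per step, which matters when choosing epoch_len. The arithmetic below is a sketch that follows the comment in the basic config (epoch_len = effective batch size * number of steps). It assumes the effective batch size is batch_size * devices * num_nodes with no gradient accumulation; that assumption and both function names are illustrative and not taken from the OpenFold3 code.

```python
def effective_batch_size(batch_size: int, devices: int, num_nodes: int) -> int:
    """Samples per optimizer step, assuming no gradient accumulation."""
    return batch_size * devices * num_nodes


def epoch_len_for_steps(steps: int, batch_size: int, devices: int, num_nodes: int) -> int:
    """epoch_len that yields `steps` optimizer steps per epoch."""
    return steps * effective_batch_size(batch_size, devices, num_nodes)


# Single GPU, as in the basic config: 500 steps per epoch.
print(epoch_len_for_steps(500, batch_size=1, devices=1, num_nodes=1))   # 500

# 32 nodes with 8 GPUs each: effective batch size 256.
print(effective_batch_size(batch_size=1, devices=8, num_nodes=32))      # 256
print(epoch_len_for_steps(500, batch_size=1, devices=8, num_nodes=32))  # 128000
```

Under these assumptions, keeping epoch_len at 500 on the 32-node setup would end each epoch, and therefore write a checkpoint, after only about two steps.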

5. Monitoring Training

Enable Weights & Biases logging by configuring wandb_config in your YAML:

logging_config:
  log_lr: true
  wandb_config:
    project: openfold3-training
    entity: your-wandb-entity
    group: null
    id: null
    experiment_name: my-training-run

To use W&B, ensure you’re logged in:

wandb login

6. Checkpointing and Resuming

6.1 Checkpoint Configuration

checkpoint_config:
  every_n_epochs: 1              # Save checkpoint every N epochs
  auto_insert_metric_name: false # Don't add metric to filename
  save_last: true                # Always save 'last.ckpt'
  save_top_k: -1                 # Keep all checkpoints (-1) or top K

Checkpoints are saved to {output_dir}/checkpoints/.

6.2 Resuming Training

To resume from the last checkpoint:

experiment_settings:
  restart_checkpoint_path: last

To resume from a specific checkpoint:

experiment_settings:
  restart_checkpoint_path: /path/to/checkpoint.ckpt
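
When you need an explicit path rather than last, for example to copy or convert a checkpoint, a small helper can locate the most recently written file. The sketch below assumes checkpoints sit directly in {output_dir}/checkpoints/ as described in section 6.1 and end in .ckpt; the function latest_checkpoint is hypothetical and not part of OpenFold3.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the most recently modified .ckpt file, or None if there is none."""
    ckpt_dir = Path(output_dir) / "checkpoints"
    if not ckpt_dir.is_dir():
        return None
    candidates = list(ckpt_dir.glob("*.ckpt"))
    if not candidates:
        return None
    # Pick by modification time rather than by filename, since the
    # naming scheme of the files is not assumed here.
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Output directory from the basic config; prints None if nothing was saved yet.
print(latest_checkpoint("./test_train_output"))
```

The returned path can be used as the value of restart_checkpoint_path.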

7. Fine-tuning

7.1 Starting from a Pre-trained Checkpoint

To fine-tune from a pre-trained checkpoint, specify the checkpoint path and adjust training parameters as needed:

experiment_settings:
  mode: train
  output_dir: ./finetune_output
  seed: 42
  restart_checkpoint_path: /path/to/pretrained.ckpt

7.2 Running Inference with Fine-tuned Checkpoints

Training checkpoints are saved in a format that torch.load cannot read in weights-only mode. Loading one directly for inference fails with this error:

_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.

Run the following utility script to convert the training checkpoint into one that works for inference:

python scripts/dev/convert_ckpt_to_ema_only.py <wandb_id>/checkpoints/<epoch>-<step_num>.ckpt inference.ckpt
ls -l
-rw-rw-r-- 1 jandom jandom 2287987691 Jul  2 06:48 inference.ckpt.tmp
-rw-rw-r-- 1 jandom jandom 2287578277 Jul  2 06:48 inference.ckpt

The converted checkpoint can be passed directly to run_openfold predict:

run_openfold predict \
    --inference-ckpt-path inference.ckpt \
    ... # other options