OpenFold3 Training

Welcome to the training documentation for OpenFold3. This guide covers how to train OpenFold3 on the PDB dataset from scratch or fine-tune from an existing checkpoint.

1. Prerequisites

2. Download the Dataset

The pre-processed PDB training dataset is hosted on AWS S3. Download it using the AWS CLI:

aws s3 sync s3://openfold3-data/pdb_training_set/ /shared/openfold3/pdb_training_set/ --no-sign-request

3. Prepare the Training Config

The training configuration is stored in a YAML file that controls all aspects of training: model settings, dataset configuration, distributed training parameters, logging, and checkpointing.

Note: Make sure you update the paths to match your file locations. The examples below assume a /shared/openfold3 directory that’s accessible from all your training nodes.

Complete example YAML configurations for all stages of training are available in examples/training_yamls/.

3.1 Basic Training Config

Here’s a minimal configuration for single-GPU training:

experiment_settings:
  mode: train
  output_dir: ./test_train_output 
  restart_checkpoint_path: last

data_module_args:
  batch_size: 1
  num_workers: 1
  epoch_len: 500  # Ckpt every 500 steps (effective batch_size * # of steps)

logging_config:
  log_lr: false
  wandb_config: null

pl_trainer_args:
  devices: 1
  num_nodes: 1
  precision: bf16-mixed
  max_epochs: -1
  log_every_n_steps: 50

checkpoint_config:
  every_n_epochs: 1
  auto_insert_metric_name: false
  save_last: true
  save_top_k: -1

model_update:
  presets: 
    - train
  custom:
    architecture:
      shared:
        use_confidence_emb_prob: 0.8
        diffusion:
          use_conditioning_prob: 0.8

dataset_configs:
  train:
    weighted-pdb:
      dataset_class: WeightedPDBDataset
      weight: 1.0
      config:
        debug_mode: true
        template:
          n_templates: 4
          take_top_k: false
        crop:
          token_crop:
            enabled: true
            token_budget: 384
            crop_weights:
              contiguous: 0.2
              spatial: 0.4
              spatial_interface: 0.4
          chain_crop:
            enabled: true

  validation:
    val-weighted-pdb:
      dataset_class: ValidationPDBDataset
      config:
        debug_mode: true
        msa:
          subsample_main: false
        template:
          n_templates: 4
          take_top_k: true
        crop:
          token_crop:
            enabled: false

dataset_paths:
  weighted-pdb:
    alignments_directory: none
    alignment_db_directory: none 
    alignment_array_directory: /shared/openfold3/pdb_training_set/alignment_arrays
    dataset_cache_file: /shared/openfold3/pdb_training_set/dataset_caches/training_cache_with_templates.json
    target_structures_directory: /shared/openfold3/pdb_training_set/preprocessed_pdb_data/standard/structure_files
    target_structure_file_format: npz
    reference_molecule_directory: /shared/openfold3/pdb_training_set/preprocessed_pdb_data/standard/reference_mols
    template_cache_directory: /shared/openfold3/pdb_training_set/templates/train_template_cache
    template_structure_array_directory: /shared/openfold3/pdb_training_set/templates/template_structure_arrays
    template_structures_directory: none
    template_file_format: npz
    ccd_file: null

  val-weighted-pdb:
    alignments_directory: none
    alignment_db_directory: none 
    alignment_array_directory: /shared/openfold3/pdb_training_set/alignment_arrays
    dataset_cache_file: /shared/openfold3/pdb_training_set/dataset_caches/validation_cache_with_templates.json
    target_structures_directory: /shared/openfold3/pdb_training_set/preprocessed_pdb_data/standard/structure_files
    target_structure_file_format: npz
    reference_molecule_directory: /shared/openfold3/pdb_training_set/preprocessed_pdb_data/standard/reference_mols
    template_cache_directory: /shared/openfold3/pdb_training_set/templates/val_template_cache
    template_structure_array_directory: /shared/openfold3/pdb_training_set/templates/template_structure_arrays
    template_structures_directory: none
    template_file_format: npz
    ccd_file: null

For example configurations for all stages of training, please see examples/training_yamls/:

  • initial_training.yml: Standard initial training configuration

  • finetune_1.yml: Fine-tuning stage 1 configuration

  • finetune_2.yml: Fine-tuning stage 2 configuration

  • finetune_3.yml: Fine-tuning stage 3 configuration (used in OF3p)

4. Launch Training

4.1 Single-GPU Training

For testing or debugging:

run_openfold train --runner-yaml training.yaml --seed 42

4.2 Multi-GPU Training

To train on multiple GPUs within a single node, configure your YAML:

pl_trainer_args:
  devices: 8      # GPUs per node
  num_nodes: 1

For multi-node distributed training, update your config as follows:

pl_trainer_args:
  devices: 8       # GPUs per node
  num_nodes: 32    # Total number of nodes

Then launch training with:

run_openfold train --runner_yaml training.yaml --seed 42

5. Monitoring Training

Enable Weights & Biases logging by configuring wandb_config in your YAML:

logging_config:
  log_lr: true
  wandb_config:
    project: openfold3-training
    entity: your-wandb-entity
    group: null
    id: null
    experiment_name: my-training-run

To use W&B, ensure you’re logged in:

wandb login

6. Checkpointing and Resuming

6.1 Checkpoint Configuration

checkpoint_config:
  every_n_epochs: 1              # Save checkpoint every N epochs
  auto_insert_metric_name: false # Don't add metric to filename
  save_last: true                # Always save 'last.ckpt'
  save_top_k: -1                 # Keep all checkpoints (-1) or top K

Checkpoints are saved to {output_dir}/checkpoints/.

6.2 Resuming Training

To resume from the last checkpoint:

experiment_settings:
  restart_checkpoint_path: last

To resume from a specific checkpoint:

experiment_settings:
  restart_checkpoint_path: /path/to/checkpoint.ckpt

7. Fine-tuning

To fine-tune from a pre-trained checkpoint, specify the checkpoint path and adjust training parameters as needed:

experiment_settings:
  mode: train
  output_dir: ./finetune_output
  seed: 42
  restart_checkpoint_path: /path/to/pretrained.ckpt