OpenFold3 Configuration Reference

The full_config.yml file is a reference configuration that lists every available option for OpenFold3 inference and training experiments. It is located at examples/reference_full_config/full_config.yml.

1. Overview

The configuration file is organized into several main sections. Each section corresponds to a specific Pydantic model class defined in the OpenFold3 codebase. When you provide a runner.yml file, it overrides the default settings defined in validator.py.

2. Important Notes

  • Selective Configuration: Only specify the settings you want to override in your runner YAML file. All unspecified options will use their default values.

  • Command-line Priority: Command-line arguments take precedence and will override any values specified in the YAML file.

  • Reference Implementation: The full configuration file is a reference only. Create your own simplified runner YAML based on your specific needs. See examples/example_runner_yamls/ for common usage examples.
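
As a minimal sketch of selective configuration, a runner YAML that changes only the output directory and applies the predict preset could look like the following. The key names are taken from the sections of this reference; the ./results path is a placeholder, and every option left out keeps its default value.

```yaml
# Minimal runner YAML: only the overridden settings are listed.
experiment_settings:
  output_dir: ./results   # placeholder path
model_update:
  presets:
    - predict             # described in section 3.3 as required for inference
```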

3. Configuration Sections

3.1. Experiment Settings (experiment_settings)

Defines overall experiment parameters, including execution mode and seed configuration.

Pydantic Model: InferenceExperimentSettings

All Options:

  • mode (ValidModeType): Experiment mode - predict or train (default: predict)

  • output_dir (Path): Directory where outputs will be written (default: ./)

  • log_dir (Path | None): Directory for logs (default: null)

  • seeds (int | list[int]): Starting seed or list of random seeds for inference (default: [42])

  • num_seeds (int | None): Number of seeds to generate if only a starting seed is provided (default: null)

  • use_msa_server (bool): Whether to use ColabFold MSA server (default: true)

  • use_templates (bool): Whether to use template structures (default: true)

  • skip_existing (bool): Skip results that already exist (default: false)

Example:

experiment_settings:
  mode: predict
  output_dir: ./results
  seeds: [42, 100, 200]
  use_msa_server: true
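
The seeds option also accepts a single starting seed, which can be combined with num_seeds. The sketch below assumes the two options are used together as the option list describes; how the additional seeds are derived from the starting seed is not specified in this reference.

```yaml
experiment_settings:
  mode: predict
  seeds: 42        # single starting seed instead of an explicit list
  num_seeds: 5     # illustrative: number of seeds to generate
```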

3.2. PyTorch Lightning Trainer Args (pl_trainer_args)

Configures the PyTorch Lightning trainer for distributed training and multi-GPU inference.

Pydantic Model: PlTrainerArgs

All Options:

  • max_epochs (int): Maximum number of training epochs (default: 1000)

  • accelerator (str): Device type - gpu or cpu (default: gpu)

  • precision (int | str): Numerical precision - 32-true, 16-mixed, etc. (default: 32-true)

  • num_nodes (int): Number of compute nodes (default: 1)

  • devices (int): Number of GPUs per node (default: 1)

  • profiler (str | None): Profiler to use (default: null)

  • log_every_n_steps (int): Logging frequency in steps (default: 1)

  • enable_checkpointing (bool): Enable checkpointing (default: true)

  • enable_model_summary (bool): Enable model summary (default: false)

  • deepspeed_config_path (Path | None): Path to DeepSpeed configuration file (default: null)

  • distributed_timeout (timedelta | None): Timeout for distributed operations (default: PT30M)

  • mpi_plugin (bool): Use MPI plugin (default: false)

Example:

pl_trainer_args:
  devices: 4
  num_nodes: 1
  precision: 16-mixed

3.3. Model Update (model_update)

Specifies model presets and custom architecture modifications.

Pydantic Model: ModelUpdate

All Options:

  • presets (list[str]): List of model presets to apply (default: [])

    • predict: Inference configuration (required for inference)

    • low_mem: Low memory mode for large structures

  • custom (dict): Custom model configuration overrides (default: {})

Example:

model_update:
  presets:
    - predict
    - low_mem
  custom: {}

3.4. Checkpoint and Cache Paths

Pydantic Model: Fields on InferenceExperimentConfig

All Options:

  • inference_ckpt_path (Path | None): Path to model checkpoint file (.pt file)

    • Default: $HOME/.openfold3/of3_ft3_v1.pt

    • Will download parameters if not present

  • inference_ckpt_name (str | None): Name of the model checkpoint to use.

    • Default: openfold3_p2_v1

    • Must be a key in OPENFOLD_MODEL_CHECKPOINT_REGISTRY (https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/entry_points/parameters.py#L29)

  • cache_path (Path | None): Directory for storing cached model parameters

    • Default: $HOME/.openfold3/
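
Example:

Because these options are described as fields on InferenceExperimentConfig, the sketch below assumes they are written as top-level keys in the runner YAML rather than under a section name. The cache directory is a placeholder.

```yaml
# Assumed placement: top level of the runner YAML.
inference_ckpt_name: openfold3_p2_v1   # default name listed above
cache_path: /path/to/openfold3_cache   # placeholder directory
```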


3.5. Data Module Args (data_module_args)

Configures data loading and processing.

Pydantic Model: DataModuleArgs

All Options:

  • batch_size (int): Batch size (default: 1)

  • data_seed (int | None): Random seed for data processing (default: 42)

  • num_workers (int): Number of data loading workers (default: 10)

  • num_workers_validation (int): Number of workers for validation (default: 4)

  • epoch_len (int): Length of training epoch (default: 4)

Example:

data_module_args:
  batch_size: 1
  num_workers: 8

3.X Checkpoint Configuration (checkpoint_config)

Configures checkpoint-writing settings, which are passed to the pl.ModelCheckpoint callback.

3.6. Dataset Config Kwargs (dataset_config_kwargs)

Configures MSA and template feature generation.

Pydantic Model: InferenceDatasetConfigKwargs

All Options:

  • ccd_file_path (FilePath | None): Path to Chemical Component Dictionary file, uses CCD from Biotite if null (default: null)

  • msa (MSASettings): MSA processing settings (see below)

  • template (TemplateSettings): Template processing settings (see below)

3.6.1. MSA Settings (msa)

Controls how MSAs are parsed and processed into features.

Pydantic Model: MSASettings

All Options:

  • max_rows_paired (int): Maximum rows for paired MSAs (default: 8191)

  • max_rows (int): Maximum total MSA rows (default: 16384)

  • subsample_with_bands (bool): Use MMseqs2-style subsampling (default: false; not currently supported)

  • min_chains_paired_partial (int): Minimum chains for partial pairing (default: 2)

  • pairing_mask_keys (list[str]): Masks to apply during pairing (default: ["shared_by_two", "less_than_600"])

  • moltypes (list[MoleculeType]): Molecule types to process (default: [0, 1] for protein and RNA)

  • max_seq_counts (dict): Max sequences per MSA file (default includes: uniref90_hits: 10000, uniprot_hits: 50000, etc.)

  • msas_to_pair (list[str]): MSA files to use for online pairing (default: ["uniprot_hits", "uniprot"])

  • aln_order (list): Order to vertically concatenate MSA files (default includes: uniref90_hits, bfd_uniclust_hits, etc.)

  • paired_msa_order (list): Order to vertically concatenate pre-paired MSAs (default: ["colabfold_paired"])

Example:

dataset_config_kwargs:
  msa:
    max_rows: 16384
    max_rows_paired: 8191
    moltypes: [0, 1]  # protein and RNA
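
As a further sketch, the MSA settings could be restricted to protein chains with a smaller per-file sequence cap. The keys come from the option list above; the value 5000 is illustrative, and whether a partial max_seq_counts dict merges with or replaces the defaults is not stated in this reference.

```yaml
dataset_config_kwargs:
  msa:
    moltypes: [0]             # protein only
    max_seq_counts:
      uniref90_hits: 5000     # illustrative cap (listed default: 10000)
    msas_to_pair:
      - uniprot_hits
```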

3.6.2. Template Settings (template)

Controls template structure processing.

Pydantic Model: TemplateSettings

All Options:

  • n_templates (int): Number of templates to use (default: 4)

  • take_top_k (bool): Use top K templates by quality (default: false)

  • min_n_tokens_per_chain (int): Minimum number of tokens a chain has to have for it to get template features (default: 4)

  • distogram (TemplateDistogramSettings): Distogram binning settings

    • min_bin (float): Minimum distance bin (default: 3.25)

    • max_bin (float): Maximum distance bin (default: 50.75)

    • n_bins (int): Number of bins (default: 39)

Example:

dataset_config_kwargs:
  template:
    n_templates: 4
    take_top_k: true

3.7. Output Writer Settings (output_writer_settings)

Configures the format of output files.

Pydantic Model: OutputWritingSettings

All Options:

  • structure_format (Literal["pdb", "cif", "cif.gz"]): Output format (default: cif)

  • full_confidence_output_format (Literal["json", "npz"]): Confidence output format (default: json)

  • write_features (bool): Write intermediate features (default: false)

  • write_latent_outputs (bool): Write model intermediate outputs (default: false)

  • write_full_confidence_scores (bool): Write full confidence scores, e.g. PAE, PDE, pLDDT (default: true)

Example:

output_writer_settings:
  structure_format: pdb
  full_confidence_output_format: json

3.8. MSA Computation Settings (msa_computation_settings)

Configures the ColabFold MSA server integration.

Pydantic Model: MsaComputationSettings

All Options:

  • msa_file_format (Literal["npz", "a3m"]): Format for saved MSAs (default: npz)

  • server_user_agent (str): User agent string (default: openfold)

  • server_url (Url): ColabFold server URL (default: https://api.colabfold.com)

  • save_mappings (bool): Save sequence ID mappings (default: true)

  • msa_output_directory (Path): Directory for MSA outputs (default: <temporary directory>/of3-of-<user>/colabfold_msas)

  • cleanup_msa_dir (bool): Delete MSAs after processing (default: true)

Example:

msa_computation_settings:
  msa_file_format: npz
  cleanup_msa_dir: false
  msa_output_directory: /path/to/msas

3.9. Template Preprocessor Settings (template_preprocessor_settings)

Configures template structure preprocessing and filtering.

Pydantic Model: TemplatePreprocessorSettings

All Options:

  • mode (Literal["train", "predict"]): Processing mode (default: predict)

  • moltypes (list[MoleculeType]): Molecule types to process (default: [0] for protein)

  • max_sequences_parse (int): Max sequences to parse (default: 200)

  • max_seq_id (float | None): Maximum sequence identity threshold (default: null)

  • min_align (float | None): Minimum alignment coverage (default: null)

  • min_len (int | None): Minimum aligned residues (default: null)

  • max_release_date (datetime | None): Maximum template release date (default: null)

  • min_release_date_diff (int | None): Minimum days between query and template release (default: null)

  • max_templates (int): Maximum templates per chain (default: 20)

  • fetch_missing_structures (bool): Fetch missing structures from PDB (default: true)

  • create_precache (bool): Cache template structure data (default: false)

  • preparse_structures (bool): Preparse structures into .npz files (default: false)

  • create_logs (bool): Create preprocessing logs (default: false)

  • n_processes (int): Number of preprocessing processes (default: 1)

  • chunksize (int): Tasks per worker in multiprocessing (default: 1)

  • structure_directory (Path | None): Directory for template structures (default: null)

  • structure_file_format (str): File format of structures - cif or pdb (default: cif)

  • output_directory (Path | None): Output directory for templates (default: null)

  • precache_directory (Path | None): Directory for template precache (default: null)

  • structure_array_directory (Path | None): Directory for preparsed structures (default: null)

  • cache_directory (Path | None): Directory for template cache (default: null)

  • log_directory (Path | None): Directory for logs (default: null)

  • ccd_file_path (Path | None): Path to Chemical Component Dictionary file (default: null)

Example:

template_preprocessor_settings:
  mode: predict
  max_templates: 20
  fetch_missing_structures: true
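
For runs that should rely only on local template structures, a sketch along these lines may be useful. The directory is a placeholder, the cutoff date and process count are arbitrary illustrations, and the quoted ISO 8601 string is an assumption about how the datetime field is written.

```yaml
template_preprocessor_settings:
  fetch_missing_structures: false               # do not fetch from PDB
  structure_directory: /path/to/template_cifs   # placeholder directory
  structure_file_format: cif
  max_release_date: "2021-09-30T00:00:00"       # illustrative cutoff date
  n_processes: 4                                # illustrative process count
```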

4. Default Values Reference

For the complete list of default values, see the Pydantic model classes in: