# OpenFold3 Configuration Reference

The [`full_config.yml` file](https://github.com/aqlaboratory/openfold-3/blob/main/examples/reference_full_config/full_config.yml) is a comprehensive reference configuration file that demonstrates all available configuration options for OpenFold3 inference and training experiments. This file is located at `examples/reference_full_config/full_config.yml` and serves as a complete example of all configurable settings.

## 1. Overview

The configuration file is organized into several main sections. Each section corresponds to a specific Pydantic model class defined in the OpenFold3 codebase. When you provide a `runner.yml` file, it **overrides** the default settings defined in [`validator.py`](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/entry_points/validator.py).

## 2. Important Notes

- **Selective Configuration**: Only specify the settings you want to override in your runner YAML file. All unspecified options will use their default values.
- **Command-line Priority**: Command-line arguments take precedence and will override any values specified in the YAML file.
- **Reference Implementation**: The full configuration file serves as a reference - create your own simplified runner YAML based on your specific needs. See `examples/example_runner_yamls/` for common usage examples.

## 3. Configuration Sections

### 3.1. Experiment Settings (`experiment_settings`)

Defines overall experiment parameters, including execution mode and seed configuration.

**Pydantic Model**: [`InferenceExperimentSettings`](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/entry_points/validator.py#L247)

**All Options**:
- `mode` *(ValidModeType)*: Experiment mode - `predict` or `train` (default: `predict`)
- `output_dir` *(Path)*: Directory where outputs will be written (default: `./`)
- `log_dir` *(Path | None)*: Directory for logs (default: `null`)
- `seeds` *(int | list[int])*: Starting seed or list of random seeds for inference (default: `[42]`)
- `num_seeds` *(int | None)*: Number of seeds to generate if only a starting seed is provided (default: `null`)
- `use_msa_server` *(bool)*: Whether to use ColabFold MSA server (default: `true`)
- `use_templates` *(bool)*: Whether to use template structures (default: `true`)
- `skip_existing` *(bool)*: Skip results that already exist (default: `false`)

**Example**:
```yaml
experiment_settings:
  mode: predict
  output_dir: ./results
  seeds: [42, 100, 200]
  use_msa_server: true
```

---

### 3.2. PyTorch Lightning Trainer Args (`pl_trainer_args`)

Configures the PyTorch Lightning trainer for distributed training and multi-GPU inference.

**Pydantic Model**: [`PlTrainerArgs`](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/entry_points/validator.py#L121)

**All Options**:
- `max_epochs` *(int)*: Maximum number of training epochs (default: `1000`)
- `accelerator` *(str)*: Device type - `gpu` or `cpu` (default: `gpu`)
- `precision` *(int | str)*: Numerical precision - `32-true`, `16-mixed`, etc. (default: `32-true`)
- `num_nodes` *(int)*: Number of compute nodes (default: `1`)
- `devices` *(int)*: Number of GPUs per node (default: `1`)
- `profiler` *(str | None)*: Profiler to use (default: `null`)
- `log_every_n_steps` *(int)*: Logging frequency in steps (default: `1`)
- `enable_checkpointing` *(bool)*: Enable checkpointing (default: `true`)
- `enable_model_summary` *(bool)*: Enable model summary (default: `false`)
- `deepspeed_config_path` *(Path | None)*: Path to DeepSpeed configuration file (default: `null`)
- `distributed_timeout` *(timedelta | None)*: Timeout for distributed operations (default: `PT30M`)
- `mpi_plugin` *(bool)*: Use MPI plugin (default: `false`)

**Example**:
```yaml
pl_trainer_args:
  devices: 4
  num_nodes: 1
  precision: 16-mixed
```

---

### 3.3. Model Update (`model_update`)

Specifies model presets and custom architecture modifications.

**Pydantic Model**: [`ModelUpdate`](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/projects/of3_all_atom/project_entry.py#L28)

**All Options**:
- `presets` *(list[str])*: List of model presets to apply (default: `[]`)
  - `predict`: Inference configuration (required for inference)
  - `low_mem`: Low memory mode for large structures
- `custom` *(dict)*: Custom model configuration overrides (default: `{}`)

**Example**:
```yaml
model_update:
  presets:
    - predict
    - low_mem
  custom: {}
```

---

### 3.4. Checkpoint and Cache Paths

**Pydantic Model**: Fields on [`InferenceExperimentConfig`](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/entry_points/validator.py#L347)

**All Options**:
- `inference_ckpt_path` *(Path | None)*: Path to model checkpoint file (`.pt` file)
  - Default: `$HOME/.openfold3/of3_ft3_v1.pt`
  - Will download parameters if not present
- `inference_ckpt_name` *(str | None)*: Name of the model checkpoint to use.
  - Default: `openfold3_p2_v1`
  - Must be a key in `OPENFOLD_MODEL_CHECKPOINT_REGISTRY`(https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/entry_points/parameters.py#L29)
- `cache_path` *(Path | None)*: Directory for storing cached model parameters
  - Default: `$HOME/.openfold3/`

---

### 3.5. Data Module Args (`data_module_args`)

Configures data loading and processing.

**Pydantic Model**: [`DataModuleArgs`](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/entry_points/validator.py#L110)

**All Options**:
- `batch_size` *(int)*: Batch size (default: `1`)
- `data_seed` *(int | None)*: Random seed for data processing (default: `42`)
- `num_workers` *(int)*: Number of data loading workers (default: `10`)
- `num_workers_validation` *(int)*: Number of workers for validation (default: `4`)
- `epoch_len` *(int)*: Length of training epoch (default: `4`)

**Example**:
```yaml
data_module_args:
  batch_size: 1
  num_workers: 8
```

---
### 3.X Checkpoint Confiugration (`checkpoint_config`)

Configures Checkpoint writing settings, which are passed to [pl.ModelCheckpoint callback](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.callbacks.ModelCheckpoint.html). 


### 3.6. Dataset Config Kwargs (`dataset_config_kwargs`)

Configures MSA and template feature generation.

**Pydantic Model**: [`InferenceDatasetConfigKwargs`](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/projects/of3_all_atom/config/dataset_configs.py#L270)

**All Options**:
- `ccd_file_path` *(FilePath | None)*: Path to Chemical Component Dictionary file, uses CCD from Biotite if null (default: `null`)
- `msa` *(MSASettings)*: MSA processing settings (see below)
- `template` *(TemplateSettings)*: Template processing settings (see below)

#### 3.6.1. MSA Settings (`msa`)

Controls how MSAs are parsed and processed into features.

**Pydantic Model**: [`MSASettings`](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/projects/of3_all_atom/config/dataset_config_components.py#L32)

**All Options**:
- `max_rows_paired` *(int)*: Maximum rows for paired MSAs (default: `8191`)
- `max_rows` *(int)*: Maximum total MSA rows (default: `16384`)
- `subsample_with_bands` *(bool)*: Use MMSeqs2-style subsampling (default: `false`, not currently supported)
- `min_chains_paired_partial` *(int)*: Minimum chains for partial pairing (default: `2`)
- `pairing_mask_keys` *(list[str])*: Masks to apply during pairing (default: `["shared_by_two", "less_than_600"]`)
- `moltypes` *(list[MoleculeType])*: Molecule types to process (default: `[0, 1]` for protein and RNA)
- `max_seq_counts` *(dict)*: Max sequences per MSA file (default includes: uniref90_hits: 10000, uniprot_hits: 50000, etc.)
- `msas_to_pair` *(list[str])*: MSA files to use for online pairing (default: `["uniprot_hits", "uniprot"]`)
- `aln_order` *(list)*: Order to vertically concatenate MSA files (default includes: uniref90_hits, bfd_uniclust_hits, etc.)
- `paired_msa_order` *(list)*: Order to vertically concatenrate pre-paired MSAs (default: `["colabfold_paired"]`)

**Example**:
```yaml
dataset_config_kwargs:
  msa:
    max_rows: 16384
    max_rows_paired: 8191
    moltypes: [0, 1]  # protein and RNA
```

#### 3.6.2. Template Settings (`template`)

Controls template structure processing.

**Pydantic Model**: [`TemplateSettings`](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/projects/of3_all_atom/config/dataset_config_components.py#L113)

**All Options**:
- `n_templates` *(int)*: Number of templates to use (default: `4`)
- `take_top_k` *(bool)*: Use top K templates by quality (default: `false`)
- `min_n_tokens_per_chain` *(int)*: Minimum number of tokens a chain has to have for it to get template features (default: `4`)
- `distogram` *(TemplateDistogramSettings)*: Distogram binning settings
  - `min_bin` *(float)*: Minimum distance bin (default: `3.25`)
  - `max_bin` *(float)*: Maximum distance bin (default: `50.75`)
  - `n_bins` *(int)*: Number of bins (default: `39`)

**Example**:
```yaml
dataset_config_kwargs:
  template:
    n_templates: 4
    take_top_k: true
```

---

### 3.7. Output Writer Settings (`output_writer_settings`)

Configures the format of output files.

**Pydantic Model**: [`OutputWritingSettings`](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/entry_points/validator.py#L141)

**All Options**:
- `structure_format` *(Literal["pdb", "cif", "cif.gz"])*: Output format (default: `cif`)
- `full_confidence_output_format` *(Literal["json", "npz"])*: Confidence output format (default: `json`)
- `write_features` *(bool)*: Write intermediate features (default: `false`)
- `write_latent_outputs` *(bool)*: Write model intermediate outputs (default: `false`)
- `write_full_confidence_scores` *(bool)*: Write full confidence scores, e.g. PAE, PDE, PLDDT (default: `true`)

**Example**:
```yaml
output_writer_settings:
  structure_format: pdb
  full_confidence_output_format: json
```

---

### 3.8. MSA Computation Settings (`msa_computation_settings`)

Configures the ColabFold MSA server integration.

**Pydantic Model**: [`MsaComputationSettings`](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/core/data/tools/colabfold_msa_server.py#L904)

**All Options**:
- `msa_file_format` *(Literal["npz", "a3m"])*: Format for saved MSAs (default: `npz`)
- `server_user_agent` *(str)*: User agent string (default: `openfold`)
- `server_url` *(Url)*: ColabFold server URL (default: `https://api.colabfold.com`)
- `save_mappings` *(bool)*: Save sequence ID mappings (default: `true`)
- `msa_output_directory` *(Path)*: Directory for MSA outputs (default: `temporary directory/of3-of-<user>/colabfold_msas`)
- `cleanup_msa_dir` *(bool)*: Delete MSAs after processing (default: `true`)

**Example**:
```yaml
msa_computation_settings:
  msa_file_format: npz
  cleanup_msa_dir: false
  msa_output_directory: /path/to/msas
```

---

### 3.9. Template Preprocessor Settings (`template_preprocessor_settings`)

Configures template structure preprocessing and filtering.

**Pydantic Model**: [`TemplatePreprocessorSettings`](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/core/data/pipelines/preprocessing/template.py#L1459)

**All Options**:
- `mode` *(Literal["train", "predict"])*: Processing mode (default: `predict`)
- `moltypes` *(list[MoleculeType])*: Molecule types to process (default: `[0]` for protein)
- `max_sequences_parse` *(int)*: Max sequences to parse (default: `200`)
- `max_seq_id` *(float | None)*: Maximum sequence identity threshold (default: `null`)
- `min_align` *(float | None)*: Minimum alignment coverage (default: `null`)
- `min_len` *(int | None)*: Minimum aligned residues (default: `null`)
- `max_release_date` *(datetime | None)*: Maximum template release date (default: `null`)
- `min_release_date_diff` *(int | None)*: Minimum days between query and template release (default: `null`)
- `max_templates` *(int)*: Maximum templates per chain (default: `20`)
- `fetch_missing_structures` *(bool)*: Fetch missing structures from PDB (default: `true`)
- `create_precache` *(bool)*: Cache template structure data (default: `false`)
- `preparse_structures` *(bool)*: Preparse structures into .npz files (default: `false`)
- `create_logs` *(bool)*: Create preprocessing logs (default: `false`)
- `n_processes` *(int)*: Number of preprocessing processes (default: `1`)
- `chunksize` *(int)*: Tasks per worker in multiprocessing (default: `1`)
- `structure_directory` *(Path | None)*: Directory for template structures (default: `null`)
- `structure_file_format` *(str)*: File format of structures - `cif` or `pdb` (default: `cif`)
- `output_directory` *(Path | None)*: Output directory for templates (default: `null`)
- `precache_directory` *(Path | None)*: Directory for template precache (default: `null`)
- `structure_array_directory` *(Path | None)*: Directory for preparsed structures (default: `null`)
- `cache_directory` *(Path | None)*: Directory for template cache (default: `null`)
- `log_directory` *(Path | None)*: Directory for logs (default: `null`)
- `ccd_file_path` *(Path | None)*: Path to Chemical Component Dictionary file (default: `null`)

**Example**:
```yaml
template_preprocessor_settings:
  mode: predict
  max_templates: 20
  fetch_missing_structures: true
```

---

## 4. Default Values Reference

For the complete list of default values, see the Pydantic model classes in:
- [`openfold3/entry_points/validator.py`](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/entry_points/validator.py) - Main configuration classes
- [`openfold3/projects/of3_all_atom/config/dataset_config_components.py`](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/projects/of3_all_atom/config/dataset_config_components.py) - MSA and template settings
- [`openfold3/core/data/tools/colabfold_msa_server.py`](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/core/data/tools/colabfold_msa_server.py) - MSA server settings
- [`openfold3/core/data/pipelines/preprocessing/template.py`](http://github.com/aqlaboratory/openfold-3/blob/main/openfold3/core/data/pipelines/preprocessing/template.py) - Template preprocessing settings