# OpenFold3 Input Format ## 1. High-level Structure The OpenFold3 inference pipeline takes a single JSON file as input, specifying the data and options required for structure prediction. This file can define multiple prediction targets (`queries`), which can be proteins, including individual protein chains and complexes, nucleic acids, and ligands. Multiple queries can be combined as follows: ```text { "queries": { "query_1": { ... }, "query_2": { ... } } } ``` **Required Input Fields** - `queries` *(dict, required)* - A dictionary containing one or more prediction targets. Each entry defines a single query (e.g., a protein or protein complex). - The keys (e.g., `query_1`, `query_2`, ...) uniquely identify each query and are used to name the corresponding output files. **Reference Fields** - `seeds` *(list, reference)* - This section is present in the logged `inference_query_set.json` that is part of the model run output as a reference of the seeds used. - However, this field should not be included as part of the input of an input `query.json`. Instead, the user should use either the `--num-model-seeds` command line argument or specify seeds in the `runner.yml`. See {ref}`Custom Model Seeds for more details ` (2-queries)= ## 2. Queries Each entry in the ```queries``` dictionary specifies a single bioassembly, which will be predicted in one forward pass of OpenFold3. To run **batch inference**, include multiple such query entries (e.g., ```query_1```, ```query_2```, ...) in the top-level ```queries``` field of the input JSON. The key of each query (e.g., ```query_1```) is used to name output files or directories -- either by prefixing output files or creating a directory named after the key. Each query entry is a dictionary with the following structure: ``` "query_1": { "chains": [ { ... }, { ... } ], } ``` In the current inference release, the only required field is: - `chains` *(list of dict, required)* - A list of chain definitions, where each sub-dictionary specifies one chain in the assembly. See {ref}`Section 3 <3-chains>` for a full breakdown of chain-level fields. (3-chains)= ## 3. Chains Each entry in the ```chains``` list defines one or more instances of a molecular chain in the bioassembly. The required and optional fields vary depending on the type of molecule (```protein```, ```rna```, ```dna```, or ```ligand```). All chains must define a unique ```chain_ids``` field and appropriate sequence or structure information. Below are the supported molecule types and their associated schema: (31-protein-chains)= ### 3.1. Protein chains ``` { "molecule_type": "protein", "chain_ids": "A", "description": "Optional metadata example", "sequence": "PVLSCGEWQCL", "use_msas": true, "use_main_msas": true, "use_paired_msas": true, "main_msa_file_paths": "/absolute/path/to/main_msas", "paired_msa_file_paths": "/absolute/path/to/paired_msas", "template_alignment_file_path": "/absolute/path/to/template_msa", "template_entry_chain_ids": ["entry1_A", "entry2_B", "entry3_A"], } ``` - `molecule_type` *(str, required)* - Must be "protein". - `chain_ids` *(str | list[str], required)* - One or more identifiers for this chain. Used to map sequences to structure outputs. - `description` *(str | None, optional, default = null) - Optional metadata to provide for each chain. - `sequence` *(str, required)* - Amino acid sequence (1-letter codes), supporting standard residues, X (unknown), and U (selenocysteine). - `non_canonical_residues` *(dict, optional, default = null)* - A dictionary mapping residue indices (1-based) to non-canonical residue names. - Note that MSA computation will only refer to the primary `sequence`. - Example: `{"1": "MHO", "5": "SEP"}` - `use_msas` *(bool, optional, default = true)* - Enables MSA usage. If false, empty MSA features are provided to the model. We suggest running MSA-free inference mode via a dummy MSA with only the query sequence and completely omitting MSA inputs is {ref}`discouraged <323-inference-without-msas>` if the goal is to obtain the highest-accuracy structures. - `use_main_msas` *(bool, optional, default = true)* - Controls whether to use unpaired MSAs. - For monomers or homomers, disabling this results in using only the single sequence(s) as MSA features. - For heteromers, disabling this results in using only the paired MSAs, including the query sequences, as MSA features. - `use_paired_msas` *(bool, optional, default = true)* - Controls the use of explicitly paired MSAs. - For homomers, main MSAs are internally concatenated and treated as implicitly paired, so disabling use_paired_msas does not change their MSA features. - For heteromers, paired alignments across chains are used if available and disabling use_paired_msas results in using only main MSAs as MSA features. - `main_msa_file_paths` *(str | list[str], optional, default = null)* - Path or list of paths to the MSA files for this chain. - Use this field only when running inference with **precomputed MSAs**. See the {doc}`Precomputed MSA documentation ` for details. - If using the ColabFold MSA server (`--use_msa_server=True`), this field will be automatically populated and will **override any user-provided path**. - `paired_msa_file_paths` *(str | list[str], optional, default = null)* - Path or list of paths to paired MSA files for this chain, pre-paired in the context of the full complex. - Use this field only when running inference with **precomputed MSAs** and the corresponding query has at least two unique polymer chains. See the {doc}`Precomputed MSA documentation ` for details. - If not provided, online MSA pairing can still be performed for protein chains if species information is available in one or more main MSA files per chain. See {ref}`Online Cross-Chain Pairing in OF3 <3-online-msa-pairing>` for details. - If using the ColabFold MSA server, this field is automatically populated and will **override any user-provided path**. - `template_alignment_file_path` *(str, optional, default = null)* - Path to template alignment file for this chain. - Use this field only when running inference with **precomputed alignments**. See the {doc}`Running with Templates Documentation ` for details. - If using the ColabFold MSA server, this field is automatically populated and will **override any user-provided path**. - `template_entry_chain_ids` *(str, optional, default = null)* - !!! Currently, only populated automatically !!! - A list of template PDB entry + chain IDs to use for this chain. - Use this field only when running inference with **precomputed alignments**. See the {doc}`Running with Templates Documentation ` for details. - If using the ColabFold MSA server, this field is automatically populated and will **override any user-provided path**. ### 3.2. RNA Chains ``` { "molecule_type": "rna", "chain_ids": "E", "sequence": "AGCU", "use_msas": true, "use_main_msas": true, "main_msa_file_paths": "/absolute/path/to/main_msa.sto/a3m", } ``` - `molecule_type` *(str, required)* - Must be "rna". - `chain_ids` *(str | list[str], required)* - One or more identifiers for this chain. Used to map sequences to structure outputs. - `sequence` *(str, required)* - Nucleic acid sequence (1-letter codes). - `use_msas` *(bool, optional, default = true)* - Enables MSA usage. If false, empty MSA features are provided to the model. We suggest running MSA-free inference mode via a dummy MSA with only the query sequence and completely omitting MSA inputs is {ref}`discouraged <323-inference-without-msas>` if the goal is to obtain the highest-accuracy structures. - `use_main_msas` *(bool, optional, default = true)* - Controls whether to use unpaired MSAs. For monomers or homomers, disabling this results in using only the single sequence. - `main_msa_file_paths` *(str | list[str], optional, default = null)* - Path or list of paths to the MSA files for this chain. - Use this field only when running inference with **precomputed MSAs**. See the {doc}`Precomputed MSA documentation ` for details. ### 3.3. DNA Chains ``` { "molecule_type": "dna", "chain_ids": "C", "sequence": "GACCTCT", } ``` - `molecule_type` *(str, required)* - Must be "dna". - `chain_ids` *(str | list[str], required)* - One or more identifiers for this chain. Used to map sequences to structure outputs. - `sequence` *(str, required)* - Nucleic acid sequence (1-letter codes). ### 3.4. Small Molecule / Ligand Chains Ligand chains can be specified either using SMILES: ``` { "molecule_type": "ligand", "chain_ids": "Z", "smiles": "CC(=O)OC1C[NH+]2CCC1CC2" } ``` or using CCD codes: ``` { "molecule_type": "ligand", "chain_ids": "I", "ccd_codes": "NAG", } ``` - `molecule_type` *(str, required)* - Must be "ligand". - `chain_ids` *(str | list[str], required)* - Identifiers for the ligand chain(s). - `smiles` *(str, required if ccd_codes not given)* - Canonical SMILES string of the ligand. - Mutually exclusive with `ccd_codes`. - `ccd_codes` *(str | list[str], required if smiles not given)* - Three-letter CCD code for the ligand component. - Support for providing a list of CCD codes (for instance for polymeric ligands) will be supported in a later release of the inference pipeline. - Mutually exclusive with `smiles`. ## 4. Example Input Json for a Single Query Complex Below is a complete example of an input JSON file specifying a single bioassembly, consisting of: - Two protein chains (`A` and `B`), with MSAs enabled - One DNA chain (`C`) - One RNA chain (`E`), with MSAs enabled - Two types of non-covalently bound ligands: - A small molecule ligand (`Z`), defined by a SMILES string - A single-residue glycan-like ligand (`I`), specified CCD code `NAG` ```json { "queries": { "query_1": { "chains": [ { "molecule_type": "protein", "chain_ids": "A", "sequence": "PVLSCGEWQCL", "use_msas": true, "use_main_msas": true, "use_paired_msas": true, }, { "molecule_type": "protein", "chain_ids": "B", "sequence": "RPACQLWWSRGNWERINQLWW", "use_msas": true, "use_main_msas": true, "use_paired_msas": true, }, { "molecule_type": "dna", "chain_ids": "C", "sequence": "GACCTCT", }, { "chain_ids": "E", "molecule_type": "rna", "sequence": "AGCU", "use_msas": true, }, { "molecule_type": "ligand", "chain_ids": "Z", "smiles": "CC(=O)OC1C[NH+]2CCC1CC2" }, { "molecule_type": "ligand", "chain_ids": "I", "ccd_codes": ["NAG"], } ], } }, "ccd_file_path": "/path/to/CCD/file.cif" } ``` Additional example input JSON files can be found here: - [Single-chain protein (monomer)](../../examples/example_inference_inputs/query_ubiquitin.json): Ubiquitin (PDB: 1UBQ) - [Multi-chain protein with identical chains (homomer)](../../examples/example_inference_inputs/query_homomer.json): GCN4 leucine zipper (PDB: 2ZTA) - [Multi-chain protein with different chains (multimer)](../../examples/example_inference_inputs/query_multimer.json): Deoxy human hemoglobin (PDB: 1A3N) - [Protein-ligand complex](../../examples/example_inference_inputs/query_protein_ligand.json): Mcl-1 with small molecule inhibitor (PDB: 5FDR) - [Sigle protein-single ligand complex](../../examples/example_inference_inputs/query_single_protein_single_ligand.json): T4 Lysozyme (L99A mutant) with toluene (PDB: 7L39) - [Multiple Protein-ligand complexes](../../examples/example_inference_inputs/query_protein_ligand_multiple.json): Two queries with Mcl-1 and different small molecule inhibitors (PDB: 5FDR)