# Running OpenFold3 Inference with Templates

This document explains how to use template information for OF3 predictions. It assumes that you have already generated all of your template alignments or intend to fetch them from Colabfold on the fly. If you do not have any precomputed template alignments and do not want to use Colabfold, refer to our {doc}`MSA Generation Guide ` before consulting this document.

If you need further clarification on how the template components of our inference pipeline work, refer to {doc}`this explanatory document `.

The template pipeline currently supports monomeric templates and has been tested for protein chains only.

The main steps detailed in this guide are:
1. {ref}`Providing files for template featurization <1-template-files>`
2. {ref}`Adding template information to the inference query json <2-specifying-template-information-in-the-inference-query-file>`
3. {ref}`High-throughput workflow support <3-optimizations-for-high-throughput-workflows>`

(1-template-files)=
## 1. Template Files

Template featurization requires query-to-template **alignments** and template **structures**.

(11-template-aligment-file-format)=
### 1.1. Template Alignment File Format

Template alignments can be provided in `sto`, `a3m` or `m8` format. Template alignments from the Colabfold server are in `m8` format.

#### 1.1.1. STO

Files in `sto` format need to contain the fields provided by default by hmmer alignment tools (hmmsearch, hmmalign). These are:
1. metadata headers: `#=GS <entry ID>_<chain ID>/<start>-<end> mol:<molecule type>`
    - `#=GS`: indicates header info
    - `<entry ID>_<chain ID>`: entry identifier indicating which structure file to parse (usually the PDB entry ID) and chain identifier indicating which chain in this complex is to be used as the template chain
    - `<start>-<end>`: start and end residue indices (1-indexed) indicating the position of the aligned template sequence with respect to the full template sequence
    - `mol:<molecule type>`: type of the template molecule; currently only *protein* is supported
2. alignment rows: `<entry ID>_<chain ID> ALIGNED-SEQUENCE`
    - `<entry ID>_<chain ID>`: matches the alignment row to its header; may contain `/start-end` positions, but these are not used
    - `ALIGNED-SEQUENCE`: the actual sequence alignment; may be split across multiple rows

```
# STOCKHOLM 1.0
#=GS entry1_A/1-100 mol:protein
#=GS entry2_B/50-150 mol:protein
entry1_A MKLLVVDDA--GQKFT
entry2_B MK--VVDDARGQGKFT
//
```

Note that the `sto` parser attempts to derive the query-to-template residue correspondences from the existing alignment. If this is not possible, we realign the template sequences to the provided query sequence using Kalign. More on this in the [template processing explanatory document](template_explanation.md).

#### 1.1.2. A3M

Files in `a3m` format require the standard fasta format with optional start/end positions:
1. headers: `><entry ID>_<chain ID>/<start>-<end>`
    - `<entry ID>_<chain ID>`: entry identifier indicating which structure file to parse (usually the PDB entry ID) and chain identifier indicating which chain in this complex is to be used as the template chain
    - `<start>-<end>`: *optional*, start and end residue indices (1-indexed) indicating the position of the aligned template sequence with respect to the full template sequence
2. alignment rows: `ALIGNED-SEQUENCE`
    - `ALIGNED-SEQUENCE`: the actual sequence; it needs to be aligned if the header contains start-end positions, otherwise it is the unaligned sequence

```
>entry1_A/1-100
MKLLVVDDA--GQGKFT
>entry2_B/50-150
MK--VVDDAaRGQGKFT
```

Note that the `a3m` parser attempts to derive the query-to-template residue correspondences from the existing alignment. If this is not possible, we realign the template sequences to the provided query sequence using Kalign. More on this in the [template processing explanatory document](template_explanation.md).

#### 1.1.3. M8

Files in `m8` format expect the standard BLAST tabular output format with 12 tab-separated columns. We only use columns 1 (`<entry ID>_<chain ID>`), 3 (sequence identity of the template to the query) and 11 (e-value).
For all columns, see https://linsalrob.github.io/ComputationalGenomicsManual/SequenceFileFormats/.

```
query_A template_B 85.7 14 2 0 1 14 50 63 1e-05 28.1
query_A template_C 71.4 14 4 0 5 18 75 88 2e-03 22.3
```

Note that since `m8` files do not provide actual alignments, we use them only to identify which structure files to take templates from. We then retrieve the sequences from these structure files and always realign them to the query sequence using Kalign. More on this in the [template processing explanatory document](template_explanation.md).

### 1.2. Template Structure File Format

Template structures can currently only be provided in `cif` format. An upcoming release will add support for parsing templates from `pdb` files.

(2-specifying-template-information-in-the-inference-query-file)=
## 2. Specifying Template Information in the Inference Query File

### 2.1. Specifying Alignments

The data pipeline needs to know which template alignment to use for which chain. This information is provided by specifying the {ref}`paths to the alignments <31-protein-chains>` in each chain's `template_alignment_file_path` field in the inference query json file. Note that when fetching alignments from the Colabfold server, the `template_alignment_file_path` fields are populated automatically.

Template alignment file path example:

```
{
    "queries": {
        "example_query": {
            "chains": [
                {
                    "molecule_type": "protein",
                    "chain_ids": "A",
                    "sequence": "GCTLSAEDKAAVERSKMIDRNLREDGEKAAREVKLLLLGAGESGKSTIVKQMKIIHEAGYSEEECKQYKAVVYSNTIQSIIAIIRAMGRLKIDFGDAARADDARQLFVLAGAAEEGFMTAELAGVIKRLWKDSGVQACFNRSREYQLNDSAAYYLNDLDRIAQPNYIPTQQDVLRTRVKTTGIVETHFTFKDLHFKMFDVGAQRSERKKWIHCFEGVTAIIFCVALSDYDLVLAEDEEMNRMHESMKLFDSICNNKWFTDTSIILFLNKKDLFEEKIKKSPLTICYPEYAGSNTYEEAAAYIQCQFEDLNKRKDTKEIYTHFTCATDTKNVQFVFDAVTDVIIKNNLKDCGLF",
                    "template_alignment_file_path": "example_chain_A.sto"
                },
                {
                    "molecule_type": "protein",
                    "chain_ids": "B",
                    "sequence": "MSELDQLRQEAEQLKNQIRDARKACADATLSQITNNIDPVGRIQMRTRRTLRGHLAKIYAMHWGTDSRLLVSASQDGKLIIWDSYTTNKVHAIPLRSSWVMTCAYAPSGNYVACGGLDNICSIYNLKTREGNVRVSRELAGHTGYLSCCRFLDDNQIVTSSGDTTCALWDIETGQQTTTFTGHTGDVMSLSLAPDTRLFVSGACDASAKLWDVREGMCRQTFTGHESDINAICFFPNGNAFATGSDDATCRLFDLRADQELMTYSHDNIICGITSVSFSKSGRLLLAGYDDFNCNVWDALKADRAGVLAGHDNRVSCLGVTDDGMAVATGSWDSFLKIWN",
                    "template_alignment_file_path": "example_chain_B.sto"
                },
                {
                    "molecule_type": "protein",
                    "chain_ids": "C",
                    "sequence": "MASNNTASIAQARKLVEQLKMEANIDRIKVSKAAADLMAYCEAHAKEDPLLTPVPASENPFREKKFFSAIL",
                    "template_alignment_file_path": "example_chain_C.sto"
                }
            ]
        }
    }
}
```

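
Query files like the one above can also be generated programmatically, which avoids hand-editing mistakes such as trailing commas (standard JSON parsers reject them). The following is a minimal sketch under stated assumptions, not part of the OF3 codebase: the helper names `make_protein_chain` and `build_query_json` are hypothetical, and the check assumes that the alignment format can be recognized from the file extension, which may not match how the pipeline itself detects formats.

```python
import json
from pathlib import Path

# Alignment formats listed in section 1.1 of this guide.
# Assumption: the format is identifiable from the file extension.
SUPPORTED_TEMPLATE_FORMATS = {".sto", ".a3m", ".m8"}


def make_protein_chain(chain_id, sequence, alignment_path):
    """Return one chain entry (hypothetical helper, not an OF3 API)."""
    suffix = Path(alignment_path).suffix.lower()
    if suffix not in SUPPORTED_TEMPLATE_FORMATS:
        raise ValueError(
            f"unsupported template alignment format: {alignment_path}"
        )
    return {
        "molecule_type": "protein",
        "chain_ids": chain_id,
        "sequence": sequence,
        "template_alignment_file_path": str(alignment_path),
    }


def build_query_json(query_name, chains):
    """Serialize a single-query file with the layout shown in this guide."""
    return json.dumps({"queries": {query_name: {"chains": chains}}}, indent=4)


chains = [
    make_protein_chain("A", "EXAMPLEPROTEINSEQUENCE", "example_chain_A.sto"),
    make_protein_chain("B", "EXAMPLEPROTEINSEQUENCE", "example_chain_B.sto"),
]
query_json = build_query_json("example_query", chains)
print(query_json)
```

The sketch only validates the file extension; it does not check that the alignment files exist or that their contents are well-formed.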
### 2.2. Using Specific Templates

By default, the template pipeline automatically populates the `template_entry_chain_ids` field with [n templates](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/core/data/pipelines/preprocessing/template.py#L1535) from the alignment, which is then further subset to the [top k templates](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/projects/of3_all_atom/config/dataset_config_components.py#L116) during featurization for inference.

In an **upcoming release**, we will add support for specifying *specific templates* for the data pipeline to use for featurization. This will be possible through the `template_entry_chain_ids` field:

```
{
    "queries": {
        "example_query": {
            "chains": [
                {
                    "molecule_type": "protein",
                    "chain_ids": "A",
                    "sequence": "EXAMPLEPROTEINSEQUENCE",
                    "template_alignment_file_path": "example_chain_A.sto",
                    "template_entry_chain_ids": ["entry1_A", "entry2_B", "entry3_A"]
                }
            ]
        }
    }
}
```

Note that the corresponding template IDs need to be present in the provided raw alignment file. Here, the IDs `"entry1_A"`, `"entry2_B"`, `"entry3_A"` and the corresponding alignments need to be present in `example_chain_A.sto` like so:

```
# STOCKHOLM 1.0
#=GS entry1_A/1-100 mol:protein
#=GS entry2_B/50-150 mol:protein
entry1_A MKLLVVDDA--GQKFT
entry2_B MK--VVDDARGQGKFT
entry3_A MK----DDARGQGKFT
//
```

(3-optimizations-for-high-throughput-workflows)=
## 3. Optimizations for High-Throughput Workflows

For high-throughput use cases, where a large number of structures are to be predicted, template processing can take a significant amount of time even with the built-in {doc}`deduplication utility ` we have for template alignment and structure processing. To avoid spending GPU compute on data transformations, we provide separate template preprocessing scripts. These generate the inputs from which template featurization can run efficiently in a subsequent job without becoming a bottleneck to the model forward pass.

### 3.1. Template Alignment Preprocessing

A recommended workflow for providing template data for very large datasets is the following:
1. Compute {ref}`template alignments <311-precomputed-template-alignments>`.
2. [Download the PDB](https://github.com/aqlaboratory/openfold-3/blob/main/scripts/snakemake_msa/download_of3_databases.py) or another template structure dataset locally.
3. Precompute the {ref}`template precache <312-template-precache>` from template structures to speed up template cache precomputation.
4. Precompute the {ref}`template cache <313-template-cache>` from template alignments and the template precache.
5. Preparse the {ref}`template structures <32-template-structure-preprocessing>` into template structure arrays.

This workflow produces a set of *template cache entries* and *preparsed template structures* for on-the-fly data processing that happens concurrently with the model forward pass. Each of these steps is detailed below.

(311-precomputed-template-alignments)=
#### 3.1.1. Precomputed Template Alignments

Our template processing pipeline accepts MSAs generated from our {doc}`OF3-style MSA pipeline ` or from other workflows as long as they are in one of the {ref}`expected formats <11-template-aligment-file-format>`.

(312-template-precache)=
#### 3.1.2. Template Precache

We found that preprocessing template alignments for large datasets can take a long time, partly because template structures must be parsed so that they can be matched to the template alignment sequences.
We provide a [preprocessing script](https://github.com/aqlaboratory/openfold-3/blob/main/scripts/data_preprocessing/preprocess_template_alignments_precache_of3.py) that compresses template structure files into metadata files, which we call *template precache entries*. Each entry contains the release date and a mapping from chain `asym_id` identifiers to their canonical sequences as denoted in the structure file:

```python
{
    'release_date': <release date>,
    'chain_id_seq_map': {
        '<asym_id>': '<canonical sequence>',
        '<asym_id>': '<canonical sequence>',
        <...>
    },
}
```

You can run this script using:

```
python preprocess_template_alignments_precache_of3.py \
    --runner_yaml <path to runner.yml>
```

with a `runner.yml` like this:

```
template_preprocessor_settings:
    n_processes: 4
    chunksize: 1
    structure_directory: <path to template structure directory>
    structure_file_format: "cif"
    precache_directory: <path to precache output directory>
```

Using these files instead of the raw structure files during template cache creation drastically speeds up processing. For the full PDB, we observed a reduction of template processing runtimes from *120 hours without a template precache* to *3 hours with a template precache* (including precache computation time) when running template cache creation on 250 parallel processes.

(313-template-cache)=
#### 3.1.3. Template Cache

Under the hood, the OF3 inference pipeline uses a preprocessed version of the template alignments during online data processing, which we call the *template cache*. In short, each unique sequence in the inference query set gets its own template cache entry, and each of these cache entries contains processed and validated template alignment data:

```python
{ '