# AlphaFold3 Batch Predictions with `afusion.api`

## Introduction

This tutorial shows how to use the `afusion.api` module to create batch tasks from a pandas DataFrame and run batch predictions with AlphaFold 3. It walks through the whole process, from preparing the data to interpreting the results, using a minimal example with protein sequences.

**Prerequisites**:

- Python 3.10 or higher
- AlphaFold 3 installed and configured
- The `afusion` package installed
- Basic knowledge of Python and pandas

---

## 1. Import Necessary Modules

Begin by importing the required modules:

```python
import pandas as pd
from afusion.api import create_tasks_from_dataframe, run_batch_predictions
```

- **`pandas`**: Used for data manipulation and analysis.
- **`afusion.api`**: Contains the functions we'll use to create tasks and run predictions.

---

## 2. Prepare the Data

We'll create a pandas DataFrame containing the information needed for our batch predictions. In this example, we define two jobs that reuse the same short protein sequence: `TestJob1` with two entities and `TestJob2` with one.

```python
# Define the DataFrame for the test
data = {
    'job_name': ['TestJob1', 'TestJob1', 'TestJob2'],
    'type': ['protein', 'protein', 'protein'],
    'id': ['A', 'B', 'A'],  # Entity IDs
    'sequence': ['MVLSPADKTNVKAAW', 'MVLSPADKTNVKAAW', 'MVLSPADKTNVKAAW'],  # Protein sequences
    # No modifications or optional parameters in this example
}

df = pd.DataFrame(data)
```

Let's display the DataFrame to verify its contents:

```python
df
```

**Output**:

|    | job_name | type    | id   | sequence        |
|----|----------|---------|------|-----------------|
| 0  | TestJob1 | protein | A    | MVLSPADKTNVKAAW |
| 1  | TestJob1 | protein | B    | MVLSPADKTNVKAAW |
| 2  | TestJob2 | protein | A    | MVLSPADKTNVKAAW |

**Explanation**:

- **`job_name`**: Specifies the name of the job. `TestJob1` has two entities, while `TestJob2` has one.
- **`type`**: Indicates the entity type; all are proteins in this case.
- **`id`**: Identifier for each entity. IDs distinguish entities within a job, so `A` can appear in both `TestJob1` and `TestJob2`.
- **`sequence`**: The protein sequences to be predicted.
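
Before creating tasks, it can help to sanity-check the rows. The sketch below is a hypothetical helper, not part of `afusion`. It works on plain dictionaries (for example, the output of `df.to_dict("records")`) and assumes that an entity ID may appear only once per job and that protein sequences use the 20 standard one-letter amino acid codes.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def find_row_problems(rows):
    """Return a list of human-readable problems found in the input rows.

    Hypothetical helper (not part of afusion). Each row is a dict with the
    keys 'job_name', 'type', 'id', and 'sequence'.
    """
    problems = []

    # Assumption: an entity ID may appear only once per job.
    id_counts = Counter((row["job_name"], row["id"]) for row in rows)
    for (job_name, entity_id), count in id_counts.items():
        if count > 1:
            problems.append(f"{job_name}: entity ID {entity_id!r} appears {count} times")

    # Assumption: protein sequences contain only standard residues.
    for row in rows:
        if row["type"] != "protein":
            continue
        unknown = sorted(set(row["sequence"].upper()) - STANDARD_AMINO_ACIDS)
        if unknown:
            problems.append(f"{row['job_name']}/{row['id']}: unexpected residues {unknown}")
    return problems


rows = [
    {"job_name": "TestJob1", "type": "protein", "id": "A", "sequence": "MVLSPADKTNVKAAW"},
    {"job_name": "TestJob1", "type": "protein", "id": "B", "sequence": "MVLSPADKTNVKAAW"},
    {"job_name": "TestJob2", "type": "protein", "id": "A", "sequence": "MVLSPADKTNVKAAW"},
]
print(find_row_problems(rows))  # an empty list means no problems were found
```

An empty list means the rows passed both checks; any other result lists the offending job and entity.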

---

## 3. Create Tasks from the DataFrame

We will use the `create_tasks_from_dataframe` function to convert our DataFrame into a list of task dictionaries suitable for AlphaFold predictions.

```python
# Create tasks from DataFrame
tasks = create_tasks_from_dataframe(df)
```

Judging by the output below, the function groups the rows by `job_name` and builds one task per job, with one entry in `sequences` for each row.
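
To make the row-to-task mapping concrete, here is a simplified sketch of such a grouping step. It is illustrative only: `group_rows_into_tasks` is a hypothetical name, the code is not the actual `afusion` implementation, and it only reproduces the protein-only structure used in this tutorial.

```python
def group_rows_into_tasks(rows, default_seeds=(1,)):
    """Group flat rows into one task dict per job name.

    Illustrative only: this mirrors the task structure used in this tutorial
    and is not the actual afusion implementation.
    """
    tasks_by_name = {}
    for row in rows:
        # dicts preserve insertion order, so jobs keep their first-seen order.
        task = tasks_by_name.setdefault(
            row["job_name"],
            {
                "name": row["job_name"],
                "modelSeeds": list(default_seeds),
                "sequences": [],
                "dialect": "alphafold3",
                "version": 1,
            },
        )
        # Each row becomes one entity, keyed by its type (here always 'protein').
        task["sequences"].append(
            {
                row["type"]: {
                    "sequence": row["sequence"],
                    "unpairedMsa": None,
                    "pairedMsa": None,
                    "templates": [],
                    "id": row["id"],
                }
            }
        )
    return list(tasks_by_name.values())


example_rows = [
    {"job_name": "TestJob1", "type": "protein", "id": "A", "sequence": "MVLSPADKTNVKAAW"},
    {"job_name": "TestJob1", "type": "protein", "id": "B", "sequence": "MVLSPADKTNVKAAW"},
    {"job_name": "TestJob2", "type": "protein", "id": "A", "sequence": "MVLSPADKTNVKAAW"},
]
example_tasks = group_rows_into_tasks(example_rows)
for task in example_tasks:
    print(task["name"], len(task["sequences"]))
```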

Let's inspect the tasks generated:

```python
tasks
```

**Output**:

```python
[
    {
        'name': 'TestJob1',
        'modelSeeds': [1],
        'sequences': [
            {
                'protein': {
                    'sequence': 'MVLSPADKTNVKAAW',
                    'unpairedMsa': None,
                    'pairedMsa': None,
                    'templates': [],
                    'id': 'A'
                }
            },
            {
                'protein': {
                    'sequence': 'MVLSPADKTNVKAAW',
                    'unpairedMsa': None,
                    'pairedMsa': None,
                    'templates': [],
                    'id': 'B'
                }
            }
        ],
        'dialect': 'alphafold3',
        'version': 1
    },
    {
        'name': 'TestJob2',
        'modelSeeds': [1],
        'sequences': [
            {
                'protein': {
                    'sequence': 'MVLSPADKTNVKAAW',
                    'unpairedMsa': None,
                    'pairedMsa': None,
                    'templates': [],
                    'id': 'A'
                }
            }
        ],
        'dialect': 'alphafold3',
        'version': 1
    }
]
```

**Explanation**:

- **`name`**: Corresponds to `job_name` in the DataFrame.
- **`modelSeeds`**: List of model seeds; defaults to `[1]` if not specified.
- **`sequences`**: Contains the sequences and related data for each entity.
  - **`protein`**: Dictionary with the protein sequence and optional parameters.
    - **`sequence`**: The protein sequence.
    - **`unpairedMsa`, `pairedMsa`, `templates`**: `unpairedMsa` and `pairedMsa` are `None` and `templates` is an empty list because we didn't specify them.
    - **`id`**: Entity ID.
- **`dialect`**: The dialect of the input JSON format (not the AlphaFold release); set to `'alphafold3'`.
- **`version`**: Version of the input JSON format; set to `1`.
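
The key names above match those of the AlphaFold 3 input JSON format, so a task can be serialized with the standard `json` module for inspection. This is a minimal sketch: `task_to_json` is a hypothetical helper, and whether `afusion` writes its input files in exactly this way is an assumption.

```python
import json


def task_to_json(task):
    """Serialize one task dict to a JSON string (hypothetical helper)."""
    # Python None becomes JSON null; key order is preserved.
    return json.dumps(task, indent=2)


example_task = {
    "name": "TestJob2",
    "modelSeeds": [1],
    "sequences": [
        {
            "protein": {
                "sequence": "MVLSPADKTNVKAAW",
                "unpairedMsa": None,
                "pairedMsa": None,
                "templates": [],
                "id": "A",
            }
        }
    ],
    "dialect": "alphafold3",
    "version": 1,
}
text = task_to_json(example_task)
print(text)

# Round-tripping through JSON gives back an equal dictionary.
assert json.loads(text) == example_task
```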

---

## 4. Configure Paths

Before running predictions, specify the input and output directories, as well as the locations of the model parameters and databases.

```python
# Path configurations (update these paths as needed)
af_input_base_path = "/path/to/af_input"
af_output_base_path = "/path/to/af_output"
model_parameters_dir = "/path/to/model_parameters"  # Update with the correct path
databases_dir = "/path/to/databases"                # Update with the correct path
```

**Note**: Replace `"/path/to/..."` with the actual paths on your system.

Ensure that the directories exist:

```python
import os

os.makedirs(af_input_base_path, exist_ok=True)
os.makedirs(af_output_base_path, exist_ok=True)
```
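
The input and output directories are created above, but the model parameters and databases must already exist. A quick check before a long run can catch a mistyped path early. `missing_directories` below is a hypothetical helper, not part of `afusion`.

```python
from pathlib import Path


def missing_directories(paths):
    """Return the entries of `paths` (name -> path) that are not existing directories."""
    return {name: path for name, path in paths.items() if not Path(path).is_dir()}


# Placeholder paths from this tutorial; replace them with real locations.
required = {
    "model_parameters_dir": "/path/to/model_parameters",
    "databases_dir": "/path/to/databases",
}
for name, path in missing_directories(required).items():
    print(f"Missing {name}: {path}")
```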

---

## 5. Run Batch Predictions

Now, run the batch predictions using the `run_batch_predictions` function.

```python
# Run batch predictions
results = run_batch_predictions(
    tasks=tasks,
    af_input_base_path=af_input_base_path,
    af_output_base_path=af_output_base_path,
    model_parameters_dir=model_parameters_dir,
    databases_dir=databases_dir,
    run_data_pipeline=False,  # Set to True to run the data pipeline
    run_inference=False,      # Set to True to run inference (requires GPU)
)
```

**Parameters**:

- **`tasks`**: The list of tasks generated from the DataFrame.
- **`af_input_base_path`**: Base directory for AlphaFold input files.
- **`af_output_base_path`**: Base directory for AlphaFold output files.
- **`model_parameters_dir`**: Directory containing the model parameters.
- **`databases_dir`**: Directory containing the required databases.
- **`run_data_pipeline`**: Whether to run the data pipeline (`True` or `False`).
- **`run_inference`**: Whether to run inference (`True` or `False`); set it to `True` only on a machine with a suitable GPU.

**Important**: For this minimal test, we set `run_data_pipeline` and `run_inference` to `False`. In a real prediction scenario, you should set these to `True`.

---

## 6. Interpret the Results

After running the predictions, process and print the results:

```python
# Process and print results
for result in results:
    print(f"Job Name: {result['job_name']}")
    print(f"Output Folder: {result['output_folder']}")
    print(f"Status: {result['status']}")
    print("---")
```

**Sample Output**:

```
Job Name: TestJob1
Output Folder: /path/to/af_output
Status: Success
---
Job Name: TestJob2
Output Folder: /path/to/af_output
Status: Success
---
```

**Explanation**:

- **`job_name`**: The name of the job.
- **`output_folder`**: The directory where the output files are stored.
- **`status`**: The status of the prediction (`'Success'` or an error message).
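
For larger batches, it is often easier to separate successful jobs from failed ones than to read every line. The sketch below assumes the result format described above, where `status` equals `'Success'` or holds an error message. `split_by_status` is a hypothetical helper, and the error text in the example is made up.

```python
def split_by_status(results):
    """Split result dicts into (succeeded, failed) lists.

    Assumes each result has a 'status' key that equals 'Success' on success
    and holds an error message otherwise.
    """
    succeeded = [r for r in results if r["status"] == "Success"]
    failed = [r for r in results if r["status"] != "Success"]
    return succeeded, failed


# Illustrative results; the second status is a made-up error message.
example_results = [
    {"job_name": "TestJob1", "output_folder": "/path/to/af_output", "status": "Success"},
    {"job_name": "TestJob2", "output_folder": "/path/to/af_output", "status": "Example error"},
]
succeeded, failed = split_by_status(example_results)
print(f"{len(succeeded)} succeeded, {len(failed)} failed")
for r in failed:
    print(f"{r['job_name']}: {r['status']}")
```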

---

## 7. Notes and Considerations

- **System Configuration**: This example was run on a system with the following configuration:
  - CPU: Xeon Silver 4410T
  - GPU: NVIDIA RTX 4090
  - Operating System: Ubuntu Server 24.04 LTS

  **Note**: Your hardware configuration can significantly affect the performance and runtime of AlphaFold predictions.

- **GPU Requirement**: AlphaFold predictions require substantial computational resources and are optimized for GPUs. Ensure that your system meets the hardware requirements, especially regarding GPU capabilities, if you plan to run actual predictions.

- **Database Availability**: The databases specified in `databases_dir` are essential for AlphaFold's operation. Make sure that you have downloaded and correctly configured all necessary databases.

- **Model Parameters**: The `model_parameters_dir` should contain the AlphaFold model parameters. Ensure that you have obtained these parameters and placed them in the specified directory.

- **Adjusting Parameters**: Depending on your use case, you may need to adjust additional parameters such as `model_seeds`, `msa_option`, and others. Refer to the `afusion.api` documentation for more details.

- **Error Handling**: The `run_batch_predictions` function returns a list of results with status messages. If any job fails, the status will contain the error message, which can be used for debugging.
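
Building on the error-handling note, one way to act on failed jobs is to select their tasks for another run. This sketch assumes that each result's `job_name` matches the `name` of the task it came from. `tasks_to_retry` is a hypothetical helper, not part of `afusion`.

```python
def tasks_to_retry(tasks, results):
    """Return the tasks whose job did not finish with status 'Success'."""
    failed_names = {r["job_name"] for r in results if r["status"] != "Success"}
    return [task for task in tasks if task["name"] in failed_names]


# Other task keys are omitted for brevity; the error message is made up.
demo_tasks = [{"name": "TestJob1"}, {"name": "TestJob2"}]
demo_results = [
    {"job_name": "TestJob1", "status": "Success"},
    {"job_name": "TestJob2", "status": "Example error"},
]
retry = tasks_to_retry(demo_tasks, demo_results)
print([task["name"] for task in retry])  # ['TestJob2']
```

The returned list has the same shape as `tasks`, so it could be passed to `run_batch_predictions` again once the cause of the failure is fixed.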

---

## Conclusion

In this tutorial, we used the `afusion.api` module to create batch tasks from a pandas DataFrame and run batch predictions with AlphaFold 3. The same steps apply to larger batches: describe each job as rows in the DataFrame, create the tasks, and run them in a single call.

---

## Additional Resources

- **AlphaFold 3 GitHub Repository**: [https://github.com/google-deepmind/alphafold3](https://github.com/google-deepmind/alphafold3)

---

**Disclaimer**: Ensure that you comply with all licensing and usage terms when using AlphaFold and associated data.