AlphaFold 3 Batch Predictions with afusion.api
Introduction
This tutorial shows how to use the afusion.api module to create batch tasks from a pandas DataFrame and run batch predictions with AlphaFold 3. It walks through the whole process, from preparing the data to interpreting the results, using a minimal example with protein sequences.
Prerequisites:
Python 3.10 or higher
AlphaFold 3 installed and configured
The afusion package installed
Basic knowledge of Python and pandas
1. Import Necessary Modules
Begin by importing the required modules:
import pandas as pd
from afusion.api import create_tasks_from_dataframe, run_batch_predictions
pandas: Used for data manipulation and analysis.
afusion.api: Contains the functions we’ll use to create tasks and run predictions.
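2. Prepare the Data
We’ll create a pandas DataFrame containing the information needed for our batch predictions. Each row describes one entity, and rows that share a job_name belong to the same prediction job. In this example, two jobs use the same short protein sequence: TestJob1 contains two copies (entities A and B) and TestJob2 contains one.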
# Define the DataFrame for the test
data = {
    'job_name': ['TestJob1', 'TestJob1', 'TestJob2'],
    'type': ['protein', 'protein', 'protein'],
    'id': ['A', 'B', 'A'],  # Entity IDs
    'sequence': ['MVLSPADKTNVKAAW', 'MVLSPADKTNVKAAW', 'MVLSPADKTNVKAAW'],  # Protein sequences
    # No modifications or optional parameters in this example
}
df = pd.DataFrame(data)
Let’s display the DataFrame to verify its contents:
df
Output:
|   | job_name | type | id | sequence |
|---|---|---|---|---|
| 0 | TestJob1 | protein | A | MVLSPADKTNVKAAW |
| 1 | TestJob1 | protein | B | MVLSPADKTNVKAAW |
| 2 | TestJob2 | protein | A | MVLSPADKTNVKAAW |
Explanation:
job_name: Specifies the name of the job. TestJob1 has two entities, while TestJob2 has one.
type: Indicates the entity type; all are proteins in this case.
id: Identifier for each entity.
sequence: The protein sequences to be predicted.
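Before creating tasks, it can help to check the DataFrame for common mistakes. The sketch below uses plain pandas and does not call afusion. The helper name check_batch_dataframe and the rules it enforces (the four columns used above, unique entity IDs within a job, and sequences made of the 20 standard amino-acid letters) are assumptions for illustration, not requirements documented by afusion.api.

```python
import pandas as pd

# Assumed rules for this sketch; adjust them to match your own data.
REQUIRED_COLUMNS = ["job_name", "type", "id", "sequence"]
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def check_batch_dataframe(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        # Without these columns the remaining checks cannot run.
        return [f"missing columns: {missing}"]

    problems = []

    # Entity IDs should not repeat within one job.
    duplicated = df[df.duplicated(subset=["job_name", "id"], keep=False)]
    pairs = duplicated[["job_name", "id"]].drop_duplicates()
    for job_name, entity_id in pairs.itertuples(index=False):
        problems.append(f"{job_name}: entity id {entity_id!r} appears more than once")

    # Protein sequences should use only the 20 standard one-letter codes.
    proteins = df[df["type"] == "protein"]
    for row in proteins.itertuples(index=False):
        unexpected = set(row.sequence.upper()) - STANDARD_AMINO_ACIDS
        if unexpected:
            problems.append(f"{row.job_name}/{row.id}: unexpected residues {sorted(unexpected)}")

    return problems


df = pd.DataFrame({
    "job_name": ["TestJob1", "TestJob1", "TestJob2"],
    "type": ["protein", "protein", "protein"],
    "id": ["A", "B", "A"],
    "sequence": ["MVLSPADKTNVKAAW"] * 3,
})
print(check_batch_dataframe(df))  # [] for the example data
```

Returning a list of messages rather than raising on the first problem makes it easier to fix a large batch in one pass.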
3. Create Tasks from the DataFrame
We will use the create_tasks_from_dataframe function to convert our DataFrame into a list of task dictionaries suitable for AlphaFold predictions.
# Create tasks from DataFrame
tasks = create_tasks_from_dataframe(df)
This function processes the DataFrame and creates tasks based on the parameters provided.
Let’s inspect the tasks generated:
tasks
Output:
[
    {
        'name': 'TestJob1',
        'modelSeeds': [1],
        'sequences': [
            {
                'protein': {
                    'sequence': 'MVLSPADKTNVKAAW',
                    'unpairedMsa': None,
                    'pairedMsa': None,
                    'templates': [],
                    'id': 'A'
                }
            },
            {
                'protein': {
                    'sequence': 'MVLSPADKTNVKAAW',
                    'unpairedMsa': None,
                    'pairedMsa': None,
                    'templates': [],
                    'id': 'B'
                }
            }
        ],
        'dialect': 'alphafold3',
        'version': 1
    },
    {
        'name': 'TestJob2',
        'modelSeeds': [1],
        'sequences': [
            {
                'protein': {
                    'sequence': 'MVLSPADKTNVKAAW',
                    'unpairedMsa': None,
                    'pairedMsa': None,
                    'templates': [],
                    'id': 'A'
                }
            }
        ],
        'dialect': 'alphafold3',
        'version': 1
    }
]
Explanation:
name: Corresponds to job_name in the DataFrame.
modelSeeds: List of model seeds; defaults to [1] if not specified.
sequences: Contains the sequences and related data for each entity.
    protein: Dictionary with the protein sequence and optional parameters.
        sequence: The protein sequence.
        unpairedMsa, pairedMsa, templates: Set to None or an empty list because we didn’t specify them.
        id: Entity ID.
dialect: Specifies the input format dialect; set to 'alphafold3'.
version: Version number of the input format; set to 1.
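To make the mapping from rows to tasks concrete, the sketch below rebuilds the same structure by hand with a pandas groupby. It is an illustration only and not the implementation of create_tasks_from_dataframe: the function name sketch_tasks_from_dataframe is hypothetical, it handles only protein rows, and it hard-codes the defaults seen in the output above.

```python
import pandas as pd


def sketch_tasks_from_dataframe(df: pd.DataFrame) -> list[dict]:
    """Illustrative only: rebuild the task layout shown above for protein rows."""
    tasks = []
    # sort=False keeps jobs in the order they first appear in the DataFrame.
    for job_name, rows in df.groupby("job_name", sort=False):
        sequences = []
        for row in rows.itertuples(index=False):
            if row.type != "protein":
                raise ValueError(f"this sketch only handles proteins, got {row.type!r}")
            sequences.append({
                "protein": {
                    "sequence": row.sequence,
                    "unpairedMsa": None,  # nothing supplied in this example
                    "pairedMsa": None,
                    "templates": [],
                    "id": row.id,
                }
            })
        tasks.append({
            "name": job_name,
            "modelSeeds": [1],  # default seen in the output above
            "sequences": sequences,
            "dialect": "alphafold3",
            "version": 1,
        })
    return tasks


df = pd.DataFrame({
    "job_name": ["TestJob1", "TestJob1", "TestJob2"],
    "type": ["protein", "protein", "protein"],
    "id": ["A", "B", "A"],
    "sequence": ["MVLSPADKTNVKAAW"] * 3,
})
sketch_tasks = sketch_tasks_from_dataframe(df)
print([task["name"] for task in sketch_tasks])  # ['TestJob1', 'TestJob2']
```

The key point is that rows sharing a job_name collapse into one task, with one entry in sequences per row.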
4. Configure Paths
Before running predictions, specify the input and output directories, as well as the locations of the model parameters and databases.
# Path configurations (update these paths as needed)
af_input_base_path = "/path/to/af_input"
af_output_base_path = "/path/to/af_output"
model_parameters_dir = "/path/to/model_parameters" # Update with the correct path
databases_dir = "/path/to/databases" # Update with the correct path
Note: Replace "/path/to/..." with the actual paths on your system.
Create the input and output directories if they do not already exist:
import os
os.makedirs(af_input_base_path, exist_ok=True)
os.makedirs(af_output_base_path, exist_ok=True)
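The input and output directories can be created on demand, but the model parameters and databases have to be in place beforehand. A small pre-flight check, sketched below, can catch a placeholder path that was never updated. The helper name missing_directories is hypothetical and is not part of afusion.

```python
from pathlib import Path


def missing_directories(paths: dict[str, str]) -> list[str]:
    """Return the labels of entries that do not point to an existing directory."""
    return [label for label, path in paths.items() if not Path(path).is_dir()]


# Placeholder paths from the configuration step; replace them with real ones.
required = {
    "model_parameters_dir": "/path/to/model_parameters",
    "databases_dir": "/path/to/databases",
}

missing = missing_directories(required)
if missing:
    print(f"Update these paths before running predictions: {missing}")
```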
5. Run Batch Predictions
Now, run the batch predictions using the run_batch_predictions function.
# Run batch predictions
results = run_batch_predictions(
    tasks=tasks,
    af_input_base_path=af_input_base_path,
    af_output_base_path=af_output_base_path,
    model_parameters_dir=model_parameters_dir,
    databases_dir=databases_dir,
    run_data_pipeline=False,  # Set to True to run the data pipeline
    run_inference=False,  # Set to True to run inference (requires GPU)
)
Parameters:
tasks: The list of tasks generated from the DataFrame.
af_input_base_path: Base directory for AlphaFold input files.
af_output_base_path: Base directory for AlphaFold output files.
model_parameters_dir: Directory containing the model parameters.
databases_dir: Directory containing the required databases.
run_data_pipeline: Whether to run the data pipeline (True or False).
run_inference: Whether to run inference (True if you have a GPU).
Important: For this minimal test, we set run_data_pipeline and run_inference to False. In a real prediction scenario, you should set these to True.
6. Interpret the Results
After running the predictions, process and print the results:
# Process and print results
for result in results:
    print(f"Job Name: {result['job_name']}")
    print(f"Output Folder: {result['output_folder']}")
    print(f"Status: {result['status']}")
    print("---")
Sample Output:
Job Name: TestJob1
Output Folder: /path/to/af_output
Status: Success
---
Job Name: TestJob2
Output Folder: /path/to/af_output
Status: Success
---
Explanation:
job_name: The name of the job.
output_folder: The directory where the output files are stored.
status: The status of the prediction ('Success' or an error message).
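With many jobs, printing every result becomes hard to scan. One option is to load the results into a DataFrame and list only the failures, as sketched below. The results list here is a hand-written stand-in that uses the three keys described above, and the error text in it is made up for illustration.

```python
import pandas as pd

# Stand-in for the list returned by run_batch_predictions.
# The second status is an invented error message, not real afusion output.
results = [
    {"job_name": "TestJob1", "output_folder": "/path/to/af_output", "status": "Success"},
    {"job_name": "TestJob2", "output_folder": "/path/to/af_output", "status": "Example error message"},
]

summary = pd.DataFrame(results)
# Any status other than 'Success' is treated as a failure.
failed = summary[summary["status"] != "Success"]

print(f"{len(summary) - len(failed)} of {len(summary)} jobs succeeded")
for row in failed.itertuples(index=False):
    print(f"{row.job_name}: {row.status}")
```

The failed frame can also be used to select the matching tasks by name for a rerun.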
7. Notes and Considerations
System Configuration: This example was run on a system with the following configuration:
CPU: Xeon Silver 4410T
GPU: NVIDIA RTX 4090
Operating System: Ubuntu Server 24.04 LTS
Note: Your hardware configuration can significantly affect the runtime of AlphaFold predictions, so expect different timings on other systems.
GPU Requirement: AlphaFold predictions require substantial computational resources and are optimized for GPUs. Make sure your system meets the hardware requirements before running inference.
Database Availability: The databases specified in databases_dir are essential for AlphaFold’s operation. Make sure that you have downloaded and correctly configured all necessary databases.
Model Parameters: The model_parameters_dir should contain the AlphaFold model parameters. Ensure that you have obtained these parameters and placed them in the specified directory.
Adjusting Parameters: Depending on your use case, you may need to adjust additional parameters such as model_seeds, msa_option, and others. Refer to the afusion.api documentation for more details.
Error Handling: The run_batch_predictions function returns a list of results with status messages. If a job fails, its status contains the error message, which can be used for debugging.
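As one example of adjusting parameters, the task dictionaries shown in step 3 carry their seeds in the modelSeeds list, so a task can be copied with a different set of seeds. This is a minimal sketch that edits the dictionary directly. The helper name with_model_seeds is hypothetical, and whether afusion offers the same control through a DataFrame column such as model_seeds should be checked in its documentation.

```python
import copy


def with_model_seeds(task: dict, seeds: list[int]) -> dict:
    """Return a copy of a task dict with its modelSeeds replaced."""
    if not seeds:
        raise ValueError("at least one seed is required")
    updated = copy.deepcopy(task)  # leave the original task untouched
    updated["modelSeeds"] = list(seeds)
    return updated


# Trimmed-down task with the same top-level keys as the output in step 3.
task = {
    "name": "TestJob2",
    "modelSeeds": [1],
    "sequences": [],
    "dialect": "alphafold3",
    "version": 1,
}

multi_seed_task = with_model_seeds(task, [1, 2, 3])
print(multi_seed_task["modelSeeds"])  # [1, 2, 3]
print(task["modelSeeds"])  # [1]
```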
Conclusion
In this tutorial, we demonstrated how to use the afusion.api module to create batch tasks from a pandas DataFrame and run batch predictions using AlphaFold. By following these steps, you can streamline the process of setting up and executing multiple predictions, making it efficient to handle large-scale protein structure predictions.
Additional Resources
AlphaFold GitHub Repository: https://github.com/deepmind/alphafold
Disclaimer: Ensure that you comply with all licensing and usage terms when using AlphaFold and associated data.