Try on Your Own Data!
An operation guide that is as detailed as possible
You need to prepare a particle stack with pose information (in the form of a starfile and .mrcs files), the corresponding consensus map, and an atomic model built from the consensus map.
Our method requires properly prepared particle stacks that carry pose and CTF information; these should exist in the form of starfile and .mrcs files.
Exploring Datasets from EMPIAR
You could try publicly available datasets from EMPIAR; several of them contain heterogeneity. Opting for preprocessed datasets that provide particle stacks together with consensus poses is advisable. If these aren't available, you'll have to process the data from micrographs yourself.
If you're not accustomed to downloading datasets from EMPIAR, feel free to refer to these guidelines:
Aspera Connect is required for downloading large datasets from EMPIAR. The asperaweb_id_dsa.openssh key file needed for downloading from EMPIAR is only shipped with Aspera versions below 4.2, so we recommend version 4.1.3 of Aspera Connect. If you don't have it installed, please refer to the discussion here and follow the instructions to download and install it.
An example download command is shown below; see the official website for more information.
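A minimal sketch using the standard ascp options from EMPIAR's Aspera instructions; the bracketed fields are placeholders that you should copy from the entry's download page and your local Aspera installation:

```bash
# Illustrative Aspera download: copy the exact user, host, and remote path
# from the EMPIAR entry's download page, and point -i at your key file.
ascp -QT -l 200M -P 33001 \
    -i <path_to>/asperaweb_id_dsa.openssh \
    <empiar_user>@<empiar_host>:<remote_entry_path> <local_destination>
```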
Applying CryoSTAR to Your Own Experimental Data
If you are using RELION to handle your experimental data, you can easily export the job related to the consensus map.
If you are using cryoSPARC, we recommend exporting the results of the homogeneous refinement job or the heterogeneous refinement job.
Be aware that the data exported from cryoSPARC will be in the .cs format (numpy data files). You can use the csparc2star.py tool provided by pyem to convert .cs files into .star files.
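As a sketch, with placeholder file names (depending on the job, pyem may also ask for the export's passthrough .cs file):

```bash
# Convert a cryoSPARC .cs export into a RELION-style .star file using pyem.
csparc2star.py exported_particles.cs particles.star
```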
Please ensure you have prepared your dataset correctly before moving on to running the code. This will make for smoother operation and more accurate results.
In order to make the most of our method, it's crucial to have a suitable reference PDB structure. Depending on the source of your data, there are various methods to obtain the structure:
Using EMPIAR's Datasets: Typically, datasets from EMPIAR have associated PDB files that can be conveniently downloaded. Make sure, however, that the PDB file models the density relatively completely, especially in potentially dynamic regions. Even a PDB structure that is not entirely accurate is still quite helpful.
On Your Own Dataset: This may require the traditional model-building process to create an atomic model that serves as the reference PDB structure.
Additionally, consider using structures from AlphaFold2 or some homologous sequence structures as required references.
Note on Coordinate Origin ⚠️
A PDB structure carries its own coordinate origin, which typically corresponds to the bottom-left corner of the electron-microscopy density map rather than its center. Applying rotation matrices directly to these coordinates is therefore incorrect; the atomic structure's coordinate origin must first be moved to the center of the density map.
First, make sure you have the consensus map reconstructed from the particle stack and that the corresponding atomic model is docked into the map. Next, simply run the cstar_center_origin command:
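As a sketch, assuming the tool takes the docked PDB and the consensus map as positional arguments (file names are placeholders; check the tool's help output for the exact signature):

```bash
# Assumed invocation: re-center the atomic model on the consensus map
# so that rotations are applied about the map center.
cstar_center_origin your_model.pdb consensus_map.mrc
```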
The output, written to the current directory as <pdb_file_name>_centered.pdb, is the reference structure we need.
This is an optional step. It concerns the value of low_pass_bandwidth in the configuration file under atom_configs; you can generally just set it to 10.
During the atomic-model-based training stage, cryoSTAR uses low-frequency filtering to bridge the gap between the Gaussian density and the real density. In practice, we set the cut-off frequency to the resolution at FSC=0.5, obtained by evaluating the FSC between the Gaussian density derived from the atomic model and the consensus map reconstructed with the given poses.
Assuming you have a particle stack with poses, the consensus map can be reconstructed with the relion_reconstruct command provided by RELION. If the jobs were exported from RELION or cryoSPARC, you can obtain the consensus map directly from those jobs. Take care to adjust the origin of these mrc files following the steps noted earlier.
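A minimal relion_reconstruct sketch, with placeholder file names:

```bash
# Reconstruct the consensus map from the particle star file (poses + CTF).
relion_reconstruct --i particles.star --o consensus_map.mrc --ctf
```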
The Gaussian density corresponding to the atomic model can then be generated from the centered PDB file.
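The helper name and flags below are assumptions (check the cstar_* console scripts installed with cryoSTAR for the actual command); shape and apix should match the consensus map:

```bash
# Hypothetical invocation: the command name and flag names are not verified.
cstar_generate_gaussian_density your_model_centered.pdb --shape 480 --apix 1.43
```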
This command will generate a file named <your_input_pdb_file_name>_gaussian.mrc in the current directory. If you don't know how to set the shape and apix arguments, check the consensus map's shape and apix using cstar_show_mrc_info <consensus_map_path.mrc>, which will print something like:
For the example output above, the shape is 480 and the apix is 1.43.
You can compute the FSC curve with EMAN2's e2proc3d.py:
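For example (file names are placeholders; --calcfsc writes the FSC between the input map and the given reference map to a text file):

```bash
# Compute the FSC between the atomic-model Gaussian density and the consensus map.
e2proc3d.py your_model_centered_gaussian.mrc fsc.txt --calcfsc=consensus_map.mrc
```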
Then use the plotfsc.py script to plot the FSC curve and choose the cut-off frequency.
Again, this step is optional: you can simply use a bandwidth of around 10-15 Å and skip the more involved procedure for determining the cut-off frequency.
The code is structured into two primary parts:
cryostar: a general-purpose Python package. It contains all the basic components that can be reused across multiple projects, offering scalability and high reusability.
projects/star: the code needed to run cryoSTAR.
Input parameters for different datasets are configured through .py files in the projects/star/xxx_configs/ directory.
To run our method, navigate to the projects/star directory using:
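From the repository root:

```bash
cd projects/star
```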
Alternatively, you can copy the projects/star directory to a location of your choice using the cp -r command like so:
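For instance (the destination path is a placeholder):

```bash
cp -r projects/star /path/to/your/workspace
```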
This makes the process more flexible: you can run the method either directly from the original folder or from a copy in a location that suits your workflow better. The procedure itself consists of two core steps:
This step involves executing the train_atom.py script. Your primary task is to set up your own config.py according to your dataset and place it within the atom_configs directory.
We provide a reference script named example.py in the atom_configs/ directory; copy and modify it.
This example.py can serve as a starting point: copy it, then adjust the parameters and the filename to suit your data. Give your customized config a distinct name that reflects your dataset; this helps keep execution and debugging smooth. The values you mainly need to change are those within the dataset_attr dict. A Python dict contains key-value pairs, and your task is to adjust the values associated with the respective keys to match your data.
After that, run the script to start training:
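A typical invocation, assuming your config was saved as atom_configs/your_dataset.py (a placeholder name) and that the script takes the config path as its argument:

```bash
python train_atom.py atom_configs/your_dataset.py
```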
The logs printed on screen should look like:
and end with:
After training finishes, the outputs are stored in the work_dirs/atom_xxxxx directory; evaluations are performed every 12,000 steps. Within this directory you will find sub-directories named epoch-number_step-number. We take the most recent directory as the final result.
We recommend using ChimeraX for visualizing the sampled pdb structures. Some helpful commands are:
This step involves executing the train_density.py script. Your primary task is to set up your own config.py according to your dataset and place it within the density_configs directory.
We provide a reference script named example.py in the density_configs/ directory.
In this step, you need to make changes in two places within example.py: 1) as in Step 1, modify the data-related configuration dataset_attr, though here you may use a larger data_process.down_side_shape to obtain a higher-quality density; 2) set given_z in the extra_input_data_attr dict to the path of the latest z.npy output from Step 1.
Run the following command to start training:
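A typical invocation, again assuming a placeholder config name density_configs/your_dataset.py:

```bash
python train_density.py density_configs/your_dataset.py
```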
If you want to change configurations on the command line, an example is:
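The following assumes the training scripts expose mmengine's usual --cfg-options style of overrides; the flag name and the path are assumptions, so check the script's argument parser:

```bash
# Assumed mmengine-style override of a single config value.
python train_density.py density_configs/your_dataset.py \
    --cfg-options extra_input_data_attr.given_z=path/to/z.npy
```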
Our configs use mmengine; see the details in their official documentation.
The logs printed on screen should look like:
and end with:
After training finishes, the results are saved to work_dirs/density_xxxxx, with each subdirectory named epoch-number_step-number. We take the most recent directory as the final result.
Some tips for visualization in ChimeraX: