Try on Your Own Data!

An as detailed as possible operation guide

Data preparation 📚

What users need to prepare is a particle stack with pose information (existing in the form of a starfile and .mrcs files, the corresponding consensus map, and the built atomic model based on the consensus map.

Particle stack

Our method requires properly prepared particle stacks, which carry pose and CTF information. These should exist in the form of starfile and mrcs files.

Exploring Datasets from EMPIAR
- You could try using publicly available datasets from EMPIAR. There are several datasets on EMPIAR that contain heterogeneity. Opting for preprocessed datasets that provide particle stacks, including consensus map pose, is advisable. However, if these aren't available, you'll have to process the data from micrographs by yourself.
- If you're not accustomed to downloading datasets from EMPIAR, feel free to refer to these guidelines:
  - Aspera Connect is required for downloading large datasets from EMPIAR. The necessary file asperaweb_id_dsa.openssh for downloading from EMPIAR is only provided by Aspera's version less than 4.2. Hence, a latest version 4.1.3 of Aspera Connect is recommended. If you don't have it installed, please refer to the discussion here and follow the instructions to download and install it.
  - An example downloading command is showing below, you can see more information on official website.
    $ ascp -QT -l 200M -P33001 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh emp_ext2@fasp.ebi.ac.uk:/test test_download
Applying CryoSTAR to Your Own Experimental Data
- If you are using RELION to handle your experimental data, you can easily export the job related to the consensus map.
- If you are using cryoSPARC, we recommend exporting the results of the homogeneous refinement job or the heterogeneous refinement job.
  - Be aware that the exported data from cryoSPARC will be in the .cs format -- numpy data files. You can use the csparc2star.py tool provided by pyem to convert .cs files into .star files.

Please ensure you have prepared your dataset correctly before moving towards code execution. This will facilitate smoother operations and accurate results.

Retrieving Reference PDB

In order to make the most of our method, it's crucial to have a suitable reference PDB structure. Depending on the source of your data, there are various methods to obtain the structure:

Using EMPIAR's Datasets: Typically, datasets from EMPIAR have associated PDB files that can be conveniently downloaded. However, you should ensure that the PDB files offer relatively complete density modeling, especially in potentially dynamic areas. Even if the PDB structure is not entirely accurate, it's quite beneficial.
On Your Own Dataset: This pathway could necessitate the traditional model building process to create an atomic model to serve as a reference PDB structure.

Additionally, consider using structures from AlphaFold2 or some homologous sequence structures as required references.

Note on Coordinate Origin ⚠️

PDB structure usually determines the coordinate origin, typically corresponding to the bottom-left corner of the electron microscope density map. Applying rotation matrices directly on these coordinates is erroneous. The atomic structure's coordinate origin needs to travel to the center of the electron microscope density map.

Firstly, ensure that you have the consensus map constructed from the particle stack and the corresponding atomic model is docked into the map. Next, you could simply run the cstar_center_origin command:

(cryostar) $ cstar_center_origin <pdb_file_path.pdb> <consensus_map_path.mrc>

The output appeared in the current directory named <pdb_file_name>_centered.pdb is the reference structure we need.

Optional: Determine low-pass filter bandwidth

This is an optional step. It involves the value of low_pass_bandwidth in the configuration file in atom_configs. You can generally set it straightforwardly to 10.

Within the atomic model-dependent training procedure, cryoSTAR makes use of low-frequency filtering to bridge the gap between Gaussian density and the real density. In the practice, we determine the cut-off frequency as the resolution at FSC=0.5 by evaluating the FSC between the Gaussian density linked with the atomic model and the consensus map reconstructed given pose.

Assuming you've gotten a particle stack containing pose, the reconstruction of the consensus map can be accomplished by using the relion_reconstruct command supported by RELION. If the tasks are exported from RELION or CryoSPARC, it's straightforward to receive the consensus map straight from these tasks. Take heed to tweak the origin of these mrc files via earlier noted steps.

The atomic model based Gaussian density can be generated by:

(cryostar) $ cstar_generate_gaussian_density <pdb_file_path.pdb> <shape> <apix>

This command will generate a file named with <your_input_pdb_file_name>_gaussian.mrc in the current directory. If you don't know how to set the shape and apix arguments, check the consensus map shape and apix using cstar_show_mrc_info <consensus_map_path.mrc>, it will print something like:

The mrcfile contains volume data with:
shape (nz, ny, nx):     (480, 480, 480)
voxel_size/A (x, y, z): (1.43, 1.43, 1.43)
origin/A (x, y, z):     (0., 0., 0.)

For the example output above, the shape is 480 and apix is 1.43.

You can calculate the statistics of the FSC curve by EMAN e2proc3d.py:

$ e2proc3d.py consensus_map.mrc output.txt --calcfsc gaussian_density.mrc

Then use plotfsc.py script to show the FSC curve and choose the cutoff frequency.

This is an optional step. It's feasible to directly use a bandwidth around 10-15A, bypassing the complicated steps of defining the cut-off frequency.

Run CryoSTAR 🚀

The code is structured into two primary parts:

cryostar: A universal function python package. This package encompasses all the basic components that could be used in multiple projects, thus offering scalability and high reusability.
projects/star: This directory constitutes the code necessary for running cryoSTAR.

Configuration of input parameters for various data sets is made possible through .py files within the projects/star/xxx_configs/ directory.

In order to run our method, you will need to navigate to the projects/star directory using:

$ cd projects/star

Alternatively, you could copy the projects/star directory to a location of your choice. This can be accomplished using the cp -r command like so:

$ cp -r projects/star <desired_directory>

This makes the process more flexible, as you can choose to run the method either directly from the original folder, or from a copied version in a location that suits your workflow better. Moving on to the procedure for running our method, it is framed in two core steps:

Step 1: Training the atom generator

This step involves executing the train_atom.py script. Your primary task is to set up your own config.py according to your dataset and place it within the atom_configs directory.

We provide you with a reference script namely example.py in atom_configs/ directory, copy and modify it.

This example.py can serve as a starting point. You are encouraged to copy this example.py and modify the parameters and filename to suit your data. Remember to give your customized config.py a distinct name that represents your dataset, which will further help in smooth execution and debug, if needed. At this point, what you mainly need to change are the values within the dataset_attr dict. A dict in Python contains key-value pairs. Your task is to adjust the values associated with respective keys to suit the requirements of your data.

After that, run the script to start training:

$ python train_atom.py atom_configs/xxxxx.py

Logs print on screen should be like:

...
2023/10/12 12:31:10 - cryostar - INFO - epoch 0 [0/5117] | loss: -0.36890 | cryoem(gmm): -0.37057 | con: 0.00072 | sse: 0.00000 | dist: 0.00058 | clash: 0.00037 | kld: 0.00000 | kld(/dim): 3.00000
2023/10/12 12:31:17 - cryostar - INFO - epoch 0 [50/5117] | loss: -0.41110 | cryoem(gmm): -0.41165 | con: 0.00008 | sse: 0.00000 | dist: 0.00047 | clash: 0.00000 | kld: 0.00000 | kld(/dim): 3.00000
...

and end with:

`Trainer.fit` stopped: `max_steps=96000` reached.

After the training is finished, the outputs will be stored in the work_dirs/atom_xxxxx directory, and we perform evaluations every 12,000 steps. Within this directory, you'll observe sub-directories with the name epoch-number_step-number. We choose the most recent directory as the final results.

atom_xxxxx/
├── 0000_0000000/
├── ...
├── 0112_0096000/        # evaluation results
│  ├── ckpt.pt           # model parameters
│  ├── input_image.png   # visualization of input cryo-EM images
│  ├── pca-1.pdb         # sampled coarse-grained atomic structures along 1st PCA axis
│  ├── pca-2.pdb
│  ├── pca-3.pdb
│  ├── pred.pdb          # sampled structures at Kmeans cluster centers
│  ├── pred_gmm_image.png
│  └── z.npy             # the latent code of each particle
|                        # a matrix whose shape is num_of_particle x 8
├── yyyymmdd_hhmmss.log  # running logs
├── config.py            # a backup of the config file
└── train_atom.py        # a backup of the training script

We recommend you to use ChimeraX for visualizing sampled pdb structures. Some helpful commands are:

# show as ribbon diagram
hide atoms
show cartoons

# show as animation
mseries all

# shows a graphical interface in which the slider can be dragged
mseries slider all

Step 2: Training the density generator

This step involves executing the train_density.py script. Your primary task is to set up your own config.py according to your dataset and place it within the density_configs directory.

We provide you with a reference script namely example.py in density_configs/ directory.

In this step, you need to make changes in two areas within the example.py: 1) Just like in Step 1, you need to modify the data-related configurations dataset_attr here too. However, you now have an option to use a larger data_process.down_side_shape to obtain a better quality of density. 2) The given_z in the extra_input_data_attr dict should be set as the path of the latest output z.npy file from Step 1.

Run the following command to start training:

$ python train_density.py density_configs/xxxxx.py

If you want to change configurations in command line, an example is:

$ python train_density.py density_configs/xxxxx.py --cfg-options extra_input_data_attr.given_z=work_dirs/atom_xxxxxx/0018_0096000/z.npy

Our config use mmengine, see details on their official documentation.

Logs print on screen should be like:

2023/10/12 22:00:53 - cryostar - INFO - Config:
...
2023/10/12 22:04:07 - cryostar - INFO - epoch 0 [0/6823] | em: 0.08792 | kld: 0.00000
2023/10/12 22:04:53 - cryostar - INFO - epoch 0 [100/6823] | em: 0.04898 | kld: 0.00000
...

and end with:

`Trainer.fit` stopped: `max_epochs=5` reached.

After the training is finished, results will be saved to work_dirs/density_xxxxx, and each subdirectory has the name epoch-number_step-number. We choose the most recent directory as the final results.

density_xxxxx/
├── 0004_0014470/          # evaluation results
│  ├── ckpt.pt             # model parameters
│  ├── vol_pca_1_000.mrc   # density sampled along the PCA axis, named by vol_pca_pca-axis_serial-number.mrc
│  ├── ...
│  ├── vol_pca_3_009.mrc
│  ├── z.npy
│  ├── z_pca_1.txt         # sampled z values along the 1st PCA axis
│  ├── z_pca_2.txt
│  └── z_pca_3.txt
├── yyyymmdd_hhmmss.log    # running logs
├── config.py              # a backup of the config file
└── train_density.py       # a backup of the training script

Some tips for visualization in ChimeraX:

# set to same threshold, replace xxxx with desired iso-surface level
vol all level xxxx
vol all color cornflowerblue

# show as animation
mseries all
# or
mseries slider all

Last updated 1 year ago