A Real Case: EMPIAR-10180

Step-by-step guide for the pre-catalytic spliceosome dataset

Data preparation 📚

Dataset acquisition

Download EMPIAR-10180 (~126.5GB) from https://www.ebi.ac.uk/empiar/EMPIAR-10180/
- You can download directly in the browser, but we recommend using Aspera.
- Large EMPIAR downloads rely on Aspera Connect. The key file asperaweb_id_dsa.openssh needed for EMPIAR ships only with Aspera Connect versions earlier than 4.2, so we recommend version 4.1.3.
  - If you don't have it installed, please refer to the discussion here and follow the instructions to download and install it.
- Run the following command to download the data; see the official website for more information.
  $ ascp -QT -l 200M -P33001 -k 1 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh emp_ext2@fasp.ebi.ac.uk:/10180 .

This dataset contains 327,490 extracted particles with a box size of 320 pixels and a pixel size of 1.7 Å/pixel. Below is the file structure of this dataset:

data/
├── Example/
│   ├── 4-bodies-tight-mask.star
│   ├── consensus_data.star
│   ├── consensus_half1_class001.mrc
│   ├── consensus_half1_model.star
│   ├── consensus_half2_class001.mrc
│   ├── consensus_half2_model.star
│   ├── consensus_optimiser.star
│   ├── consensus_sampling.star
│   └── multibody.sh
├── Mask-and-Ref/
├── Micrographs_20160622/
├── Micrographs_20160710/
├── Micrographs_20160716/
├── Micrographs_20160813/
├── Micrographs_20160820/
├── Micrographs_20160911/
├── Micrographs_thick/
└── Micrographs_thin/

The particle images are stored in the Micrographs_xxxxxxxx/ directories, and the pose and CTF parameters are stored in data/Example/consensus_data.star.
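
As a quick sanity check on the numbers above, the sketch below derives the physical box width, the Nyquist limit, and the approximate size of the particle stack. It assumes the particle images are stored as 32-bit floats, which the dataset description does not state, and the helper names are ours.

```python
# Values quoted in the dataset description above.
NUM_PARTICLES = 327_490
BOX_SIZE = 320          # pixels per side
APIX = 1.7              # Angstrom per pixel

# Assumption: 4 bytes per pixel (32-bit float); not stated by the dataset.
BYTES_PER_PIXEL = 4


def box_width_angstrom(box_size: int, apix: float) -> float:
    """Physical side length of one particle image."""
    return box_size * apix


def nyquist_resolution(apix: float) -> float:
    """Finest resolvable detail at this sampling (two pixels per period)."""
    return 2.0 * apix


def stack_size_bytes(n: int, box_size: int, bytes_per_pixel: int) -> int:
    """Raw pixel payload of n square images, ignoring file headers."""
    return n * box_size * box_size * bytes_per_pixel


if __name__ == "__main__":
    print(f"box width : {box_width_angstrom(BOX_SIZE, APIX):.1f} A")
    print(f"Nyquist   : {nyquist_resolution(APIX):.1f} A")
    size = stack_size_bytes(NUM_PARTICLES, BOX_SIZE, BYTES_PER_PIXEL)
    print(f"stack size: {size / 2**30:.1f} GiB")
```

Under the 4-byte assumption, the pixel data alone come to about 125 GiB (roughly 134 GB), the same order of magnitude as the quoted download size.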

Reference structure preparation

Option 1: Using off-the-shelf file:

In this step, we provide a pre-processed PDB file ready for direct download. Use the following command to download it:

(cryostar) $ wget https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/cryostar/datasets/tutorial_10180_centered_5nrl.pdb

Option 2: Prepare from scratch:

Because cryo-EM poses rotate about the center of the reconstructed 3D density, the coordinate origin of the PDB file must be moved to coincide with that center.
Download the 5NRL structure from RCSB:
(cryostar) $ wget https://files.rcsb.org/download/5NRL.cif
Use data/Example/consensus_half1_class001.mrc as the consensus map. In ChimeraX, dock the downloaded structure into this map and save the result as 5nrl.pdb.

Use our command-line tool cstar_center_origin to move the coordinate origin; the output, 5nrl_centered.pdb, is the file we need:
```
(cryostar) $ cstar_center_origin 5nrl.pdb consensus_half1_class001.mrc
```
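
To illustrate why the origin has to move, the sketch below shifts a set of atom coordinates so that the center of a cubic map becomes (0, 0, 0). It is not the implementation of cstar_center_origin. It assumes a cubic map whose header origin is zero and whose center sits at voxel index box_size // 2, and the function name is hypothetical.

```python
from typing import List, Tuple

Coord = Tuple[float, float, float]


def shift_to_map_center(coords: List[Coord], box_size: int, apix: float) -> List[Coord]:
    """Translate coordinates so the map center becomes the origin.

    Assumes a cubic map with its header origin at zero, so the center lies
    at (box_size // 2) * apix along every axis.
    """
    center = (box_size // 2) * apix
    return [(x - center, y - center, z - center) for x, y, z in coords]


if __name__ == "__main__":
    # Made-up coordinates in Angstrom, for illustration only.
    docked = [(272.0, 272.0, 272.0), (282.0, 262.0, 300.0)]
    for atom in shift_to_map_center(docked, box_size=320, apix=1.7):
        print(atom)
```

With a 320-pixel box at 1.7 Å/pixel, the assumed center is 272 Å along each axis, so an atom docked at the middle of the map ends up at the origin.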

Optional: determine low-pass filter bandwidth

This step explains how the low_pass_bandwidth value under data_process in the atom_configs/10180.py file was determined. You can skip it if you like.
During atomic-model-based training, cryoSTAR applies a low-pass filter to bridge the gap between the Gaussian density and the consensus map density. The cutoff of this filter is chosen with reference to the FSC curve between the Gaussian density and the consensus map. For 10180 we chose 23.4 Å (FSC = 0.5).

We provide a command-line tool, cstar_generate_gaussian_density, that generates a coarse-grained Gaussian density from a PDB file. Its usage is cstar_generate_gaussian_density <pdb_file_path.pdb> <shape> <apix>, where the arguments are the PDB file path, the shape of the generated density, and the pixel size in Å. The result is saved in the current directory under the input file name plus _gaussian.mrc. For this tutorial, run cstar_generate_gaussian_density 5nrl.pdb 320 1.7 to get an output file called 5nrl_gaussian.mrc.
You can then use EMAN2's e2proc3d.py tool to calculate the FSC curve between the generated Gaussian density and the consensus map.
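
The FSC = 0.5 criterion described above can be sketched as follows: given an FSC curve as spatial frequencies and correlation values, find where it first drops below the threshold and convert that frequency to a resolution in Å. The function name and the sample curve are made up for illustration; real values would come from the FSC output of your own tool.

```python
from typing import Optional, Sequence


def resolution_at_threshold(
    freqs: Sequence[float], fsc: Sequence[float], threshold: float = 0.5
) -> Optional[float]:
    """Return the resolution (1 / frequency) where FSC first falls below threshold.

    freqs are spatial frequencies in 1/Angstrom, in increasing order.
    Linear interpolation is used between the two bracketing samples.
    Returns None if the curve never drops below the threshold.
    """
    for i in range(1, len(freqs)):
        above, below = fsc[i - 1], fsc[i]
        if above >= threshold > below:
            t = (above - threshold) / (above - below)
            f_cross = freqs[i - 1] + t * (freqs[i] - freqs[i - 1])
            return 1.0 / f_cross
    return None


if __name__ == "__main__":
    # Synthetic curve, NOT measured on EMPIAR-10180.
    freqs = [0.01, 0.02, 0.04, 0.06, 0.08]
    fsc = [1.00, 0.90, 0.60, 0.40, 0.10]
    res = resolution_at_threshold(freqs, fsc)
    print(f"FSC=0.5 resolution: {res:.1f} A")
```

A value obtained this way is the kind of number that low_pass_bandwidth refers to; the 23.4 Å quoted above comes from the authors' own FSC curve, not from this synthetic example.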

CryoSTAR training 🚀

CryoSTAR configures the dataset information and training hyper-parameters through the configs/xxxxx.py file. The code has two parts: a general-purpose Python package, cryostar, and a directory for specific methods, projects/star. Training requires a GPU; make sure the machine has one and that torch can use it.
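
Before launching a run, it may help to confirm that PyTorch can see a GPU. The helper below is a small convenience sketch built on torch.cuda.is_available(); the function name is ours and is not part of cryoSTAR.

```python
def gpu_status() -> str:
    """Describe whether torch is importable and can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    if torch.cuda.is_available():
        # Report the first visible device.
        return f"CUDA available: {torch.cuda.get_device_name(0)}"
    return "torch is installed but cannot see a CUDA device"


if __name__ == "__main__":
    print(gpu_status())
```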

Training the atom generator

The atomic model based training script, train_atom.py, can be found in the projects/star directory. The configuration files for varying cases reside in the projects/star/atom_configs folder, and the configuration file for this tutorial is projects/star/atom_configs/10180.py.
In the 10180.py file, we use the dataset_dir key in the dataset_attr dictionary to specify the location of the data downloaded from EMPIAR, and the ref_pdb_path key to point to the reference PDB file obtained in the previous steps. Change these values to match your own file locations.
After that, run the script to start training:
```
$ python train_atom.py atom_configs/10180.py
```
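
Since a wrong path only surfaces once training starts, a quick pre-flight check can save time. The sketch below mirrors the two entries described above in a plain dictionary and reports which paths are missing. The nesting of ref_pdb_path inside dataset_attr and the placeholder paths are assumptions; check them against your copy of 10180.py.

```python
from pathlib import Path
from typing import Dict, List

# Assumed layout: both keys are shown inside dataset_attr for illustration.
# The paths are placeholders; replace them with your own locations.
dataset_attr = dict(
    dataset_dir="/path/to/10180/data",
    ref_pdb_path="/path/to/5nrl_centered.pdb",
)


def missing_paths(attrs: Dict[str, str]) -> List[str]:
    """Return the keys whose path does not exist on disk."""
    return [key for key, value in attrs.items() if not Path(value).exists()]


if __name__ == "__main__":
    missing = missing_paths(dataset_attr)
    if missing:
        print("Fix these entries before training:", ", ".join(missing))
    else:
        print("All configured paths exist.")
```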

The logs printed on screen should look like:

...
2023/10/12 12:31:10 - cryostar - INFO - epoch 0 [0/5117] | loss: -0.36890 | cryoem(gmm): -0.37057 | con: 0.00072 | sse: 0.00000 | dist: 0.00058 | clash: 0.00037 | kld: 0.00000 | kld(/dim): 3.00000
2023/10/12 12:31:17 - cryostar - INFO - epoch 0 [50/5117] | loss: -0.41110 | cryoem(gmm): -0.41165 | con: 0.00008 | sse: 0.00000 | dist: 0.00047 | clash: 0.00000 | kld: 0.00000 | kld(/dim): 3.00000
...

and end with:

`Trainer.fit` stopped: `max_steps=96000` reached.
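
Progress lines in this format are easy to parse if you want to plot or monitor the loss terms. The sketch below is a hypothetical helper, not part of cryoSTAR, that splits one line into its epoch, step, and named metrics.

```python
import re
from typing import Dict

# Matches the "epoch E [step/steps_per_epoch]" part of a progress line.
HEADER = re.compile(r"epoch (\d+) \[(\d+)/(\d+)\]")


def parse_log_line(line: str) -> Dict[str, float]:
    """Extract epoch, step and every 'name: value' metric from one log line."""
    head, *fields = line.strip().split(" | ")
    match = HEADER.search(head)
    if match is None:
        raise ValueError("not a training-progress line")
    epoch, step, total = (int(g) for g in match.groups())
    record: Dict[str, float] = {"epoch": epoch, "step": step, "steps_per_epoch": total}
    for field in fields:
        # rsplit keeps names such as "kld(/dim)" intact.
        name, value = field.rsplit(": ", 1)
        record[name] = float(value)
    return record


if __name__ == "__main__":
    sample = (
        "2023/10/12 12:31:17 - cryostar - INFO - epoch 0 [50/5117] | "
        "loss: -0.41110 | cryoem(gmm): -0.41165 | con: 0.00008 | sse: 0.00000 | "
        "dist: 0.00047 | clash: 0.00000 | kld: 0.00000 | kld(/dim): 3.00000"
    )
    rec = parse_log_line(sample)
    print(rec["step"], rec["loss"], rec["cryoem(gmm)"])
```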

When training finishes, the outputs are stored in the work_dirs/atom_xxxxx directory. Evaluation runs every 12,000 steps, and each evaluation writes a sub-directory named epoch-number_step-number. We take the most recent sub-directory as the final result.

atom_xxxxx/
├── 0000_0000000/
├── ...
├── 0112_0096000/        # evaluation results
│  ├── ckpt.pt           # model parameters
│  ├── input_image.png   # visualization of input cryo-EM images
│  ├── pca-1.pdb         # sampled coarse-grained atomic structures along 1st PCA axis
│  ├── pca-2.pdb
│  ├── pca-3.pdb
│  ├── pred.pdb          # sampled structures at Kmeans cluster centers
│  ├── pred_gmm_image.png
│  └── z.npy             # the latent code of each particle
│                        # a matrix whose shape is num_of_particle x 8
├── yyyymmdd_hhmmss.log  # running logs
├── config.py            # a backup of the config file
└── train_atom.py        # a backup of the training script
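
Picking the most recent evaluation directory can be automated. The sketch below is a hypothetical helper that relies only on the epoch-number_step-number naming described above: it scans a work directory and returns the sub-directory with the highest step number.

```python
import re
from pathlib import Path
from typing import Optional

EVAL_DIR = re.compile(r"^(\d+)_(\d+)$")  # epoch-number_step-number


def latest_eval_dir(work_dir: Path) -> Optional[Path]:
    """Return the evaluation sub-directory with the highest step number."""
    candidates = []
    for child in work_dir.iterdir():
        match = EVAL_DIR.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(2)), child))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


if __name__ == "__main__":
    # Placeholder path; substitute your own work_dirs/atom_xxxxx directory.
    work_dir = Path("work_dirs/atom_xxxxx")
    if work_dir.is_dir():
        print(latest_eval_dir(work_dir))
```

The z.npy file inside the returned directory is the one to pass to the density training step.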

We recommend using ChimeraX to visualize the sampled PDB structures. Some helpful commands are:

# show as ribbon diagram
hide atoms
show cartoons

# show as animation
mseries all

# shows a graphical interface in which the slider can be dragged
mseries slider all

Training the density generator

The density based training script, train_density.py, can be found in the projects/star directory. The configuration files for varying cases reside in the projects/star/density_configs folder, and the configuration file for this tutorial is projects/star/density_configs/10180.py.
In this 10180.py file, we use the dataset_dir key in the dataset_attr dictionary to specify the location of the data downloaded from EMPIAR, and the given_z key in the extra_input_data_attr dictionary to point to the z.npy file produced by the previous step. Change these values to match your own file locations. Alternatively, you can set them with command-line arguments.
Run the following command to start training:
```
$ python train_density.py density_configs/10180.py
```
To change configuration values on the command line, use --cfg-options, for example:
```
$ python train_density.py density_configs/10180.py --cfg-options extra_input_data_attr.given_z=work_dirs/atom_10180/0018_0096000/z.npy
```
Our configs use mmengine; see its official documentation for details.
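
The dotted key passed to --cfg-options addresses a nested entry in the config. The sketch below mimics that behaviour on a plain dictionary so the mapping is easy to see. It is an illustration only, not mmengine's code, and apply_override is a hypothetical helper.

```python
from typing import Any, Dict


def apply_override(config: Dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set config[a][b]...[z] = value for a key written as 'a.b...z'."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Create intermediate dictionaries when they are missing.
        node = node.setdefault(name, {})
    node[leaf] = value


if __name__ == "__main__":
    # Placeholder config mirroring the keys named in this tutorial.
    cfg = {
        "dataset_attr": {"dataset_dir": "/path/to/10180/data"},
        "extra_input_data_attr": {"given_z": None},
    }
    apply_override(
        cfg,
        "extra_input_data_attr.given_z",
        "work_dirs/atom_10180/0018_0096000/z.npy",
    )
    print(cfg["extra_input_data_attr"])
```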

The logs printed on screen should look like:

2023/10/12 22:00:53 - cryostar - INFO - Config:
...
2023/10/12 22:04:07 - cryostar - INFO - epoch 0 [0/6823] | em: 0.08792 | kld: 0.00000
2023/10/12 22:04:53 - cryostar - INFO - epoch 0 [100/6823] | em: 0.04898 | kld: 0.00000
...

and end with:

`Trainer.fit` stopped: `max_epochs=5` reached.

When training finishes, the results are saved to work_dirs/density_xxxxx, where each sub-directory is named epoch-number_step-number. We take the most recent sub-directory as the final result.

density_xxxxx/
├── 0004_0014470/          # evaluation results
│  ├── ckpt.pt             # model parameters
│  ├── vol_pca_1_000.mrc   # density sampled along the PCA axis, named by vol_pca_pca-axis_serial-number.mrc
│  ├── ...
│  ├── vol_pca_3_009.mrc
│  ├── z.npy
│  ├── z_pca_1.txt         # sampled z values along the 1st PCA axis
│  ├── z_pca_2.txt
│  └── z_pca_3.txt
├── yyyymmdd_hhmmss.log    # running logs
├── config.py              # a backup of the config file
└── train_density.py       # a backup of the training script
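
To load each PCA trajectory as an ordered series, it helps to group the volumes by axis first. The sketch below is a hypothetical helper based only on the vol_pca_pca-axis_serial-number.mrc naming shown above.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

VOLUME = re.compile(r"^vol_pca_(\d+)_(\d+)\.mrc$")  # vol_pca_<axis>_<serial>.mrc


def group_by_pca_axis(filenames: Iterable[str]) -> Dict[int, List[str]]:
    """Map each PCA axis to its volumes, ordered by serial number."""
    groups: Dict[int, List[Tuple[int, str]]] = defaultdict(list)
    for name in filenames:
        match = VOLUME.match(name)
        if match:
            axis, serial = int(match.group(1)), int(match.group(2))
            groups[axis].append((serial, name))
    # Sort axes, then sort each axis's files by serial number.
    return {axis: [n for _, n in sorted(items)] for axis, items in sorted(groups.items())}


if __name__ == "__main__":
    names = ["vol_pca_1_001.mrc", "z.npy", "vol_pca_1_000.mrc", "vol_pca_3_009.mrc"]
    for axis, files in group_by_pca_axis(names).items():
        print(axis, files)
```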

Some tips for visualization in ChimeraX:

# set to same threshold, replace xxxx with desired iso-surface level
vol all level xxxx
vol all color cornflowerblue

# show as animation
mseries all
# or
mseries slider all
