A Real Case: EMPIAR-10180
Step-by-step guide for the pre-catalytic spliceosome dataset
Data preparation 📚
Dataset acquisition
Download EMPIAR-10180 (~126.5GB) from https://www.ebi.ac.uk/empiar/EMPIAR-10180/
You can download the data directly in a browser, but we recommend Aspera.
Aspera Connect is required for downloading large datasets from EMPIAR. The key file asperaweb_id_dsa.openssh needed for EMPIAR downloads ships only with Aspera Connect versions below 4.2, so we recommend version 4.1.3. If you don't have it installed, please refer to the discussion here and follow the instructions to download and install it.
Run the following command to download the data; more information is available on the official website.
This dataset contains 327,490 extracted particles with a box size of 320 pixels and a pixel size of 1.7 Å/pix. Below is the file structure of this dataset:
The particle images are saved in the Micrographs_xxxxxxxx/ directories; pose and CTF parameters are saved in data/Example/consensus_data.star.
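Before moving on, it can help to confirm that the download is complete. The following is a minimal sketch, not part of cryoSTAR: the function name check_dataset_layout is hypothetical, and it searches recursively because the exact nesting of the directories under the download root is an assumption here.

```python
from pathlib import Path


def check_dataset_layout(dataset_dir):
    """Return a list of problems; an empty list means both expected items were found."""
    root = Path(dataset_dir)
    problems = []

    # Particle images are stored in directories named Micrographs_xxxxxxxx/.
    # Assumption: they may sit at any depth below the download root.
    if not any(p.is_dir() for p in root.rglob("Micrographs_*")):
        problems.append("no Micrographs_* directories found")

    # Pose and CTF parameters are stored in data/Example/consensus_data.star.
    if not any(p.is_file() for p in root.rglob("consensus_data.star")):
        problems.append("consensus_data.star not found")

    return problems
```

Calling check_dataset_layout on your download directory returns an empty list when both the particle directories and the STAR file are present.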
Reference structure preparation
Option 1: Use the off-the-shelf file:
We provide a pre-processed PDB file ready for direct download. Use the following command to download it:
Option 2: Prepare from scratch:
In cryo-EM, pose rotations pivot about the center of the reconstructed three-dimensional density, so we need to move the coordinate origin of the PDB file to coincide with this center.
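To make this requirement concrete, here is a minimal sketch of the recentering idea. It assumes the docked model shares the map's coordinate frame and that the map origin is at a corner of the box, so the box center lies at (box size / 2) × pixel size along each axis. It is an illustration only, not the implementation of cstar_center_origin, whose conventions may differ; the function names and the atom coordinates are made up.

```python
def box_center(box_size, apix):
    """Center of a cubic box in Angstrom, assuming the map origin is at the box corner."""
    half = box_size / 2 * apix
    return (half, half, half)


def recenter(coords, center):
    """Translate (x, y, z) coordinates so that `center` becomes the new origin."""
    cx, cy, cz = center
    return [(x - cx, y - cy, z - cz) for x, y, z in coords]


# Values from this dataset: box size 320, pixel size 1.7 A/pix.
center = box_center(320, 1.7)

# Two made-up atom positions in the map frame, in Angstrom.
atoms = [(272.0, 272.0, 272.0), (300.0, 250.0, 272.0)]
print(recenter(atoms, center))
```

After the shift, an atom that sat at the box center has coordinates close to (0, 0, 0), which is the pivot point the poses refer to.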
Download the PDB structure from RCSB:
(cryostar) $ wget https://files.rcsb.org/download/5NRL.cif
You can use data/Example/consensus_half1_class001.mrc as the consensus map. Use ChimeraX to dock the downloaded structure into the map and save it as 5nrl.pdb:
Use our command-line tool cstar_center_origin to move the coordinate origin; the output, 5nrl_centered.pdb, is the file we need:
Optional: determine low-pass filter bandwidth
This step explains how the low_pass_bandwidth value under data_process in the atom_configs/10180.py file is determined. You can skip this step if you want.
In the atomic-model-dependent training procedure, cryoSTAR applies a low-pass filter to bridge the gap between the Gaussian density and the consensus map density. The cutoff frequency of this filter is taken from the FSC curve between the Gaussian density and the consensus map. We choose 23.4 Å (FSC = 0.5) for 10180.
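As a rough illustration of where this value ends up, the fragment below sketches the relevant entry of the config and converts the resolution into a spatial frequency. Only data_process and low_pass_bandwidth are named in this guide; that the value is stored in Angstrom is an assumption based on the 23.4 Å figure, and the two helper functions are hypothetical additions for the arithmetic.

```python
# Sketch of the fragment of atom_configs/10180.py discussed above.
# Only `data_process` and `low_pass_bandwidth` are named in this guide;
# every other entry of the real file is omitted.
data_process = dict(
    low_pass_bandwidth=23.4,  # assumed Angstrom; resolution at which FSC = 0.5 for 10180
)


def cutoff_frequency(resolution_angstrom):
    """Spatial frequency (1/Angstrom) that corresponds to a resolution in Angstrom."""
    return 1.0 / resolution_angstrom


def nyquist_fraction(resolution_angstrom, apix):
    """Cutoff expressed as a fraction of the Nyquist frequency 1 / (2 * apix)."""
    return 2.0 * apix / resolution_angstrom


bandwidth = data_process["low_pass_bandwidth"]
print(f"cutoff: {cutoff_frequency(bandwidth):.4f} 1/A")
print(f"fraction of Nyquist at 1.7 A/pix: {nyquist_fraction(bandwidth, 1.7):.3f}")
```

A 23.4 Å cutoff corresponds to roughly 0.043 Å⁻¹, well below the Nyquist limit of 3.4 Å for this pixel size, so only coarse features of the map are compared with the Gaussian density.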
We provide a command-line tool called cstar_generate_gaussian_density that generates a coarse-grained Gaussian density from a PDB file. Its usage is cstar_generate_gaussian_density <pdb_file_path.pdb> <shape> <apix>. The input parameters are, respectively, the PDB file path, the shape of the generated density, and the pixel size in Angstrom. The result is saved in the current directory, named with the input file name plus _gaussian.mrc. For this tutorial, run cstar_generate_gaussian_density 5nrl.pdb 320 1.7 and you will get an output file called 5nrl_gaussian.mrc.
You can then use the EMAN e2proc3d.py tool to calculate the FSC curve between the generated Gaussian density and the consensus map.
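Once you have an FSC curve, the resolution at FSC = 0.5 can be read off by interpolation. The sketch below assumes the curve has been exported as a two-column text file (spatial frequency in 1/Å, then the FSC value); check the actual output of your EMAN version, since that layout is an assumption here. The function names are hypothetical.

```python
def read_fsc(path):
    """Read a two-column text file: spatial frequency (1/Angstrom), FSC value."""
    freqs, fsc = [], []
    with open(path) as handle:
        for line in handle:
            parts = line.split()
            # Skip comments, blank lines, and anything without two columns.
            if line.lstrip().startswith("#") or len(parts) < 2:
                continue
            freqs.append(float(parts[0]))
            fsc.append(float(parts[1]))
    return freqs, fsc


def resolution_at_threshold(freqs, fsc, threshold=0.5):
    """Resolution (Angstrom) where the FSC first drops below `threshold`.

    `freqs` must be in increasing order. Returns None if the curve never crosses.
    """
    for i in range(1, len(freqs)):
        f0, f1 = freqs[i - 1], freqs[i]
        c0, c1 = fsc[i - 1], fsc[i]
        if c0 >= threshold > c1:
            # Linear interpolation between the two samples around the crossing.
            t = (c0 - threshold) / (c0 - c1)
            f_cross = f0 + t * (f1 - f0)
            if f_cross <= 0:
                return None
            return 1.0 / f_cross
    return None
```

For example, resolution_at_threshold(*read_fsc("fsc.txt")) returns the value in Angstrom, where fsc.txt is a placeholder file name.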
CryoSTAR training 🚀
CryoSTAR configures the dataset and training hyper-parameters through the configs/xxxxx.py file. The code is organized into two parts: a general-purpose Python package, cryostar, and a directory for specific methods, projects/star. Training requires a GPU: make sure the machine has one and that torch can use it.
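A quick way to confirm the GPU requirement before launching a long run is to ask PyTorch directly. This is a minimal sketch using torch.cuda.is_available(); the wrapper name gpu_ready is hypothetical, and it returns False when torch is not installed rather than raising.

```python
def gpu_ready():
    """Return True if PyTorch is importable and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment.
        return False
    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    print("GPU available to torch:", gpu_ready())
```

If this prints False on a machine that has a GPU, the usual suspects are a CPU-only torch build or a driver mismatch.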
Training the atom generator
The atomic-model-based training script, train_atom.py, can be found in the projects/star directory. The configuration files for the various cases reside in the projects/star/atom_configs folder, and the configuration file for this tutorial is projects/star/atom_configs/10180.py.
In the 10180.py file, the dataset_dir key in the dataset_attr dictionary specifies the location of the data downloaded from EMPIAR, and the ref_pdb_path key points to the reference PDB file obtained in the previous steps. Change these values to match your own file locations.
After that, run the script to start training:
The logs printed on screen should look like:
and end with:
After training finishes, the outputs are stored in the work_dirs/atom_xxxxx directory. Evaluation runs every 12,000 steps. Within this directory, you will find sub-directories named epoch-number_step-number. We take the most recent sub-directory as the final result.
We recommend ChimeraX for visualizing the sampled PDB structures. Some helpful commands are:
Training the density generator
The density-based training script, train_density.py, can be found in the projects/star directory. The configuration files for the various cases reside in the projects/star/density_configs folder, and the configuration file for this tutorial is projects/star/density_configs/10180.py.
In this 10180.py file, the dataset_dir key in the dataset_attr dictionary specifies the location of the data downloaded from EMPIAR, and the given_z key in the extra_input_data_attr dictionary points to the z.npy file output by the previous step. Change these values to match your own file locations. Alternatively, you can set them with command-line arguments.
Run the following command to start training:
If you want to change the configuration on the command line, an example is:
Our configs use mmengine; see its official documentation for details.
The logs printed on screen should look like:
and end with:
After training finishes, the results are saved to work_dirs/density_xxxxx, where each subdirectory is named epoch-number_step-number. We take the most recent directory as the final result.
Some tips for visualization in ChimeraX: