A Real Case: EMPIAR-10180
Step-by-step guide for pre-catalytic spliceosome dataset
Download EMPIAR-10180 (~126.5 GB) from the EMPIAR archive.
Direct browser downloads are possible, but we recommend using Aspera.
Aspera Connect is required for downloading large datasets from EMPIAR. The key file needed for EMPIAR downloads, asperaweb_id_dsa.openssh, is only shipped with Aspera Connect versions earlier than 4.2, so we recommend version 4.1.3.
If you don't have it installed, please refer to the discussion and follow the instructions to download and install it.
Run the following command to download the data; more information is available on the official EMPIAR site.
This dataset contains 327,490 extracted particles with a box size of 320 pixels and a pixel size of 1.7 Å/pix. Below is the file structure of this dataset:
The particle images are saved in the Micrographs_xxxxxxxx/ directories; the pose and CTF parameters are saved in data/Example/consensus_data.star.
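As a quick sanity check on the numbers quoted above, the sketch below derives two quantities that follow directly from the box size and pixel size: the physical extent of the box and the Nyquist resolution limit. The variable names are illustrative only and are not taken from cryoSTAR.

```python
# Values quoted in this tutorial for EMPIAR-10180.
box_size = 320   # box edge length in pixels
apix = 1.7       # pixel size in Angstrom per pixel

# Physical edge length of the particle box, in Angstrom.
box_extent = box_size * apix

# Finest resolution representable at this sampling (Nyquist limit), in Angstrom.
nyquist = 2 * apix

print(f"box extent: {box_extent:.1f} A, Nyquist limit: {nyquist:.1f} A")
```

The 23.4 Å low-pass cutoff used later in this tutorial is much coarser than this Nyquist limit, as expected for a coarse-grained model.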
Option 1: Use the off-the-shelf file:
In this step, we provide a pre-processed PDB file ready for direct download. Use the following command to download it:
Option 2: Prepare from scratch:
In cryo-EM, particle poses rotate about the center of the reconstructed 3D density, so the coordinate origin of the PDB file must be moved to coincide with that center.
Download the structure from RCSB:
(cryostar) $ wget https://files.rcsb.org/download/5NRL.cif
Use our command-line tool cstar_center_origin to move the origin of the coordinates. The output, 5nrl_centered.pdb, is the file we need.
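The idea behind recentering can be sketched in a few lines. This is only an illustration of the geometry, not the implementation of cstar_center_origin: it assumes the docked model is expressed in the map's frame with the origin at the corner of a cubic box, so the box center sits at half the box extent along each axis. The function name center_origin is hypothetical.

```python
def center_origin(coords, box_size, apix):
    """Shift (x, y, z) coordinates so the box center becomes the origin.

    Assumption: the input frame has its origin at the corner of a cubic box
    of box_size pixels with apix Angstrom per pixel.
    """
    half_extent = box_size * apix / 2.0
    return [(x - half_extent, y - half_extent, z - half_extent)
            for x, y, z in coords]


# An atom sitting exactly at the box center maps to the new origin.
shifted = center_origin([(272.0, 272.0, 272.0)], box_size=320, apix=1.7)
```

After such a shift, rotating the coordinates by a particle pose pivots the model about the same point as the reconstructed density.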
This step explains how the low_pass_bandwidth value under data_process in the atom_configs/10180.py file is determined. You can skip this step if you want.
During atomic-model-based training, cryoSTAR applies a low-pass filter to bridge the gap between the Gaussian density and the consensus map density. The cutoff of this filter is taken from the FSC curve between the Gaussian density and the consensus map. We choose 23.4 Å (FSC = 0.5) for EMPIAR-10180.
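Reading a cutoff off an FSC curve amounts to finding where the curve first drops below the threshold. The sketch below does this by linear interpolation between neighbouring samples. The function name and the sample values are made up for illustration; they are not the FSC curve of EMPIAR-10180, and this is not a cryoSTAR API.

```python
def resolution_at_threshold(freqs, fsc, threshold=0.5):
    """Return the resolution (in Angstrom) where the FSC first falls below threshold.

    freqs: spatial frequencies in 1/Angstrom, increasing.
    fsc:   FSC values at those frequencies.
    Returns None if the curve never crosses the threshold.
    """
    for i in range(1, len(fsc)):
        if fsc[i - 1] >= threshold > fsc[i]:
            # Linear interpolation between the two bracketing samples.
            t = (fsc[i - 1] - threshold) / (fsc[i - 1] - fsc[i])
            crossing = freqs[i - 1] + t * (freqs[i] - freqs[i - 1])
            return 1.0 / crossing
    return None


# Toy curve (illustrative values only).
toy_freqs = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06]
toy_fsc = [0.99, 0.95, 0.85, 0.60, 0.40, 0.20]
cutoff = resolution_at_threshold(toy_freqs, toy_fsc)
```

The returned resolution is the kind of value that would then be entered as low_pass_bandwidth.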
We provide a command-line tool, cstar_generate_gaussian_density, that generates a coarse-grained Gaussian density from a PDB file. Its usage is cstar_generate_gaussian_density <pdb_file_path.pdb> <shape> <apix>. The arguments are, respectively, the PDB file path, the shape of the generated density, and the pixel size in Å. The result is saved in the current directory, named after the input file with the suffix _gaussian.mrc. For this tutorial, run cstar_generate_gaussian_density 5nrl.pdb 320 1.7 to get an output file called 5nrl_gaussian.mrc.
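To make the notion of a coarse-grained Gaussian density concrete, here is a small pure-Python sketch that sums one isotropic Gaussian per coordinate on a cubic grid. It is not the implementation behind cstar_generate_gaussian_density: the sigma parameter, the voxel-position convention, and the grid[z][y][x] indexing are all assumptions made for this illustration, and a real tool would use array libraries and write an MRC file.

```python
import math


def gaussian_density(coords, shape, apix, sigma=2.0):
    """Sum one isotropic Gaussian per (x, y, z) coordinate on a cubic grid.

    Assumptions: coordinates are in Angstrom and already centered on the box;
    voxel index i sits at (i - shape // 2) * apix along each axis;
    the result is indexed as grid[z][y][x].
    """
    axis = [(i - shape // 2) * apix for i in range(shape)]
    grid = [[[0.0] * shape for _ in range(shape)] for _ in range(shape)]
    two_var = 2.0 * sigma ** 2
    for x, y, z in coords:
        # A 3D isotropic Gaussian factorises into three 1D profiles.
        gx = [math.exp(-((a - x) ** 2) / two_var) for a in axis]
        gy = [math.exp(-((a - y) ** 2) / two_var) for a in axis]
        gz = [math.exp(-((a - z) ** 2) / two_var) for a in axis]
        for k in range(shape):
            for j in range(shape):
                weight = gz[k] * gy[j]
                row = grid[k][j]
                for i in range(shape):
                    row[i] += weight * gx[i]
    return grid


# One pseudo-atom at the origin on a tiny 9-voxel grid.
density = gaussian_density([(0.0, 0.0, 0.0)], shape=9, apix=1.7)
```

With a single centered point, the density peaks at the middle voxel and falls off symmetrically around it.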
cryoSTAR configures the dataset information and training hyper-parameters through the configs/xxxxx.py file. The code is organized into two parts: a general-purpose Python package, cryostar, and a directory for specific methods, projects/star. Training requires a GPU, so make sure the machine has one and that torch can use it.
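One way to confirm that torch can see a GPU before starting a long run is sketched below. It relies only on torch.cuda.is_available() and returns False rather than failing when torch is not installed; the helper name gpu_available is illustrative.

```python
def gpu_available():
    """Return True if torch is importable and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment.
        return False
    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    print("GPU usable by torch:", gpu_available())
```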
The atomic-model-based training script, train_atom.py, is in the projects/star directory. The configuration files for the different cases are in the projects/star/atom_configs folder, and the configuration file for this tutorial is projects/star/atom_configs/10180.py.
In the 10180.py file, the dataset_dir key in the dataset_attr dictionary specifies the location of the data downloaded from EMPIAR, and the ref_pdb_path key points to the reference PDB file obtained in the previous steps. Change these values to match your own file locations.
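The fragment below sketches what the two entries might look like and adds a small check that the paths exist before training starts. The paths are placeholders, the placement of ref_pdb_path inside dataset_attr is an assumption (the text above does not say which dictionary holds it), and missing_paths is a hypothetical helper, not part of cryoSTAR.

```python
import os

# Placeholder values: replace them with your own locations.
dataset_attr = dict(
    dataset_dir="/path/to/EMPIAR-10180",        # data downloaded from EMPIAR
    ref_pdb_path="/path/to/5nrl_centered.pdb",  # assumed to live in this dict
)


def missing_paths(attr, keys=("dataset_dir", "ref_pdb_path")):
    """Return the keys whose configured path does not exist on disk."""
    return [key for key in keys if not os.path.exists(attr[key])]
```

Calling missing_paths(dataset_attr) before launching training gives an early, readable error instead of a failure partway through data loading.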
After that, run the script to start training:
The logs printed on screen should look like:
and end with:
After training finishes, the outputs are stored in the work_dirs/atom_xxxxx directory. Evaluation runs every 12,000 steps, and each evaluation produces a sub-directory named epoch-number_step-number. We take the most recent directory as the final result.
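Picking the most recent result directory can be automated. The sketch below assumes only that each result directory name contains the epoch number followed by the step number as its first two integers; the exact naming pattern is not shown here, so adjust the parsing if your names differ. The function name is illustrative.

```python
import re


def latest_result_name(names):
    """Return the name with the highest (epoch, step) pair, or None.

    Assumption: the first two integers in a name are the epoch and step.
    Names with fewer than two integers are ignored.
    """
    keyed = []
    for name in names:
        numbers = re.findall(r"\d+", name)
        if len(numbers) >= 2:
            keyed.append(((int(numbers[0]), int(numbers[1])), name))
    return max(keyed)[1] if keyed else None
```

In practice the names would come from listing the work directory, for example with os.listdir on the work_dirs/atom_xxxxx path.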
The density-based training script, train_density.py, is in the projects/star directory. The configuration files for the different cases are in the projects/star/density_configs folder, and the configuration file for this tutorial is projects/star/density_configs/10180.py.
In this 10180.py file, the dataset_dir key in the dataset_attr dictionary specifies the location of the data downloaded from EMPIAR, and the given_z key in the extra_input_data_attr dictionary points to z.npy, the output of the previous step. Change these values to match your own file locations. Alternatively, you can set them with command-line arguments.
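Overriding a nested configuration value from the command line usually comes down to addressing it with a dotted key. The sketch below shows that idea on plain dictionaries. It is an illustration only: mmengine has its own override mechanism, the training scripts' actual argument syntax is not reproduced here, and apply_override is a hypothetical helper.

```python
def apply_override(cfg, dotted_key, value):
    """Set cfg[a][b]...[z] = value for a dotted key such as 'a.b.z'."""
    keys = dotted_key.split(".")
    node = cfg
    for key in keys[:-1]:
        # Descend into nested dicts, creating them if they are missing.
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return cfg


# Placeholder config mirroring the two keys discussed above.
cfg = {
    "dataset_attr": {"dataset_dir": "/path/to/EMPIAR-10180"},
    "extra_input_data_attr": {"given_z": None},
}
apply_override(cfg, "extra_input_data_attr.given_z", "/path/to/z.npy")
```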
Run the following command to start training:
To change configuration values on the command line, an example is:
The logs printed on screen should look like:
and end with:
After training finishes, the results are saved to work_dirs/density_xxxxx, and each sub-directory is named epoch-number_step-number. We take the most recent directory as the final result.
Some tips for visualization in ChimeraX:
You can use data/Example/consensus_half1_class001.mrc as the consensus map. Dock the downloaded PDB into the map and save it as 5nrl.pdb:
You can then calculate the FSC curve between the generated Gaussian density and the consensus map.
We recommend you to use for visualizing sampled pdb structures. Some helpful commands are:
Our configs use mmengine; see its official documentation for details.