A Minimal Case
This tutorial walks through, step by step, how to use cryoSTAR on a synthetic dataset, so that you can get the most out of the tool. Let's jump right in! 😊📖
The image below illustrates the process of generating the synthetic dataset. More details can be found in our paper.
Download the data from Google Drive (a link for download via the wget command will be provided soon):
Extract the zip file into cryostar/projects/star. The directory looks like:
The directory contains three sub-directories:
pdbs contains 50 pdb files, where 1akeA_{i}.pdb is an interpolated structure between 1akeA_1.pdb (pdbid: 4ake) and 1akeA_50.pdb (pdbid: 1ake). Note that the proteins 4ake and 1ake have identical sequences but different conformations. The interpolation between these two conformations is generated with PyMOL's morph tool.
mrcs contains 50 mrc files, where 1akeA_{i}.mrc is the density corresponding to 1akeA_{i}.pdb. The mrc files are generated with EMAN2's e2pdb2mrc tool.
uniform_snr0-0001_ctf contains the particles projected from the mrc files. We add CTF distortions and Gaussian noise to random projections.
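To make the last item more concrete, here is a minimal NumPy sketch of how a noisy, CTF-distorted particle could be produced from a density volume. It is not the script used to build this dataset: the function names (simple_ctf, simulate_particle), the simplified CTF (no astigmatism, no envelope, one common sign convention), the default microscope parameters, the fixed projection axis (no random rotation), and the definition of SNR as signal variance over noise variance are all assumptions made for illustration.

```python
import numpy as np


def simple_ctf(n, apix, defocus_A, voltage_kv=300.0, cs_mm=2.7, amp_contrast=0.1):
    """Simplified 2-D CTF (no astigmatism, no envelope) on an n x n FFT grid."""
    volts = voltage_kv * 1e3
    # Relativistic electron wavelength in Angstrom.
    wavelength = 12.2643 / np.sqrt(volts + 0.97845e-6 * volts**2)
    freqs = np.fft.fftfreq(n, d=apix)  # spatial frequency in 1/Angstrom
    sx, sy = np.meshgrid(freqs, freqs, indexing="ij")
    s2 = sx**2 + sy**2
    cs_A = cs_mm * 1e7  # spherical aberration, mm -> Angstrom
    gamma = np.pi * wavelength * defocus_A * s2 - 0.5 * np.pi * cs_A * wavelength**3 * s2**2
    return -(np.sqrt(1.0 - amp_contrast**2) * np.sin(gamma) + amp_contrast * np.cos(gamma))


def simulate_particle(volume, apix, defocus_A, snr, rng):
    """Project a cubic volume along axis 0, apply the CTF, add Gaussian noise."""
    projection = volume.sum(axis=0)
    ctf = simple_ctf(projection.shape[0], apix, defocus_A)
    # The CTF is applied as a multiplication in Fourier space.
    distorted = np.fft.ifft2(np.fft.fft2(projection) * ctf).real
    # Assumed SNR definition: var(signal) / var(noise).
    noise_std = np.sqrt(distorted.var() / snr)
    return distorted + rng.normal(0.0, noise_std, size=distorted.shape)


rng = np.random.default_rng(0)
volume = rng.random((32, 32, 32))  # stand-in for a density read from an mrc file
particle = simulate_particle(volume, apix=1.0, defocus_A=15000.0, snr=1e-4, rng=rng)
print(particle.shape)
```

The SNR value of 1e-4 used here is only a guess based on the directory name; at such a low SNR the signal in a single particle is essentially invisible by eye.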
We also provide some results reconstructed with RELION. The density rln.mrc is reconstructed from all particles, whereas rln_reconstruct/rln{i}.mrc is reconstructed only from the particles generated from the i-th mrc file.
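Before training, it can be handy to confirm that the archive was extracted completely. The sketch below checks the file names described above. The helper name missing_dataset_files is hypothetical, and it assumes that the three sub-directories sit directly under the dataset root you pass in; it does not inspect the contents of the particle directory or the RELION results.

```python
from pathlib import Path
from typing import List


def missing_dataset_files(root, num_states: int = 50) -> List[str]:
    """Return relative paths of expected dataset entries that are absent."""
    root = Path(root)
    missing = []
    for i in range(1, num_states + 1):
        for sub, ext in (("pdbs", "pdb"), ("mrcs", "mrc")):
            rel = Path(sub) / f"1akeA_{i}.{ext}"
            if not (root / rel).is_file():
                missing.append(rel.as_posix())
    # Only the existence of the particle directory is checked.
    if not (root / "uniform_snr0-0001_ctf").is_dir():
        missing.append("uniform_snr0-0001_ctf/")
    return missing
```

An empty list means all 50 pdb files, all 50 mrc files, and the particle directory were found.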
Here are some visualizations of the dataset.
Modify dataset_dir in atom_configs/1ake.py so that it points to the path where you extracted the data. This step takes about 65 minutes on four V100 GPUs.
In this case, the result is saved to work_dirs/atom_1ake_0. We focus on the 0123_0024000 folder, which holds the results from the 123rd epoch and the 24,000th step of training.
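The folder naming convention can be decoded programmatically. The following is a small sketch, assuming every result folder follows the zero-padded epoch_step pattern shown above; parse_result_dir is a hypothetical helper, not part of cryoSTAR.

```python
import re
from typing import Tuple

# Two groups of digits separated by an underscore, e.g. "0123_0024000".
_RESULT_DIR = re.compile(r"^(\d+)_(\d+)$")


def parse_result_dir(name: str) -> Tuple[int, int]:
    """Split a result folder name into (epoch, step)."""
    match = _RESULT_DIR.match(name)
    if match is None:
        raise ValueError(f"not an epoch_step folder name: {name!r}")
    return int(match.group(1)), int(match.group(2))


epoch, step = parse_result_dir("0123_0024000")
print(epoch, step)
```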
Open the pca-1.pdb file, which contains 10 structures sampled along the first PCA dimension of the latent space. We use ChimeraX for animation: open the file and enter the command mseries slider all.
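To illustrate what sampling along the first PCA dimension means, here is a minimal NumPy sketch. It is an assumption about the general idea, not cryoSTAR's actual sampling code: the function name, the use of evenly spaced percentiles between the 5th and 95th, and the random stand-in latent codes are all illustrative choices.

```python
import numpy as np


def sample_along_first_pc(z, num_samples=10, low_pct=5.0, high_pct=95.0):
    """Return num_samples latent points spaced along the first principal axis."""
    mean = z.mean(axis=0)
    centered = z - mean
    # Rows of vt are the principal axes of the centered latent codes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pc1 = vt[0]
    scores = centered @ pc1  # coordinate of each code along PC1
    targets = np.percentile(scores, np.linspace(low_pct, high_pct, num_samples))
    # Move from the mean along PC1 by each target coordinate.
    return mean + targets[:, None] * pc1[None, :]


rng = np.random.default_rng(0)
latent_codes = rng.normal(size=(200, 8))  # stand-in for real latent codes
samples = sample_along_first_pc(latent_codes)
print(samples.shape)
```

Each sampled latent point would then be decoded into one structure, which is why the file holds one model per sample.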
Another key file is z.npy, which contains the latent code of each particle. This differs from traditional 3D classification, which assigns a discrete label (e.g., class-1, class-2, class-3) to every particle. In contrast, cryoSTAR assigns each particle a continuous label in the form of a vector (e.g., [0.1, 0.3, 0.4]). The distance between latent codes measures how similar the underlying conformations of the particles are.
z.npy is a 2-D matrix of shape (num_particles, latent_dimension). In the image below, z.npy has shape 25x3, since there are 25 particles and the latent dimension is set to 3.
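The sketch below shows how such a matrix might be inspected with NumPy. To stay self-contained it builds a random 25x3 array instead of reading a real file; with your own results you would replace that line with np.load on your z.npy path. Euclidean distance is an assumption here, since the text above does not say which distance is intended, and both helper names are hypothetical.

```python
import numpy as np


def pairwise_distances(z):
    """Euclidean distance between every pair of latent codes."""
    diff = z[:, None, :] - z[None, :, :]
    return np.sqrt((diff**2).sum(axis=-1))


def nearest_particles(z, query_index, k=5):
    """Indices of the k particles whose latent codes are closest to the query."""
    distances = pairwise_distances(z)[query_index]
    order = np.argsort(distances)
    return order[order != query_index][:k]


rng = np.random.default_rng(0)
z = rng.normal(size=(25, 3))  # stand-in for np.load("z.npy")
num_particles, latent_dimension = z.shape
print(num_particles, latent_dimension)
print(nearest_particles(z, query_index=0, k=5))
```

Particles returned by nearest_particles are the ones expected to share a similar conformation with the query particle.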
Modify dataset_dir in density_configs/1ake.py, and change the following xxx/z.npy to the path of the latest z.npy file produced in step 1. This step takes about 10 minutes on four V100 GPUs.
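If several result folders exist, the most recent z.npy can be located with a short helper. This sketch assumes that each z.npy sits directly inside an epoch_step folder (such as 0123_0024000) under the step 1 work directory; find_latest_z is a hypothetical name and not part of cryoSTAR.

```python
from pathlib import Path
from typing import Optional


def find_latest_z(work_dir) -> Optional[Path]:
    """Return the z.npy from the folder with the highest (epoch, step), if any."""
    candidates = []
    for path in Path(work_dir).glob("*_*/z.npy"):
        epoch_str, _, step_str = path.parent.name.partition("_")
        # Skip folders that do not follow the numeric epoch_step pattern.
        if epoch_str.isdigit() and step_str.isdigit():
            candidates.append(((int(epoch_str), int(step_str)), path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


print(find_latest_z("work_dirs/atom_1ake_0"))
```

The printed path is what would replace xxx/z.npy in the config; None means no matching file was found.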
In this instance, the result is saved to work_dirs/density_1ake_0. We are particularly interested in the 0019_0015640 folder. As before, the name indicates the 19th epoch and the 15,640th step of training.
Let's take a look at the vol_pca_1_*.mrc files! These are 10 volumes produced by cryoSTAR's density generator, using z.npy as an additional input. Let's visualize them with ChimeraX again.
But what if I do not have a reference pdb file?
No problem! cryoSTAR handles this case too: just run the train_density code without specifying z.npy.
Note that this mode looks similar to cryoDRGN, but with some differences. For instance, cryoDRGN applies several pre-processing steps, including pre-shifting, CTF phase-flipping, and pre-computation of Fourier transforms, among others. cryoSTAR eliminates the need for many of these steps while maintaining both quality and speed.