A Minimal Case

What does the input/output look like?

This tutorial walks step by step through running cryoSTAR on a synthetic dataset, so that you can see what its inputs and outputs look like. Let's jump right in! 😊📖

Data Preparation

The image below illustrates the process of generating the synthetic dataset. More details can be found in our paper.

Download the data from Google Drive (a link for download via wget command will be provided soon):

tutorial_data_1ake.zip (Google Drive)

Extract the zip file into cryostar/projects/star. The resulting directory looks like:

The directory contains three sub-directories:

pdbs contains 50 PDB files, where 1akeA_{i}.pdb is an interpolation between 1akeA_1.pdb (PDB ID: 4ake) and 1akeA_50.pdb (PDB ID: 1ake). Note that 4ake and 1ake have identical sequences but different conformations. The interpolation between the two conformations is generated with PyMOL's morph tool.
mrcs contains 50 MRC files, where 1akeA_{i}.mrc is the density corresponding to 1akeA_{i}.pdb. The MRC files are generated with EMAN2's e2pdb2mrc tool.
uniform_snr0-0001_ctf contains the particles projected from the MRC files. We apply CTF distortion and add Gaussian noise to random projections.
We also provide some results reconstructed with RELION. The density rln.mrc is reconstructed from all particles, whereas rln_reconstruct/rln{i}.mrc is reconstructed only from the particles generated from the i-th MRC file.
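Before training, it can help to confirm that the archive was extracted completely. The sketch below checks for the files described above; the function name check_dataset and the root argument are hypothetical, and the name of the extracted top-level directory is not assumed.

```python
from pathlib import Path


def check_dataset(root, num_states=50, prefix="1akeA"):
    """Return the expected dataset entries that are missing under `root`.

    `root` is the extracted data directory (hypothetical argument; pass
    whatever path you extracted the archive to).
    """
    root = Path(root)
    missing = []
    for i in range(1, num_states + 1):
        # Each interpolated state should have one PDB file and one MRC file.
        for subdir, ext in (("pdbs", "pdb"), ("mrcs", "mrc")):
            path = root / subdir / f"{prefix}_{i}.{ext}"
            if not path.is_file():
                missing.append(str(path.relative_to(root)))
    # The particle directory is only checked for existence, because its
    # internal layout is not described in this tutorial.
    if not (root / "uniform_snr0-0001_ctf").is_dir():
        missing.append("uniform_snr0-0001_ctf/")
    return missing
```

An empty list means all 50 PDB files, all 50 MRC files, and the particle directory were found.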

Here are some visualizations of the dataset.

Heterogeneous Reconstruction

Step 1: Reconstruct Atomic Structures

Set dataset_dir in atom_configs/1ake.py to the path where you extracted the data. This step takes about 65 minutes on four V100 GPUs.

$ python train_atom.py atom_configs/1ake.py

Overview of the Outputs

In this case, the result is saved to work_dirs/atom_1ake_0. We will particularly focus on the 0123_0024000 folder. This folder holds the results obtained from the 123rd epoch and the 24,000th step of the training procedure.
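The folder naming convention can be decoded programmatically. The sketch below assumes every output folder follows the zero-padded <epoch>_<step> pattern seen in 0123_0024000; the helper name parse_checkpoint_dir is hypothetical.

```python
import re

# Two zero-padded integers separated by an underscore, e.g. "0123_0024000".
_CHECKPOINT_DIR = re.compile(r"^(\d+)_(\d+)$")


def parse_checkpoint_dir(name):
    """Split an '<epoch>_<step>' folder name into two integers."""
    match = _CHECKPOINT_DIR.match(name)
    if match is None:
        raise ValueError(f"not an <epoch>_<step> folder name: {name!r}")
    return int(match.group(1)), int(match.group(2))


print(parse_checkpoint_dir("0123_0024000"))  # (123, 24000)
```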

Key Output 1: A Stack of Atomic Structures (pca-*.pdb)

Open the pca-1.pdb file, which contains 10 structures sampled along the first PCA dimension of the latent space. We use ChimeraX for animation: open the file and enter the command mseries slider all.
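This page does not spell out how the 10 sample points are chosen. One common way to sample along a principal component, sketched here purely as an assumption, is to project the per-particle latent codes (the z.npy array described in the next section) onto that component and take evenly spaced positions between two percentiles. The function name, the percentile bounds, and the SVD-based PCA are illustrative choices, not cryoSTAR's confirmed implementation.

```python
import numpy as np


def sample_along_pc(z, pc_index=0, num_samples=10, low=5.0, high=95.0):
    """Return `num_samples` latent points evenly spaced along one principal axis.

    z has shape (num_particles, latent_dimension).
    """
    mean = z.mean(axis=0)
    centered = z - mean
    # Rows of vt are the principal directions of the centered latent codes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[pc_index]
    # Coordinate of every particle along the chosen principal direction.
    scores = centered @ direction
    positions = np.linspace(
        np.percentile(scores, low), np.percentile(scores, high), num_samples
    )
    # Map the 1-D positions back into the full latent space.
    return mean + positions[:, None] * direction
```

Each returned row is a latent point that a trained decoder could turn into one structure of the stack.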

Key Output 2: Latent Codes (z.npy)

Another key file is z.npy, which contains the latent code of each particle. This differs from traditional 3D classification, which assigns a discrete label (e.g., class-1, class-2, class-3) to every particle. In contrast, cryoSTAR assigns each particle a continuous label in the form of a vector (e.g., [0.1, 0.3, 0.4]). The distance between two latent codes measures how similar the underlying conformations of the two particles are.

z.npy is a 2-D matrix of shape (num_particles, latent_dimension). In the image below, z.npy has shape 25x3, since there are 25 particles and the latent dimension is set to 3.
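As a minimal sketch of how the latent codes can be used, the function below ranks particles by Euclidean distance to a chosen particle's latent code. The function name nearest_particles is hypothetical, and Euclidean distance is an assumed choice of metric.

```python
import numpy as np


def nearest_particles(z, query_index, k=5):
    """Indices of the k particles whose latent codes are closest to the query.

    z has shape (num_particles, latent_dimension).
    """
    # Euclidean distance from every latent code to the query particle's code.
    distances = np.linalg.norm(z - z[query_index], axis=1)
    order = np.argsort(distances)
    # Drop the query itself, then keep the k closest remaining particles.
    return [int(i) for i in order if i != query_index][:k]


# Example usage (adjust the path to your own run):
# z = np.load("work_dirs/atom_1ake_0/0123_0024000/z.npy")
# print(z.shape)                    # (num_particles, latent_dimension)
# print(nearest_particles(z, 0))
```

Particles returned together are expected to show similar conformations, which is the continuous counterpart of belonging to the same 3D class.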

Step 2: Reconstruct Densities

Set dataset_dir in density_configs/1ake.py as in step 1, and replace xxx/z.npy in the command below with the path to the latest z.npy produced by step 1. This step takes about 10 minutes on four V100 GPUs.

$ python train_density.py density_configs/1ake.py --cfg-options extra_input_data_attr.given_z=xxx/z.npy
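Locating the latest z.npy by hand is easy to get wrong when a run has many output folders. The sketch below picks the most recent one, assuming the folders use the zero-padded <epoch>_<step> naming (so sorting by name is the same as sorting by time) and that z.npy sits directly inside each folder. The helper name latest_z_path is hypothetical.

```python
from pathlib import Path


def latest_z_path(work_dir):
    """Return the z.npy inside the last '<epoch>_<step>' folder of `work_dir`."""
    work_dir = Path(work_dir)
    # Keep only sub-folders that actually contain a z.npy file.
    candidates = sorted(
        (d for d in work_dir.iterdir() if d.is_dir() and (d / "z.npy").is_file()),
        key=lambda d: d.name,
    )
    if not candidates:
        raise FileNotFoundError(f"no z.npy found under {work_dir}")
    # Zero-padded names sort lexicographically in training order.
    return candidates[-1] / "z.npy"
```

For example, latest_z_path("work_dirs/atom_1ake_0") returns a path that can be passed as extra_input_data_attr.given_z in the command above.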

Overview of the Outputs

In this instance, the result is saved to work_dirs/density_1ake_0. We are particularly interested in the 0019_0015640 folder. Following the same naming convention as before, it holds the results from the 19th epoch and the 15,640th step of training.

Key Output: Densities (*.mrc)

Let's take a look at the vol_pca_1_*.mrc files! These are 10 volumes produced by cryoSTAR's density generator, using z.npy as an additional input. Let's visualize them with ChimeraX again.

🤔 Without a PDB File?

Wait, but what if I do not have a reference PDB file?

No problem! cryoSTAR handles this case too: just run train_density.py without specifying z.npy.

$ python train_density.py density_configs/1ake.py

Note that this mode resembles cryoDRGN, with some differences. For instance, cryoDRGN applies several pre-processing steps, including pre-shifting, CTF phase-flipping, and pre-computation of Fourier transforms. cryoSTAR eliminates the need for many such pre-processing steps while maintaining both quality and speed.
