# A Real Case: EMPIAR-10180

## Data preparation 📚

### Dataset acquisition

* Download EMPIAR-10180 (\~126.5GB) from <https://www.ebi.ac.uk/empiar/EMPIAR-10180/>
  * Direct browser downloads are possible, but we recommend Aspera for a dataset of this size.
  * [Aspera Connect](https://www.ibm.com/aspera/connect/) is required for downloading large datasets from [EMPIAR](https://www.ebi.ac.uk/empiar/). The file `asperaweb_id_dsa.openssh` needed for EMPIAR downloads is only provided by Aspera Connect versions earlier than 4.2, so we recommend version 4.1.3.
    * If you don't have it installed, please refer to the discussion [here](https://www.biostars.org/p/9528910/) and follow the instructions to download and install it.
  * Run the following command to download the data. More information is available on the official [website](https://www.ebi.ac.uk/empiar/faq#question_CLDownload).

    ```shell
    $ ascp -QT -l 200M -P33001 -k 1 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh emp_ext2@fasp.ebi.ac.uk:/10180 .
    ```
* This dataset contains 327,490 extracted particles with a box size of 320 pixels and a pixel size of 1.7 Å/pixel. Below is the file structure of this dataset:

  ```
  data/
  ├── Example/
  |   ├── 4-bodies-tight-mask.star
  |   ├── consensus_data.star
  |   ├── consensus_half1_class001.mrc
  |   ├── consensus_half1_model.star
  |   ├── consensus_half2_class001.mrc
  |   ├── consensus_half2_model.star
  |   ├── consensus_optimiser.star
  |   ├── consensus_sampling.star
  |   └── multibody.sh
  ├── Mask-and-Ref/
  ├── Micrographs_20160622/
  ├── Micrographs_20160710/
  ├── Micrographs_20160716/
  ├── Micrographs_20160813/
  ├── Micrographs_20160820/
  ├── Micrographs_20160911/
  ├── Micrographs_thick/
  └── Micrographs_thin/
  ```
* The **particle images** are stored in the `Micrographs_xxxxxxxx/` directories; the **pose** and **CTF** parameters are stored in `data/Example/consensus_data.star`.
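
As a quick sanity check after the download, you may want to confirm that the STAR file lists the expected number of particles. The sketch below is a minimal, dependency-free row counter for STAR-style `loop_` tables. It is not part of cryoSTAR, and the two example rows (including the `.mrcs` file names) are made up for illustration.

```python
def count_star_rows(lines):
    """Return {block_name: row_count} for every loop_ table in STAR-formatted lines."""
    counts = {}
    block = None
    in_loop = False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("data_"):
            block = line  # a new data block starts
            in_loop = False
        elif line == "loop_":
            in_loop = True
            counts[block] = 0
        elif line.startswith("_"):
            continue  # column label or key-value pair, not a data row
        elif in_loop:
            counts[block] += 1
    return counts


# Hypothetical two-particle example; real files have many more columns.
example_lines = """
data_

loop_
_rlnImageName #1
_rlnDefocusU #2
000001@Micrographs_thin/example.mrcs 12000.0
000002@Micrographs_thin/example.mrcs 12500.0
""".splitlines()

print(count_star_rows(example_lines))  # {'data_': 2}
```

On the real dataset you could pass an open file handle instead, for example `count_star_rows(open("data/Example/consensus_data.star"))`, and compare the result with the 327,490 particles stated above.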

### Reference structure preparation

**Option 1: Use the off-the-shelf file:**

* In this step, we provide a pre-processed PDB file ready for direct download. Use the following command to download it:

  ```
  (cryostar) $ wget https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/cryostar/datasets/tutorial_10180_centered_5nrl.pdb
  ```

**Option 2: Prepare from scratch:**

* Cryo-EM poses describe rotations about the center of the reconstructed 3D density, so the coordinate origin of the PDB file must be moved to coincide with that center.
* Download the structure from RCSB: `(cryostar) $ wget https://files.rcsb.org/download/5NRL.cif`
* Use `data/Example/consensus_half1_class001.mrc` as the consensus map. In [ChimeraX](https://www.cgl.ucsf.edu/chimerax/), dock the downloaded structure into the map and save it as `5nrl.pdb`:

<figure><img src="/files/23pXFA5To6467S0IUuAd" alt="" width="563"><figcaption><p>How to dock a PDB file into a cryo-EM map, then save it?</p></figcaption></figure>

* Use our command-line tool `cstar_center_origin` to move the coordinate origin. The output, `5nrl_centered.pdb`, is the reference structure we need:

  ```
  (cryostar) $ cstar_center_origin 5nrl.pdb consensus_half1_class001.mrc
  ```
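
To make the idea of "moving the origin to the map center" concrete, here is a minimal sketch of the arithmetic. It is not the implementation of `cstar_center_origin`. It assumes a cubic box whose origin sits at the corner voxel (an MRC header origin of zero) and a model already docked in the map's coordinate frame. The helper names are hypothetical.

```python
def box_center_angstrom(box_size, apix):
    """Center of a cubic box in Å, measured from the corner voxel."""
    return (box_size // 2) * apix


def shift_to_box_center(coords, box_size, apix):
    """Translate (x, y, z) coordinates so the box center becomes the origin."""
    c = box_center_angstrom(box_size, apix)
    return [(x - c, y - c, z - c) for x, y, z in coords]


# For this dataset: box size 320, pixel size 1.7 Å, so the center is 160 * 1.7 Å
# from the corner along each axis.
center = box_center_angstrom(320, 1.7)

# A hypothetical atom sitting exactly at the box center ends up at the origin.
shifted = shift_to_box_center([(center, center, center)], 320, 1.7)
print(shifted)
```

If your map has a non-zero header origin, the shift would have to account for it as well, which is one reason to rely on the provided tool rather than on this sketch.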

### Optional: determine low-pass filter bandwidth

* This step explains how the `low_pass_bandwidth` value under `data_process` in the `atom_configs/10180.py` file was determined. You can skip it if you like.
* During atomic-model-based training, cryoSTAR applies a low-pass filter to bridge the gap between the Gaussian density and the consensus map density. The cutoff of this filter is taken from the FSC curve between the Gaussian density and the consensus map. For 10180 we choose 23.4 Å, the resolution at which FSC = 0.5.

<figure><img src="/files/CiUAMuJkxD1dkyh3N6cQ" alt="" width="563"><figcaption><p>EMPIAR-10180 FSC between consensus map and Gaussian density</p></figcaption></figure>

* We provide a command-line tool, `cstar_generate_gaussian_density`, that generates a coarse-grained Gaussian density from a PDB file. Usage: `cstar_generate_gaussian_density <pdb_file_path.pdb> <shape> <apix>`, where the arguments are the PDB file path, the shape of the generated density, and the pixel size in Å. The result is saved in the current directory under the input file name plus `_gaussian.mrc`. For this tutorial, run `cstar_generate_gaussian_density 5nrl.pdb 320 1.7` to get an output file called `5nrl_gaussian.mrc`.
* You can then use the [EMAN2 `e2proc3d.py`](https://blake.bcm.edu/emanwiki/EMAN2/Programs/e2proc3d) tool to calculate the FSC curve between the generated Gaussian density and the consensus map.
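
Once you have an FSC curve, reading off the FSC = 0.5 resolution is a small interpolation exercise. The sketch below assumes the curve is available as two lists, spatial frequency in 1/Å and the FSC value at that frequency. The numbers in the example are invented for illustration and are not the EMPIAR-10180 curve.

```python
def resolution_at_threshold(freqs, fsc, threshold=0.5):
    """Resolution (Å) at the first point where the FSC drops below `threshold`.

    freqs: spatial frequencies in 1/Å, increasing
    fsc:   FSC values at those frequencies
    Returns None if the curve never crosses the threshold.
    """
    for i in range(1, len(fsc)):
        if fsc[i - 1] >= threshold > fsc[i]:
            # Linear interpolation between the two bracketing samples.
            t = (fsc[i - 1] - threshold) / (fsc[i - 1] - fsc[i])
            f_cross = freqs[i - 1] + t * (freqs[i] - freqs[i - 1])
            return 1.0 / f_cross
    return None


# Hypothetical curve, for illustration only.
example_freqs = [0.01, 0.02, 0.04, 0.05, 0.06]
example_fsc = [1.00, 0.95, 0.60, 0.40, 0.10]

print(resolution_at_threshold(example_freqs, example_fsc))
```

For this made-up curve the crossing lies halfway between 0.04 and 0.05 1/Å, which corresponds to about 22.2 Å. The value found this way on your own curve is what you would enter as `low_pass_bandwidth`.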

## CryoSTAR training 🚀

CryoSTAR configures the dataset and training hyper-parameters through the `configs/xxxxx.py` file. The code is organized in two parts: a general-purpose Python package, `cryostar`, and a directory for specific methods, `projects/star`. Training requires a GPU, so make sure the machine has one and that `torch` can use it.
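
A quick way to check the GPU requirement before launching a long run is to ask `torch` directly. This is a small sketch; the helper name `gpu_status` is hypothetical, and it only reports what `torch` can see.

```python
def gpu_status():
    """Return a short description of whether torch can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    if torch.cuda.is_available():
        count = torch.cuda.device_count()
        name = torch.cuda.get_device_name(0)
        return f"{count} CUDA device(s) visible, first device: {name}"
    return "torch is installed, but no CUDA device is visible"


print(gpu_status())
```

If this reports that no CUDA device is visible, fix the driver or `torch` installation before starting training.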

### Training the atom generator

* The atomic-model-based training script, `train_atom.py`, can be found in the `projects/star` directory. The configuration files for different cases reside in the `projects/star/atom_configs` folder, and the configuration file for this tutorial is `projects/star/atom_configs/10180.py`.
* In the `10180.py` file, the `dataset_dir` key in the `dataset_attr` dictionary specifies the location of the data downloaded from EMPIAR, and the `ref_pdb_path` key points to the reference PDB file obtained in the previous steps. Change these values to match your own file locations.
* After that, run the script to start training:

  ```shell
  $ python train_atom.py atom_configs/10180.py
  ```
* The logs printed on screen should look like:

  ```
  ...
  2023/10/12 12:31:10 - cryostar - INFO - epoch 0 [0/5117] | loss: -0.36890 | cryoem(gmm): -0.37057 | con: 0.00072 | sse: 0.00000 | dist: 0.00058 | clash: 0.00037 | kld: 0.00000 | kld(/dim): 3.00000
  2023/10/12 12:31:17 - cryostar - INFO - epoch 0 [50/5117] | loss: -0.41110 | cryoem(gmm): -0.41165 | con: 0.00008 | sse: 0.00000 | dist: 0.00047 | clash: 0.00000 | kld: 0.00000 | kld(/dim): 3.00000
  ...
  ```

  and end with:

  ```
  `Trainer.fit` stopped: `max_steps=96000` reached.
  ```
* After training finishes, the outputs are stored in the `work_dirs/atom_xxxxx` directory. Evaluation runs every 12,000 steps, and each evaluation writes a sub-directory named `epoch-number_step-number`. We take the most recent directory as the final result.

  ```
  atom_xxxxx/
  ├── 0000_0000000/
  ├── ...
  ├── 0112_0096000/        # evaluation results
  │  ├── ckpt.pt           # model parameters
  │  ├── input_image.png   # visualization of input cryo-EM images
  │  ├── pca-1.pdb         # sampled coarse-grained atomic structures along 1st PCA axis
  │  ├── pca-2.pdb
  │  ├── pca-3.pdb
  │  ├── pred.pdb          # sampled structures at Kmeans cluster centers
  │  ├── pred_gmm_image.png
  │  └── z.npy             # the latent code of each particle
  |                        # a matrix whose shape is num_of_particle x 8
  ├── yyyymmdd_hhmmss.log  # running logs
  ├── config.py            # a backup of the config file
  └── train_atom.py        # a backup of the training script
  ```
* We recommend using [ChimeraX](https://www.cgl.ucsf.edu/chimerax/) to visualize the sampled PDB structures. Some helpful commands are:

  ```
  # show as ribbon diagram
  hide atoms
  show cartoons

  # show as animation
  mseries all

  # shows a graphical interface in which the slider can be dragged
  mseries slider all
  ```
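
Since the final result is simply the most recent `epoch-number_step-number` directory, picking it can be scripted. The sketch below is a hypothetical helper, not part of cryoSTAR. It assumes the sub-directory names consist of two zero-padded integers joined by an underscore, as in the listing above, and it ignores everything else in the work directory.

```python
import re
import tempfile
from pathlib import Path

# Matches names such as "0000_0000000" (epoch, then step).
_EVAL_DIR = re.compile(r"^(\d+)_(\d+)$")


def latest_eval_dir(work_dir):
    """Return the evaluation sub-directory with the highest step number, or None."""
    best = None
    for child in Path(work_dir).iterdir():
        match = _EVAL_DIR.match(child.name)
        if child.is_dir() and match:
            key = (int(match.group(2)), int(match.group(1)))  # (step, epoch)
            if best is None or key > best[0]:
                best = (key, child)
    return None if best is None else best[1]


if __name__ == "__main__":
    # Self-contained demo with made-up directory names.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("0000_0000000", "0009_0048000", "0018_0096000"):
            (Path(tmp) / name).mkdir()
        print(latest_eval_dir(tmp).name)  # 0018_0096000
```

In practice you would call `latest_eval_dir("work_dirs/atom_xxxxx")` with your actual work directory and take `z.npy` from the returned path for the density training step.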

### Training the density generator

* The density-based training script, `train_density.py`, can be found in the `projects/star` directory. The configuration files for different cases reside in the `projects/star/density_configs` folder, and the configuration file for this tutorial is `projects/star/density_configs/10180.py`.
* In this `10180.py` file, the `dataset_dir` key in the `dataset_attr` dictionary specifies the location of the data downloaded from EMPIAR, and the `given_z` key in the `extra_input_data_attr` dictionary points to the `z.npy` file produced by the previous step. Change these values to match your own file locations. Alternatively, you can set them through command-line arguments.
* Run the following command to start training:

  ```shell
  $ python train_density.py density_configs/10180.py
  ```

  To change configuration values from the command line, for example:

  ```shell
  $ python train_density.py density_configs/10180.py --cfg-options extra_input_data_attr.given_z=work_dirs/atom_10180/0018_0096000/z.npy
  ```

  Our configs use `mmengine`; see its official [documentation](https://mmengine.readthedocs.io/en/latest/advanced_tutorials/config.html#modify-the-fields-in-command-line) for details.
* The logs printed on screen should look like:

  ```
  2023/10/12 22:00:53 - cryostar - INFO - Config:
  ...
  2023/10/12 22:04:07 - cryostar - INFO - epoch 0 [0/6823] | em: 0.08792 | kld: 0.00000
  2023/10/12 22:04:53 - cryostar - INFO - epoch 0 [100/6823] | em: 0.04898 | kld: 0.00000
  ...
  ```

  and end with:

  ```
  `Trainer.fit` stopped: `max_epochs=5` reached.
  ```
* After training finishes, the results are saved to `work_dirs/density_xxxxx`, with each subdirectory named `epoch-number_step-number`. We take the most recent directory as the final result.

  ```
  density_xxxxx/
  ├── 0004_0014470/          # evaluation results
  │  ├── ckpt.pt             # model parameters
  │  ├── vol_pca_1_000.mrc   # density sampled along the PCA axis, named by vol_pca_pca-axis_serial-number.mrc
  │  ├── ...
  │  ├── vol_pca_3_009.mrc
  │  ├── z.npy
  │  ├── z_pca_1.txt         # sampled z values along the 1st PCA axis
  │  ├── z_pca_2.txt
  │  └── z_pca_3.txt
  ├── yyyymmdd_hhmmss.log    # running logs
  ├── config.py              # a backup of the config file
  └── train_density.py       # a backup of the training script
  ```
* Some tips for visualization in ChimeraX:

  ```
  # set to same threshold, replace xxxx with desired iso-surface level
  vol all level xxxx
  vol all color cornflowerblue

  # show as animation
  mseries all
  # or
  mseries slider all
  ```
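
A note on the `--cfg-options` override used earlier in this section: a dotted key such as `extra_input_data_attr.given_z` addresses a nested entry of the config. The sketch below illustrates that mapping with plain dictionaries. It is a conceptual illustration only, not `mmengine`'s code, and the `dataset_dir` value is a placeholder.

```python
def apply_override(cfg, dotted_key, value):
    """Set cfg[a][b][c] = value for dotted_key 'a.b.c', creating dicts as needed."""
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return cfg


# A stripped-down stand-in for density_configs/10180.py (placeholder values).
cfg = {
    "dataset_attr": {"dataset_dir": "path/to/10180"},
    "extra_input_data_attr": {"given_z": None},
}

apply_override(
    cfg,
    "extra_input_data_attr.given_z",
    "work_dirs/atom_10180/0018_0096000/z.npy",
)
print(cfg["extra_input_data_attr"]["given_z"])
```

Only the addressed entry changes; other keys, such as `dataset_attr.dataset_dir`, keep the values from the config file.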

