> For the complete documentation index, see [llms.txt](https://byte-research.gitbook.io/cryostar/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://byte-research.gitbook.io/cryostar/try-on-your-own-data.md).

# Try on Your Own Data!

## Data preparation 📚

<figure><img src="/files/E6YpWQGzeQPTkyzReuNf" alt="" width="563"><figcaption><p>Data we need to prepare for cryoSTAR</p></figcaption></figure>

What users need to prepare is a particle stack with pose information (existing in the form of a `starfile` and  `.mrcs`  files, the corresponding consensus map, and the built atomic model based on the consensus map.

### Particle stack

Our method requires properly prepared particle stacks, which carry pose and CTF information. These should exist in the form of `starfile` and `mrcs` files.

1. **Exploring Datasets from EMPIAR**
   * You could try using publicly available datasets from [EMPIAR](https://www.ebi.ac.uk/empiar). There are several datasets on EMPIAR that contain heterogeneity. Opting for preprocessed datasets that provide particle stacks, including consensus map pose, is advisable. However, if these aren't available, you'll have to process the data from micrographs by yourself.
   * If you're not accustomed to downloading datasets from EMPIAR, feel free to refer to these guidelines:
     * [Aspera Connect](https://www.ibm.com/aspera/connect/) is required for downloading large datasets from [EMPIAR](https://www.ebi.ac.uk/empiar/). The necessary file `asperaweb_id_dsa.openssh` for downloading from EMPIAR is only provided by Aspera's version less than 4.2. Hence, a latest version 4.1.3 of Aspera Connect is recommended. If you don't have it installed, please refer to the discussion [here](https://www.biostars.org/p/9528910/) and follow the instructions to download and install it.
     * An example downloading command is showing below, you can see more information on official [website](https://www.ebi.ac.uk/empiar/faq#question_CLDownload).

       <pre class="language-shell"><code class="lang-shell"><strong>$ ascp -QT -l 200M -P33001 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh emp_ext2@fasp.ebi.ac.uk:/test test_download
       </strong></code></pre>
2. **Applying CryoSTAR to Your Own Experimental Data**
   * If you are using RELION to handle your experimental data, you can easily export the job related to the consensus map.
   * If you are using cryoSPARC, we recommend exporting the results of the `homogeneous refinement` job or the `heterogeneous refinement` job.
     * Be aware that the exported data from cryoSPARC will be in the `.cs` format -- `numpy` data files. You can use the `csparc2star.py` tool provided by [pyem](https://github.com/asarnow/pyem) to convert `.cs` files into `.star` files.

Please ensure you have prepared your dataset correctly before moving towards code execution. This will facilitate smoother operations and accurate results.

### Retrieving Reference PDB <a href="#retrieving-reference-pdb" id="retrieving-reference-pdb"></a>

In order to make the most of our method, it's crucial to have a suitable reference PDB structure. Depending on the source of your data, there are various methods to obtain the structure:

1. **Using EMPIAR's Datasets:** Typically, datasets from EMPIAR have associated PDB files that can be conveniently downloaded. However, you should ensure that the PDB files offer relatively complete density modeling, especially in potentially dynamic areas. Even if the PDB structure is not entirely accurate, it's quite beneficial.
2. **On Your Own Dataset:** This pathway could necessitate the traditional model building process to create an atomic model to serve as a reference PDB structure.

Additionally, consider using structures from AlphaFold2 or some homologous sequence structures as required references.

**Note on Coordinate Origin** ⚠️

PDB structure usually determines the coordinate origin, typically corresponding to the bottom-left corner of the electron microscope density map. Applying rotation matrices directly on these coordinates is erroneous. The atomic structure's coordinate origin needs to travel to the center of the electron microscope density map.

<figure><img src="/files/aOwXkWg7xttDhBENK8v0" alt="" width="563"><figcaption><p>Illustration of coordinate origin using EMPIAR-10827</p></figcaption></figure>

Firstly, ensure that you have the consensus map constructed from the particle stack and the corresponding atomic model is docked into the map. Next, you could simply run the `cstar_center_origin` command:

<pre><code><strong>(cryostar) $ cstar_center_origin &#x3C;pdb_file_path.pdb> &#x3C;consensus_map_path.mrc>
</strong></code></pre>

The output appeared in the current directory named `<pdb_file_name>_centered.pdb` is the reference structure we need.

### Optional: Determine low-pass filter bandwidth

This is an optional step. It involves the value of `low_pass_bandwidth` in the configuration file in `atom_configs`. You can generally set it straightforwardly to 10.

Within the atomic model-dependent training procedure, cryoSTAR makes use of low-frequency filtering to bridge the gap between Gaussian density and the real density. In the practice, we determine the cut-off frequency as the resolution at FSC=0.5 by evaluating the FSC between the Gaussian density linked with the atomic model and the consensus map reconstructed given pose.

Assuming you've gotten a particle stack containing pose, the reconstruction of the consensus map can be accomplished by using the `relion_reconstruct` command supported by [RELION](https://github.com/3dem/relion). If the tasks are exported from RELION or CryoSPARC, it's straightforward to receive the consensus map straight from these tasks. Take heed to tweak the origin of these `mrc` files via earlier noted steps.

The atomic model based Gaussian density can be generated by:

```
(cryostar) $ cstar_generate_gaussian_density <pdb_file_path.pdb> <shape> <apix>
```

This command will generate a file named with `<your_input_pdb_file_name>_gaussian.mrc` in the current directory. If you don't know how to set the `shape` and `apix` arguments, check the consensus map `shape` and `apix` using `cstar_show_mrc_info <consensus_map_path.mrc>`, it will print something like:

```
The mrcfile contains volume data with:
shape (nz, ny, nx):     (480, 480, 480)
voxel_size/A (x, y, z): (1.43, 1.43, 1.43)
origin/A (x, y, z):     (0., 0., 0.)
```

For the example output above, the `shape` is `480` and `apix` is `1.43`.

You can calculate the statistics of the FSC curve by [EMAN e2proc3d.py](https://blake.bcm.edu/emanwiki/EMAN2/Programs/e2proc3d):

```python
$ e2proc3d.py consensus_map.mrc output.txt --calcfsc gaussian_density.mrc
```

Then use [`plotfsc.py`](https://github.com/cryoem/eman2/blob/master/examples/plotfsc.py) script to show the FSC curve and choose the cutoff frequency.

<figure><img src="/files/VOZy5IOPQZY25NPvHDny" alt="" width="563"><figcaption><p>EMPIAR-10073 FSC between consensus map and Gaussian density</p></figcaption></figure>

This is an optional step. It's feasible to directly use a bandwidth around 10-15A, bypassing the complicated steps of defining the cut-off frequency.

## Run CryoSTAR 🚀

The code is structured into two primary parts:

1. `cryostar`: A universal function python package. This package encompasses all the basic components that could be used in multiple projects, thus offering scalability and high reusability.
2. `projects/star`: This directory constitutes the code necessary for running cryoSTAR.

Configuration of input parameters for various data sets is made possible through `.py` files within the `projects/star/xxx_configs/` directory.

In order to run our method, you will need to navigate to the `projects/star` directory using:

```sh
$ cd projects/star
```

Alternatively, you could copy the `projects/star` directory to a location of your choice. This can be accomplished using the `cp -r` command like so:

```sh
$ cp -r projects/star <desired_directory>
```

This makes the process more flexible, as you can choose to run the method either directly from the original folder, or from a copied version in a location that suits your workflow better. Moving on to the procedure for running our method, it is framed in two core steps:

### Step 1: **Training the atom generator**

This step involves executing the `train_atom.py` script. Your primary task is to set up your own `config.py` according to your dataset and place it within the `atom_configs` directory.&#x20;

We provide you with a reference script namely `example.py` in `atom_configs/` directory, copy and modify it.

This `example.py` can serve as a starting point. You are encouraged to copy this `example.py` and modify the parameters and filename to suit your data. Remember to give your customized `config.py` a distinct name that represents your dataset, which will further help in smooth execution and debug, if needed. At this point, what you mainly need to change are the values within the `dataset_attr` dict. A dict in Python contains key-value pairs. Your task is to adjust the `values` associated with respective `keys` to suit the requirements of your data.

After that, run the script to start training:

```shell
$ python train_atom.py atom_configs/xxxxx.py
```

Logs print on screen should be like:

```
...
2023/10/12 12:31:10 - cryostar - INFO - epoch 0 [0/5117] | loss: -0.36890 | cryoem(gmm): -0.37057 | con: 0.00072 | sse: 0.00000 | dist: 0.00058 | clash: 0.00037 | kld: 0.00000 | kld(/dim): 3.00000
2023/10/12 12:31:17 - cryostar - INFO - epoch 0 [50/5117] | loss: -0.41110 | cryoem(gmm): -0.41165 | con: 0.00008 | sse: 0.00000 | dist: 0.00047 | clash: 0.00000 | kld: 0.00000 | kld(/dim): 3.00000
...
```

and end with:

```
`Trainer.fit` stopped: `max_steps=96000` reached.
```

After the training is finished, the outputs will be stored in the `work_dirs/atom_xxxxx` directory, and we perform evaluations every 12,000 steps. Within this directory, you'll observe sub-directories with the name `epoch-number_step-number`. We choose the most recent directory as the final results.

```
atom_xxxxx/
├── 0000_0000000/
├── ...
├── 0112_0096000/        # evaluation results
│  ├── ckpt.pt           # model parameters
│  ├── input_image.png   # visualization of input cryo-EM images
│  ├── pca-1.pdb         # sampled coarse-grained atomic structures along 1st PCA axis
│  ├── pca-2.pdb
│  ├── pca-3.pdb
│  ├── pred.pdb          # sampled structures at Kmeans cluster centers
│  ├── pred_gmm_image.png
│  └── z.npy             # the latent code of each particle
|                        # a matrix whose shape is num_of_particle x 8
├── yyyymmdd_hhmmss.log  # running logs
├── config.py            # a backup of the config file
└── train_atom.py        # a backup of the training script
```

We recommend you to use [ChimeraX](https://www.cgl.ucsf.edu/chimerax/) for visualizing sampled pdb structures. Some helpful commands are:

```
# show as ribbon diagram
hide atoms
show cartoons

# show as animation
mseries all

# shows a graphical interface in which the slider can be dragged
mseries slider all
```

### Step 2: **Training the density generator**

This step involves executing the `train_density.py` script. Your primary task is to set up your own `config.py` according to your dataset and place it within the `density_configs` directory.&#x20;

We provide you with a reference script namely `example.py` in `density_configs/` directory.

In this step, you need to make changes in two areas within the `example.py`: 1) Just like in Step 1, you need to modify the data-related configurations `dataset_attr` here too. However, you now have an option to use a larger `data_process.down_side_shape` to obtain a better quality of density. 2) The `given_z` in the `extra_input_data_attr` dict should be set as the path of the latest output `z.npy` file from Step 1.

Run the following command to start training:

```shell
$ python train_density.py density_configs/xxxxx.py
```

If you want to change configurations in command line, an example is:

```shell
$ python train_density.py density_configs/xxxxx.py --cfg-options extra_input_data_attr.given_z=work_dirs/atom_xxxxxx/0018_0096000/z.npy
```

Our config use `mmengine`, see details on their official [documentation](https://mmengine.readthedocs.io/en/latest/advanced_tutorials/config.html#modify-the-fields-in-command-line).

Logs print on screen should be like:

```
2023/10/12 22:00:53 - cryostar - INFO - Config:
...
2023/10/12 22:04:07 - cryostar - INFO - epoch 0 [0/6823] | em: 0.08792 | kld: 0.00000
2023/10/12 22:04:53 - cryostar - INFO - epoch 0 [100/6823] | em: 0.04898 | kld: 0.00000
...
```

and end with:

```
`Trainer.fit` stopped: `max_epochs=5` reached.
```

After the training is finished, results will be saved to `work_dirs/density_xxxxx`, and each subdirectory has the name `epoch-number_step-number`. We choose the most recent directory as the final results.

```
density_xxxxx/
├── 0004_0014470/          # evaluation results
│  ├── ckpt.pt             # model parameters
│  ├── vol_pca_1_000.mrc   # density sampled along the PCA axis, named by vol_pca_pca-axis_serial-number.mrc
│  ├── ...
│  ├── vol_pca_3_009.mrc
│  ├── z.npy
│  ├── z_pca_1.txt         # sampled z values along the 1st PCA axis
│  ├── z_pca_2.txt
│  └── z_pca_3.txt
├── yyyymmdd_hhmmss.log    # running logs
├── config.py              # a backup of the config file
└── train_density.py       # a backup of the training script
```

Some tips for visualization in ChimeraX:

```
# set to same threshold, replace xxxx with desired iso-surface level
vol all level xxxx
vol all color cornflowerblue

# show as animation
mseries all
# or
mseries slider all
```


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://byte-research.gitbook.io/cryostar/try-on-your-own-data.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
