1 | | - |
2 | 1 | # **DeepCoil** # |
3 | | -Accurate prediction of coiled coil domains in protein sequences. |
4 | | - |
5 | | -## **Installation** ## |
6 | | -First clone this repository: |
7 | | -```bash |
8 | | -$ git clone https://github.com/labstructbioinf/DeepCoil.git |
9 | | -``` |
10 | | -Required packages to run DeepCoil are listed in the **`requirements.txt`** file. |
11 | | -We suggest running DeepCoil in the virtual environment: |
12 | | -If you don't have virtualenv installed do so: |
13 | | -```bash |
14 | | -$ pip3 install virtualenv |
15 | | -``` |
16 | | -Create virtual environment and install required packages: |
17 | | -```bash |
18 | | -$ cd virtual_envs_location |
19 | | -$ virtualenv deepcoil_env |
20 | | -$ source deepcoil_env/bin/activate |
21 | | -$ cd DEEPCOIL_LOCATION |
22 | | -$ pip3 install -r requirements.txt |
23 | | -``` |
24 | | -Test the installation: |
| 2 | +[DOI: 10.1093/bioinformatics/bty1062](https://doi.org/10.1093/bioinformatics/bty1062)
| 3 | + |
| 4 | + |
| 5 | +## **Fast and accurate prediction of coiled coil domains in protein sequences** |
| 6 | +### **New in version 2.0** ### |
| 7 | +- Faster inference time by applying *[SeqVec](https://github.com/rostlab/SeqVec)* embeddings instead of *psiblast* profiles. |
| 8 | +- Additional heptad predictions (*a* and *d* core positions). |
| 9 | +- No maximum sequence length limit. |
| 10 | +- Convenient interface for using *DeepCoil* within python scripts. |
| 11 | +- Automated peak detection for improved output readability. |
| 12 | +- Simplified installation with *pip*. |
| 13 | + |
| 14 | +Older DeepCoil versions are available [here](https://github.com/labstructbioinf/DeepCoil/releases). |
| 15 | + |
| 16 | +### **Requirements and installation** ### |
| 17 | +DeepCoil requires `python>=3.6.1` and `pip>=19.0`. Other requirements are specified in the `requirements.txt` file. |
| 18 | + |
| 19 | +The most convenient way to install **DeepCoil** is to use pip: |
25 | 20 | ```bash |
26 | | -$ ./run_example.sh |
| 21 | +$ pip3 install deepcoil |
27 | 22 | ``` |
28 | | -This should produce output **`example/out_pssm/GCN4_YEAST.out`** identical to **`example/out_pssm/GCN4_YEAST.out.bk`** and accordingly for the **`example/out_seq/`** directory. |
29 | | - |
30 | | -## **Usage** ## |
| 23 | + |
| 24 | +### **Usage** ### |
| 25 | + |
| 26 | +#### Running the DeepCoil standalone version:
| 27 | + |
31 | 28 | ```bash |
32 | | -python3.5 deepcoil.py [-h] -i FILE [-out_path DIR] [-pssm] [-pssm_path DIR] |
| 29 | +deepcoil [-h] -i FILE [-out_path DIR] [-n_cpu NCPU] [--gpu] [--plot] |
| 30 | + [--dpi DPI] |
33 | 31 | ``` |
34 | | -| Option | Description | |
| 32 | +| Argument | Description | |
35 | 33 | |:-------------:|-------------| |
36 | 34 | | **`-i`** | Input file in FASTA format. Can contain multiple entries. | |
37 | | -| **`-pssm`** | Flag for the PSSM-mode. If enabled DeepCoil will require psiblast PSSM files in the pssm_path. Otherwise only sequence information will be used.| |
38 | | -| **`-pssm_path`** | Directory with psiblast PSSM files. For each entry in the input fasta file there must be a PSSM file. | |
39 | | -| **`-out_path`** | Directory where the predictions are saved. For each entry one file will be saved. | |
40 | | -| **`-out_type`** | Output type. Either **'ascii'** (default), which will write single file for each entry in input or **'h5'** which will generate single hdf5 file storing all predictions. | |
41 | | -| **`-out_filename`** | Works with **"-out_type h5"** option and specifies the hdf5 output filename Overrides the **-out_path** if specified. | |
42 | | -| **`-min_residue_score`** | Number in the range <0,1>. DeepCoil will return sequences that have at least one residue with score greater than min_residue_score | |
43 | | -| **`-min_segment_length`** | Number greater than 0. DeepCoil will return sequences that contain a segment of length **-min_segment_length** or more. To be used with **-min_residue_score** | |
| 35 | +| **`-out_path`** | Directory where the predictions are saved. For each entry in the input file one file will be saved. Defaults to the current directory if not specified.| |
| 36 | +| **`-n_cpu`** | Number of CPUs to use in the prediction. By default, all cores will be used.|
| 37 | +| **`--gpu`** | Flag for turning on the GPU usage. Allows faster inference on large datasets. Overrides **`-n_cpu`** option.| |
| 38 | +| **`--plot`** | Turns on the additional visual output of the predictions for each entry in the input. Plot files are saved in the **`-out_path`** directory.| |
| 39 | +| **`--dpi`** | DPI of the saved plots. Takes effect only with the **`--plot`** option.|
44 | 40 |
45 | | -Results of **`-min_residue_score`** and **`-min_segment_length`** filters are stored in directories located in **`-out_path`**. |
| 41 | +In the rare case that `deepcoil` is not available in your `PATH` after installation, look in `$HOME/.local/bin/` or another system-specific `pip` directory.
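
For batch runs it can be handy to drive the standalone command from Python. The following is a minimal sketch that uses only the arguments listed in the table above. `build_deepcoil_cmd` is a hypothetical helper name and not part of the `deepcoil` package, and creating the output directory up front is a precaution rather than a documented requirement.

```python
import shutil
import subprocess
from pathlib import Path


def build_deepcoil_cmd(fasta, out_dir, n_cpu=None, gpu=False, plot=False, dpi=None):
    """Assemble the argument list for the `deepcoil` command (hypothetical helper)."""
    cmd = ["deepcoil", "-i", str(fasta), "-out_path", str(out_dir)]
    if gpu:
        # --gpu overrides -n_cpu, so -n_cpu is left out when GPU mode is requested
        cmd.append("--gpu")
    elif n_cpu is not None:
        cmd += ["-n_cpu", str(n_cpu)]
    if plot:
        cmd.append("--plot")
        if dpi is not None:
            # --dpi is only active together with --plot
            cmd += ["--dpi", str(dpi)]
    return cmd


if __name__ == "__main__":
    out_dir = Path("results")
    cmd = build_deepcoil_cmd("example/example.fas", out_dir, n_cpu=4, plot=True, dpi=300)
    if shutil.which("deepcoil") is None:
        # Covers the case where the executable is not on PATH
        print("deepcoil not found on PATH; would run:", " ".join(cmd))
    else:
        out_dir.mkdir(exist_ok=True)  # assumption: harmless if the tool creates it itself
        subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems with unusual file names.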
46 | 42 |
47 | | -PSSM filenames should be based on the identifiers in the fasta file (only alphanumeric characters and '_'). For example if a fasta sequence is as follows: |
48 | | -``` |
49 | | ->GCN4_YEAST RecName: Full=General control protein GCN4; AltName: Full=Amino acid biosynthesis regulatory protein |
50 | | -MSEYQPSLFALNPMGFSPLD.... |
51 | | -``` |
52 | | -PSSM file should be named **`GCN4_YEAST.pssm`**. |
53 | | - |
54 | | -You can generate PSSM files with the following command (requires NR90 database): |
55 | | -```bash |
56 | | -psiblast -query GCN4_YEAST.fasta -db NR90_LOCATION -evalue 0.001 -num_iterations 3 -out_ascii_pssm GCN4_YEAST.pssm |
57 | | -``` |
58 | | -In order to generate PSSM file from multiple sequence alignment (MSA) you can use this command: |
59 | | -```bash |
60 | | -psiblast -subject sequence.fasta -in_msa alignment.fasta -out_ascii_pssm output.pssm |
| 43 | +#### Running DeepCoil within a script:
| 44 | + |
| 45 | +```python |
| 46 | +from deepcoil import DeepCoil |
| 47 | +from deepcoil.utils import plot_preds |
| 48 | +from Bio import SeqIO |
| 49 | + |
| 50 | +dc = DeepCoil(use_gpu=True) |
| 51 | + |
| 52 | +inp = {str(entry.id): str(entry.seq) for entry in SeqIO.parse('example/example.fas', 'fasta')} |
| 53 | + |
| 54 | +results = dc.predict(inp) |
| 55 | + |
| 56 | +plot_preds(results['3WPA_1'], out_file='example/example.png') |
61 | 57 | ``` |
| 58 | +`results[entry]` for an entry of sequence length `N` contains two keys: |
| 59 | +- `['cc']` - per residue coiled coil propensity (`[N, 1]` shape) |
| 60 | +- `['hept']` - per residue core positions (`[N, 3]` shape, order in the second axis is: no/other position, *a* position, *d* position) |
| 61 | + |
| 62 | +Peak detection can be performed with the `deepcoil.utils.sharpen_preds` helper function. |
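
The per-residue arrays can also be post-processed with plain NumPy. The sketch below is not the `sharpen_preds` helper and does not call the `deepcoil` package. It works on synthetic arrays with the `[N, 1]` and `[N, 3]` shapes described above. The function name `summarize_entry` and the `0.5` propensity cut-off are assumptions chosen for illustration.

```python
import numpy as np


def summarize_entry(cc, hept, cc_threshold=0.5):
    """Return coiled coil segments and a per-residue core-position string.

    cc   : array of shape [N, 1], per-residue coiled coil propensity
    hept : array of shape [N, 3], columns: no/other position, a, d
    cc_threshold : illustrative cut-off (assumption, not a DeepCoil default)
    """
    cc = np.asarray(cc, dtype=float).reshape(-1)   # [N, 1] -> [N]
    hept = np.asarray(hept, dtype=float)

    # Mark residues at or above the cut-off, then locate run boundaries
    in_cc = cc >= cc_threshold
    padded = np.concatenate(([False], in_cc, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    # Segments are 0-based and half-open: (start, end_exclusive)
    segments = [(int(s), int(e)) for s, e in zip(edges[::2], edges[1::2])]

    # Pick the highest-scoring column for each residue
    labels = np.array(["-", "a", "d"])[hept.argmax(axis=1)]
    return segments, "".join(labels)


# Synthetic example data (not real DeepCoil output)
cc_demo = np.array([[0.1], [0.9], [0.8], [0.2], [0.7]])
hept_demo = np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.70, 0.20, 0.10],
    [0.20, 0.10, 0.70],
    [0.60, 0.20, 0.20],
])
segments_demo, positions_demo = summarize_entry(cc_demo, hept_demo)
print(segments_demo, positions_demo)
```

With real predictions, `cc_demo` and `hept_demo` would be replaced by `results[entry]['cc']` and `results[entry]['hept']`.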
| 63 | +#### Example graphical output: |
| 64 | + |