Commit 58a3119

Merge pull request #9 from labstructbioinf/develop
DeepCoil2
2 parents 5385315 + 989bf6d commit 58a3119

55 files changed: +873 -3220 lines changed

.github/workflows/main.yml

Lines changed: 25 additions & 0 deletions

    @@ -0,0 +1,25 @@
    +name: deepcoil
    +on: [push]
    +jobs:
    +  build:
    +    runs-on: ubuntu-latest
    +    strategy:
    +      matrix:
    +        os: [ubuntu-latest, macos-latest]
    +        python-version: [3.6, 3.7, 3.8]
    +    steps:
    +    - uses: actions/checkout@v2
    +    - name: Set up Python
    +      uses: actions/setup-python@v2
    +      with:
    +        python-version: ${{ matrix.python-version }}
    +    - name: Install dependencies
    +      run: |
    +        python -m pip install --upgrade pip setuptools wheel twine pytest pytest-cov
    +        pip install -r requirements.txt
    +    - name: Set environment
    +      run: |
    +        echo "::set-env name=PYTHONPATH::/home/runner/work/DeepCoil/DeepCoil"
    +    - name: Test with pytest
    +      run: |
    +        pytest -v

.travis.yml

Lines changed: 0 additions & 9 deletions
This file was deleted.

MANIFEST.in

Lines changed: 2 additions & 0 deletions

    @@ -0,0 +1,2 @@
    +include deepcoil/weights/*.h5
    +include deepcoil/models/seq.json

README.md

Lines changed: 54 additions & 51 deletions

    @@ -1,61 +1,64 @@
    -![Build Status](https://travis-ci.org/labstructbioinf/DeepCoil.svg?branch=master)
     # **DeepCoil** #
    -Accurate prediction of coiled coil domains in protein sequences.
    -
    -## **Installation** ##
    -First clone this repository:
    -```bash
    -$ git clone https://github.com/labstructbioinf/DeepCoil.git
    -```
    -Required packages to run DeepCoil are listed in the **`requirements.txt`** file.
    -We suggest running DeepCoil in the virtual environment:
    -If you don't have virtualenv installed do so:
    -```bash
    -$ pip3 install virtualenv
    -```
    -Create virtual environment and install required packages:
    -```bash
    -$ cd virtual_envs_location
    -$ virtualenv deepcoil_env
    -$ source deepcoil_env/bin/activate
    -$ cd DEEPCOIL_LOCATION
    -$ pip3 install -r requirements.txt
    -```
    -Test the installation:
    +[![DOI:10.1093/bioinformatics/bty1062](https://zenodo.org/badge/DOI/10.1093/bioinformatics/bty1062.svg)](https://doi.org/10.1093/bioinformatics/bty1062 )
    +![build](https://github.com/labstructbioinf/DeepCoil/workflows/deepcoil/badge.svg)
    +
    +## **Fast and accurate prediction of coiled coil domains in protein sequences**
    +### **New in version 2.0** ###
    +- Faster inference time by applying *[SeqVec](https://github.com/rostlab/SeqVec)* embeddings instead of *psiblast* profiles.
    +- Additional heptad predictions (*a* and *d* core positions).
    +- No maximum sequence length limit.
    +- Convenient interface for using *DeepCoil* within python scripts.
    +- Automated peak detection for improved output readability.
    +- Simplified installation with *pip*.
    +
    +Older DeepCoil versions are available [here](https://github.com/labstructbioinf/DeepCoil/releases).
    +
    +### **Requirements and installation** ###
    +DeepCoil requires `python>=3.6.1` and `pip>=19.0`. Other requirements are specified in the `requirements.txt` file.
    +
    +The most convenient way to install **DeepCoil** is to use pip:
     ```bash
    -$ ./run_example.sh
    +$ pip3 install deepcoil
     ```
    -This should produce output **`example/out_pssm/GCN4_YEAST.out`** identical to **`example/out_pssm/GCN4_YEAST.out.bk`** and accordingly for the **`example/out_seq/`** directory.
    -
    -## **Usage** ##
    +
    +### **Usage** ###
    +
    +#### Running DeepCoil standalone version:
    +
     ```bash
    -python3.5 deepcoil.py [-h] -i FILE [-out_path DIR] [-pssm] [-pssm_path DIR]
    +deepcoil [-h] -i FILE [-out_path DIR] [-n_cpu NCPU] [--gpu] [--plot]
    +         [--dpi DPI]
     ```
    -| Option | Description |
    +| Argument | Description |
     |:-------------:|-------------|
     | **`-i`** | Input file in FASTA format. Can contain multiple entries. |
    -| **`-pssm`** | Flag for the PSSM-mode. If enabled DeepCoil will require psiblast PSSM files in the pssm_path. Otherwise only sequence information will be used.|
    -| **`-pssm_path`** | Directory with psiblast PSSM files. For each entry in the input fasta file there must be a PSSM file. |
    -| **`-out_path`** | Directory where the predictions are saved. For each entry one file will be saved. |
    -| **`-out_type`** | Output type. Either **'ascii'** (default), which will write single file for each entry in input or **'h5'** which will generate single hdf5 file storing all predictions. |
    -| **`-out_filename`** | Works with **"-out_type h5"** option and specifies the hdf5 output filename Overrides the **-out_path** if specified. |
    -| **`-min_residue_score`** | Number in the range <0,1>. DeepCoil will return sequences that have at least one residue with score greater than min_residue_score |
    -| **`-min_segment_length`** | Number greater than 0. DeepCoil will return sequences that contain a segment of length **-min_segment_length** or more. To be used with **-min_residue_score** |
    +| **`-out_path`** | Directory where the predictions are saved. For each entry in the input file one file will be saved. Defaults to the current directory if not specified.|
    +| **`-n_cpu`** | Number of CPUs to use in the prediction. By the default all cores will be used.|
    +| **`--gpu`** | Flag for turning on the GPU usage. Allows faster inference on large datasets. Overrides **`-n_cpu`** option.|
    +| **`--plot`** | Turns on the additional visual output of the predictions for each entry in the input. Plot files are saved in the **`-out_path`** directory.|
    +| **`--dpi`** | DPI of the saved plots, active only with **`--plot`** option.|
     
    -Results of **`-min_residue_score`** and **`-min_segment_length`** filters are stored in directories located in **`-out_path`**.
    +In a rare case of `deepcoil` being not available in your `PATH` after installation please look in the `$HOME/.local/bin/` or other system specific `pip` directory.
     
    -PSSM filenames should be based on the identifiers in the fasta file (only alphanumeric characters and '_'). For example if a fasta sequence is as follows:
    -```
    ->GCN4_YEAST RecName: Full=General control protein GCN4; AltName: Full=Amino acid biosynthesis regulatory protein
    -MSEYQPSLFALNPMGFSPLD....
    -```
    -PSSM file should be named **`GCN4_YEAST.pssm`**.
    -
    -You can generate PSSM files with the following command (requires NR90 database):
    -```bash
    -psiblast -query GCN4_YEAST.fasta -db NR90_LOCATION -evalue 0.001 -num_iterations 3 -out_ascii_pssm GCN4_YEAST.pssm
    -```
    -In order to generate PSSM file from multiple sequence alignment (MSA) you can use this command:
    -```bash
    -psiblast -subject sequence.fasta -in_msa alignment.fasta -out_ascii_pssm output.pssm
    +#### Running DeepCoil within script:
    +
    +```python
    +from deepcoil import DeepCoil
    +from deepcoil.utils import plot_preds
    +from Bio import SeqIO
    +
    +dc = DeepCoil(use_gpu=True)
    +
    +inp = {str(entry.id): str(entry.seq) for entry in SeqIO.parse('example/example.fas', 'fasta')}
    +
    +results = dc.predict(inp)
    +
    +plot_preds(results['3WPA_1'], out_file='example/example.png')
     ```
    +`results[entry]` for an entry of sequence length `N` contains two keys:
    +- `['cc']` - per residue coiled coil propensity (`[N, 1]` shape)
    +- `['hept']` - per residue core positions (`[N, 3]` shape, order in the second axis is: no/other position, *a* position, *d* position)
    +
    +Peak detection can be performed with the `deepcoil.utils.sharpen_preds` helper function.
    +#### Example graphical output:
    +![Example](example/example.png)
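
The README added in this commit says that `results[entry]['hept']` has shape `[N, 3]`, with columns ordered as no/other position, *a* position, *d* position. The sketch below shows one simple way to turn such a table into a per-residue label string by taking the highest-scoring column in each row. It uses made-up numbers in plain Python lists instead of a real prediction, and `heptad_labels` and `HEPTAD_LABELS` are hypothetical names, not part of the DeepCoil package.

```python
# Hypothetical helper, not part of DeepCoil: label each residue from a
# `hept`-style table whose columns are (other, a, d), as the README describes.
HEPTAD_LABELS = ("-", "a", "d")


def heptad_labels(hept_rows):
    """Return one label per residue by picking the highest-scoring column."""
    labels = []
    for row in hept_rows:
        scores = [float(x) for x in row]
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append(HEPTAD_LABELS[best])
    return "".join(labels)


# Synthetic stand-in for results[entry]['hept'] with N = 4 residues.
fake_hept = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.70, 0.20, 0.10],
    [0.20, 0.10, 0.70],
]
print(heptad_labels(fake_hept))  # prints "-a-d"
```

Because the helper only iterates over rows, it should also accept a NumPy array of the documented shape. Picking the largest column is an assumption made for illustration; the command-line script in this commit writes the *a* and *d* probabilities out unchanged instead.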

bin/deepcoil

Lines changed: 83 additions & 0 deletions

    @@ -0,0 +1,83 @@
    +#!/usr/bin/env python
    +
    +import os
    +import argparse
    +from Bio import SeqIO
    +from deepcoil import DeepCoil
    +from deepcoil.utils import is_fasta, sharpen_preds, plot_preds
    +
    +parser = argparse.ArgumentParser(description='DeepCoil')
    +parser.add_argument('-i',
    +                    help='Input file with sequence in fasta format.',
    +                    required=True,
    +                    metavar='FILE')
    +parser.add_argument('-out_path',
    +                    help='Output directory',
    +                    default='.',
    +                    metavar='DIR')
    +parser.add_argument('-n_cpu',
    +                    help='Number of CPUs to use in the prediction',
    +                    default=-1,
    +                    type=int,
    +                    metavar='NCPU')
    +parser.add_argument('--gpu',
    +                    help='Use GPU. This option overrides -n_cpu option',
    +                    action='store_true')
    +parser.add_argument('--plot',
    +                    help='Plot predictions. Images will be stored in the path defined by the -out_path',
    +                    action='store_true')
    +parser.add_argument('--dpi',
    +                    help='DPI of the produced images',
    +                    default=300,
    +                    type=int,
    +                    metavar='DPI')
    +args = parser.parse_args()
    +
    +# Check if input file exists
    +if not os.path.isfile(args.i):
    +    print('ERROR: Input file does not exist!')
    +    exit()
    +# Check if input is valid fasta file
    +if not is_fasta(args.i):
    +    print("ERROR: Malformed fasta file. Please check input!")
    +    exit()
    +# Check if output dir exists
    +if not os.path.isdir(args.out_path):
    +    print("ERROR: Output directory does not exist!")
    +    exit()
    +
    +# Verify fasta file
    +raw_data = list(SeqIO.parse(args.i, "fasta"))
    +data = {''.join(e for e in str(entry.id) if (e.isalnum() or e == '_')): str(entry.seq) for entry in raw_data}
    +if not len(data) == len(raw_data):
    +    print("ERROR: Sequence identifiers in the fasta file are not unique!")
    +    exit()
    +
    +print("Loading DeepCoil model...")
    +dc = DeepCoil(use_gpu=args.gpu, n_cpu=args.n_cpu)
    +
    +print('Predicting...')
    +preds = dc.predict(data)
    +
    +print('Writing output...')
    +
    +inp_keys = set(data.keys())
    +out_keys = set(preds.keys())
    +
    +if len(out_keys) < len(inp_keys):
    +    print('WARNING: Predictions for some sequences were not calculated due to length limitations and/or other errors.' \
    +          ' Inspect the warnings and results carefully!')
    +
    +for entry in out_keys:
    +    f = open(f'{args.out_path}/{entry}.out', 'w')
    +    cc_pred_raw = preds[entry]['cc']
    +    cc_pred = sharpen_preds(cc_pred_raw)
    +    hept_pred = preds[entry]['hept']
    +    f.write('aa\tcc\traw_cc\tprob_a\tprob_d\n')
    +    for aa, cc_prob, cc_prob_raw, a_prob, d_prob in zip(data[entry], cc_pred, cc_pred_raw, hept_pred[:, 1], hept_pred[:, 2]):
    +        f.write('{0}\t{1:.3f}\t{2:.3f}\t{3:.3f}\t{4:.3f}\n'.format(aa, float(cc_prob), float(cc_prob_raw), float(a_prob), float(d_prob)))
    +    f.close()
    +if args.plot:
    +    for entry in out_keys:
    +        plot_preds(preds[entry], out_file=f'{args.out_path}/{entry}.png', dpi=args.dpi)
    +print("Done!")
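
The "Verify fasta file" step in `bin/deepcoil` keys its dictionary by a sanitized identifier (alphanumeric characters and `_` only). Two different FASTA identifiers that sanitize to the same string therefore collapse into one entry, and the length comparison reports them as not unique. The sketch below reproduces only that filtering rule to show which identifiers would collide. `sanitize_id` and `find_collisions` are hypothetical helper names, and the identifiers are made up for illustration.

```python
def sanitize_id(raw_id):
    """Keep only alphanumeric characters and underscores, as the script does."""
    return "".join(ch for ch in raw_id if ch.isalnum() or ch == "_")


def find_collisions(raw_ids):
    """Group raw identifiers by their sanitized form and keep only the
    groups that contain more than one raw identifier."""
    groups = {}
    for raw_id in raw_ids:
        groups.setdefault(sanitize_id(raw_id), []).append(raw_id)
    return {clean: raws for clean, raws in groups.items() if len(raws) > 1}


# Made-up identifiers for illustration only.
ids = ["sp|P03069|GCN4_YEAST", "3WPA_1", "3WPA-1", "3WPA1"]
print(sanitize_id(ids[0]))   # spP03069GCN4_YEAST
print(find_collisions(ids))  # {'3WPA1': ['3WPA-1', '3WPA1']}
```

A check like this, run before prediction, would tell the user which entries to rename, whereas the script's error message only says that identifiers are not unique.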

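The script writes one `<entry>.out` file per sequence: a tab-separated table with the header `aa`, `cc`, `raw_cc`, `prob_a`, `prob_d` and values formatted to three decimals. A minimal sketch of reading such a table back with the standard library follows. `read_deepcoil_out` is a hypothetical helper name, and the sample rows are invented rather than real predictions.

```python
import csv
import io


def read_deepcoil_out(handle):
    """Parse the tab-separated table written by bin/deepcoil into a list of dicts."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "aa": row["aa"],
            "cc": float(row["cc"]),          # peak-sharpened coiled coil score
            "raw_cc": float(row["raw_cc"]),  # raw per-residue score
            "prob_a": float(row["prob_a"]),
            "prob_d": float(row["prob_d"]),
        })
    return rows


# Made-up values in the same layout, standing in for a real `<entry>.out` file.
sample = (
    "aa\tcc\traw_cc\tprob_a\tprob_d\n"
    "M\t0.000\t0.012\t0.001\t0.002\n"
    "L\t0.950\t0.950\t0.870\t0.020\n"
)
rows = read_deepcoil_out(io.StringIO(sample))
print(len(rows), rows[1]["aa"], rows[1]["cc"])  # 2 L 0.95
```

For a file on disk, the same function can be given an open text handle, for example `open(path, newline="")`.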