Commit e9c278d

author
u0028003
committed
Updating the README
1 parent d1c0539 commit e9c278d

1 file changed

+25
-24
lines changed

README.md

Lines changed: 25 additions & 24 deletions
Original file line number | Diff line number | Diff line change
@@ -1,36 +1,37 @@
11
# GQuery
2-
GQuery is a software tool for rapidly querying large numbers of bgzip compressed, tabix indexed genomic data files e.g. vcf, gvcf, maf, bed, bedGraph, etc. from multiple species with different genome builds without the need to develop, debug, and maintain custom file parsers for every file format and flavor. Just point the GQuery indexer at a collection of tabix indexed files and then run either the GQuery command line app or the web API to search them. GQuery is built using a fast, multi-threaded genomic range search engine with extensive junit testing. Lastly, it is free to use.
2+
GQuery is a software tool for rapidly querying large numbers of bgzip compressed, tabix indexed genomic data files e.g. vcf, maf, bed, bedGraph, etc. from multiple species with different genome builds without the need to develop, debug, and maintain custom file parsers for every file format and flavor. Just point the GQuery indexer at a collection of tabix indexed files and then run either the GQuery command line app or the web API to search them. GQuery is built using a fast, multi-threaded genomic range search engine with extensive junit testing. Lastly, it is free to use.
33

4-
The GQuery package includes three Java applications:
4+
<u>The GQuery package includes three Java applications:</u>
5+
<ol>
6+
<li>GQuery Indexer - a command line tool for building chromosome indexes that link genomic coordinates with the data files that contain intersecting records.</li>
7+
<li>GQuery CLI - a command line tool for executing queries locally on GQuery indexed data directories.</li>
8+
<li>GQuery API - a RESTful web API service for executing queries on remote servers for authenticated user groups.</li>
9+
</ol>
510

6-
GQueryIndexer - a command line tool for building chromosome indexes that link genomic coordinates with the data files that contain intersecting records.
11+
Each query triggers an intersection of each user's regions of interest against the GQuery chromosome indexes to identify data files that contain intersecting records. Regular expression filters are provided to limit which directory paths are searched and which file types are returned. Often this is all that is needed for a basic query. If requested, a second tabix search is used to fetch the actual intersecting records from the data files. These records can also be filtered using regular expressions.
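The two-stage flow described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not GQuery's implementation: the toy `INDEX` dictionary, the `find_files` helper, and the file paths are all hypothetical, and a linear scan stands in for the real range search engine. Coordinates are treated as 0-based, half-open intervals, as in bed files.

```python
import re

# Hypothetical toy index: chromosome -> list of (start, end, data file path).
# Each entry means "this file has at least one record overlapping start..end".
INDEX = {
    "20": [
        (4_160_000, 4_170_000, "Data/B37/TCGA/sampleA.vcf.gz"),
        (4_100_000, 4_120_000, "Data/B37/TCGA/sampleB.bed.gz"),
        (4_163_000, 4_164_000, "Data/Hg38/AVATAR/sampleC.vcf.gz"),
    ],
}


def find_files(regions, path_regex=".+", file_regex=".+"):
    """Stage one: return the files whose indexed intervals intersect any region.

    regions is an iterable of (chrom, start, end) tuples using 0-based,
    half-open coordinates. The two regular expressions limit which
    directory paths and which file names are considered.
    """
    path_pat = re.compile(path_regex)
    file_pat = re.compile(file_regex)
    hits = set()
    for chrom, start, end in regions:
        for idx_start, idx_end, path in INDEX.get(chrom, []):
            overlaps = idx_start < end and start < idx_end
            if overlaps and path_pat.search(path) and file_pat.search(path):
                hits.add(path)
    return sorted(hits)


if __name__ == "__main__":
    query = [("20", 4_163_143, 4_163_144)]
    # Restrict the search to the B37 build and to vcf files only.
    print(find_files(query, path_regex=r"/B37/", file_regex=r"vcf\.gz$"))
```

In this sketch the second stage, fetching the intersecting records with tabix, would only be run on the files that `find_files` returns, which is why narrowing the path and file name filters first keeps a query cheap.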
712

8-
GQuery CLI - a command line tool for executing queries locally on GQuery indexed data directories.
13+
This approach of searching genomic coordinate indexes for intersecting data files, combined with tabix data record retrieval, is an excellent way to address the random range query problem. In our benchmarking tests, it significantly outperforms both relational database (MySQL) and NoSQL (MongoDB) approaches. Moreover, use of the widely adopted, bgzip compressed, tabix indexed file format (<https://www.htslib.org/doc/tabix.html>) eliminates the need to duplicate the data source content or create and maintain custom db data file importers. If you can tabix index it, you can search it.
914

10-
GQuery API - a RESTful web API service for executing queries on remote servers for authenticated user groups.
11-
12-
Each query triggers an initial intersection of each user's regions of interest against the GQuery chromosome indexes to identify data files that contain intersecting records. Regular expression filters are provided to limit which directory paths are searched and which file types are returned. Often this is all that is needed for a basic query. If requested, a second tabix search is used to fetch the actual intersecting records from the data files. These records can also be filtered using regular expressions.
13-
14-
This approach of searching genomic coordinate indexes for intersecting data files, combined with tabix data record retrieval, is an excellent way to address the random range query problem. In our benchmarking tests, it significantly outperforms both relational database (MySQL) and NoSQL (MongoDB) approaches. Moreover, use of the widely adopted, bgzip compressed, tabix indexed file format (https://www.htslib.org/doc/tabix.html) eliminates the need to duplicate the data source content or create and maintain custom db data file importers. If you can tabix index it, you can search it.
15-
16-
Getting up and going with GQuery is a simple three step process: download the latest jar files, build the chromosome data file indexes, and execute queries using the CLI.
15+
Getting up and going with GQuery is a **simple three step process:** download the latest jar files, build the chromosome data file indexes, and execute queries using the CLI.
1716

1817
For those looking to provide search capability via a web application, deploy the GQuery RESTful web API. This is especially useful when searching needs to be restricted to subsets of the data for particular user groups, e.g. patient, IRB restricted, or unpublished project data.
1918

2019
---
21-
# Step 1: Download the Jar Files
22-
Goto https://github.com/HuntsmanCancerInstitute/GQuery/releases and download the latest xxx.jar files. These are self contained. No other libraries are required. Open a command line terminal. Type 'java -version'. If needed, install java 1.8 or higher.
20+
## Step 1: Download the Jar Files
21+
Go to <https://github.com/HuntsmanCancerInstitute/GQuery/releases> and download the latest xxx.jar files. These are self-contained. No other libraries are required. Open a command line terminal. Type 'java -version'. If needed, install Java 1.8 or higher (<https://www.java.com/en/download/>). Launch the Indexer and CLI without options to pull up the help menus, e.g.
22+
23+
<pre>java -jar pathToJars/GQueryIndexer.jar; java -jar pathToJars/GQueryCLI.jar</pre>
2324

2425

2526
---
26-
# Step 2: Build the Chromosome Indexes
27+
## Step 2: Build the Chromosome Indexes
2728

2829
The second step with GQuery is to build the chromosome indexes with the GQueryIndexer application. It is multi-threaded and junit tested.
2930

3031
Give some thought to how best to structure the base Data directory for your group. If you are working with multiple species and genome builds, create a sub directory named with the build for easy directory path regular expression matching (e.g. Data/B37/, Data/Hg38/, Data/MM10/, etc.). Likewise, create directories for each major project (e.g. Data/Hg38/TCGA, Data/Hg38/AVATAR, Data/Hg38/Clinical/Foundation) and for particular data types (e.g. Data/Hg38/AVATAR/Germline, Data/Hg38/AVATAR/Somatic/Vcf, Data/Hg38/AVATAR/Somatic/ReadCoverage, Data/Hg38/AVATAR/Somatic/Cnv). Keep in mind that a .GQuery chromosome index is created in each directory that contains xxx.gz.tbi files. Thus the best indexing strategy is to soft link or copy hundreds to thousands of files into the same directory. The worst strategy is to have many directories with just a few data files each. Lastly, directory path regular expressions are used by GQuery both to restrict what a user can search and to speed up the searching, so create a directory structure that best meets your needs.
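As a rough sketch of the soft link strategy described above, the following Python helper gathers tabix indexed files from several source directories into one data directory. The `link_tabix_files` function and all paths are hypothetical and not part of GQuery; the demo at the bottom runs in a temporary directory with empty placeholder files.

```python
import tempfile
from pathlib import Path


def link_tabix_files(source_dirs, target_dir):
    """Soft link every xxx.gz file that has a xxx.gz.tbi index into target_dir.

    Returns the names of the data files found with an index. Files without
    a tabix index are skipped, and existing links are left untouched.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    linked = []
    for source in map(Path, source_dirs):
        for index_file in sorted(source.glob("*.gz.tbi")):
            data_file = index_file.with_suffix("")  # strip the trailing .tbi
            if not data_file.is_file():
                continue
            for original in (data_file, index_file):
                link = target / original.name
                if not (link.is_symlink() or link.exists()):
                    link.symlink_to(original.resolve())
            linked.append(data_file.name)
    return linked


if __name__ == "__main__":
    # Demo with placeholder files; substitute your own project directories.
    with tempfile.TemporaryDirectory() as tmp:
        run_dir = Path(tmp, "run1")
        run_dir.mkdir()
        for name in ("p1.vcf.gz", "p1.vcf.gz.tbi", "notIndexed.bed.gz"):
            (run_dir / name).touch()
        data_dir = Path(tmp, "Data", "Hg38", "AVATAR", "Somatic", "Vcf")
        print(link_tabix_files([run_dir], data_dir))
```

Soft links keep a single copy of each data file on disk while still giving the indexer one well populated directory per project and data type.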
3132

3233
<pre>
33-
> java -jar -Xmx30G ~/YourPathTo/GQueryIndexer.jar
34+
java -jar -Xmx30G ~/YourPathTo/GQueryIndexer.jar
3435

3536
**************************************************************************************
3637
** GQuery Indexer: Oct 2020 **
@@ -75,7 +76,7 @@ java -jar -Xmx115G GQueryIndexer.jar -c $d/b37Chr20-21ChromLen.bed -d $d/Data
7576

7677

7778
---
78-
# Step 3: Run Local Queries with the Command Line Interface
79+
## Step 3: Run Local Queries with the Command Line Interface
7980

8081
Run queries locally using the GQueryCLI application. It is multi-threaded and junit tested. Results are returned in JSON. Specify one or more regions of interest in bed region or vcf format. Use the path and file name regular expressions to speed up and limit what files are searched.
8182

@@ -86,7 +87,7 @@ Likewise, if you are only interested in actual vcf variants, specify a file name
8687
Lastly, fetching the actual intersecting data records from each file can be a computationally intensive process, so only use the '-d' fetch data option after you have narrowed down your search with path and file name regexes. In many cases it's not needed. For example, if you're only interested in identifying patients with a BRCA1 mutation, skip the '-d' option and just parse the 'source' names from the JSON output.
8788

8889
<pre>
89-
> java -jar -Xmx30G ~/YourPathTo/GQueryCLI.jar
90+
java -jar -Xmx30G ~/YourPathTo/GQueryCLI.jar
9091

9192
**************************************************************************************
9293
** GQuery Command Line Interface: Oct 2020 **
@@ -153,7 +154,7 @@ Examples:
153154

154155

155156
---
156-
# Step 4 (Optional): Run Queries using the Web API
157+
## Step 4 (Optional): Run Queries using the Web API
157158

158159
If needed, queries may be executed on remote servers using a web API. It is built using the Java Jersey JAX RESTful API framework. It is JUnit tested and deployed on Apache Tomcat for optimized performance. Results are returned in JSON. Key token based digest authentication may be enabled to restrict what users may search.
159160

@@ -209,18 +210,18 @@ Note, you will likely need to encode the tabs by replacing them with %09 if past
209210
<pre>http://localhost:8080/GQuery/search?vcf=20%094163144%09.%09C%09A%09.%09.%09.&matchVcf=true&regExFileName=vcf\.gz&includeHeaders=true</pre>
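Rather than replacing tabs by hand, the query string can be percent-encoded programmatically. The sketch below uses Python's standard `urllib.parse`; the base URL and parameter names are copied from the example above, while the `build_search_url` helper is hypothetical. Note that full percent-encoding also converts the backslash in the file name regex to %5C, which is assumed here to decode normally on the server side.

```python
from urllib.parse import quote, urlencode

# Base URL taken from the example above; adjust host, port, and app name.
BASE_URL = "http://localhost:8080/GQuery/search"


def build_search_url(params, base_url=BASE_URL):
    """Return base_url plus a percent-encoded query string.

    quote_via=quote turns each tab into %09 (and each space into %20).
    """
    return base_url + "?" + urlencode(params, quote_via=quote)


if __name__ == "__main__":
    # A tab-delimited vcf record, as in the example above.
    vcf_record = "\t".join(["20", "4163144", ".", "C", "A", ".", ".", "."])
    url = build_search_url(
        {
            "vcf": vcf_record,
            "matchVcf": "true",
            "regExFileName": r"vcf\.gz",
            "includeHeaders": "true",
        }
    )
    print(url)
```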
210211

211212

212-
# Installing the GQuery Web App
213-
### See also https://github.com/HuntsmanCancerInstitute/GQuery/blob/master/Misc/queryNotes.txt
213+
## Installing the GQuery Web App
214+
See also <https://github.com/HuntsmanCancerInstitute/GQuery/blob/master/Misc/queryNotes.txt>
214215

215216
### Install Tomcat 7 on a large linux server (>12 cores, > 30G RAM)
216-
e.g. https://www.digitalocean.com/community/tutorials/how-to-install-apache-tomcat-7-on-centos-7-via-yum
217+
e.g. <https://www.digitalocean.com/community/tutorials/how-to-install-apache-tomcat-7-on-centos-7-via-yum>
217218

218219
Be sure to increase the Xmx and MaxPermSize params to > 30G RAM to avoid out of memory errors. Tomcat 8 likely works but hasn't been tested.
219220

220221
### Modify the latest GQuery-XX.war
221222

222223
Download, unzip, and modify the following config files to match your environment<br>
223-
https://github.com/HuntsmanCancerInstitute/GQuery/releases
224+
<https://github.com/HuntsmanCancerInstitute/GQuery/releases>
224225

225226
<pre>unzip -q GQuery-XX.war </pre>
226227

@@ -242,10 +243,10 @@ Examine the log4j log file for startup and test issues. Loading of the interval
242243
Test the server: *http://IPAddressOfMyBigServer:8080/GQuery-XX/search?fetchOptions=true* <br>
243244

244245
---
245-
# Configuring GQuery for token based digest authentication
246+
## Configuring GQuery for token based digest authentication
246247

247248
### Enable digest authentication in tomcat, see the WEB-INF/web.xml doc for an example, details:
248-
https://techannotation.wordpress.com/2012/07/02/tomcat-digestauthentication/
249+
<https://techannotation.wordpress.com/2012/07/02/tomcat-digestauthentication/>
249250

250251
Generate passwords:
251252
apache-tomcat-7.xxx/bin/digest.sh -a md5 Obama:GQuery:thankYou
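For cross-checking, HTTP digest authentication with MD5 is based on the hash of the string `user:realm:password`. The Python sketch below computes that value with `hashlib`; it assumes digest.sh hashes the same UTF-8 string with no salt, so compare the result against your own Tomcat output before relying on it. The `digest_credentials` helper is hypothetical.

```python
import hashlib


def digest_credentials(user, realm, password):
    """Return the hex MD5 of 'user:realm:password' (the HA1 value of HTTP digest auth)."""
    text = f"{user}:{realm}:{password}"
    return hashlib.md5(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Same example credentials as the digest.sh command above.
    print(digest_credentials("Obama", "GQuery", "thankYou"))
```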
