SeqHBase

A big data toolset for family-based sequencing data analysis

  1. Introduction
  2. Installation
  3. ETL sequencing data
  4. Detection of mutations
  5. Command parameters
  6. Citations
  1. Introduction

    SeqHBase is a big data toolset developed on top of the Apache Hadoop and HBase infrastructure. It is designed for analyzing family-based sequencing data to detect de novo, inherited homozygous, or compound heterozygous mutations. SeqHBase takes as input BAM files (for coverage of the ~3 billion sites of a genome), VCF files (for variant calls), and functional annotations (for variant prioritization). SeqHBase works in a distributed and completely parallel manner over multiple data nodes. We applied SeqHBase to a 5-member nuclear family and a 10-member three-generation family with whole genome sequencing (WGS) data, as well as a 4-member nuclear family with whole exome sequencing (WES) data. Analysis times scaled linearly with the number of data nodes. With 20 data nodes, SeqHBase took about 5 seconds to analyze the WES familial data and approximately 1 minute to analyze the 10-member WGS familial data. These results demonstrate SeqHBase's high efficiency and scalability. In addition, the toolset is distributed, customizable, and scalable to the available data volume: as more data become available, additional data nodes can be added and seamlessly incorporated into the existing system, making it very nimble.

    SeqHBase provides a number of features and functionalities. It extracts variant, variation, and coverage (read-depth) information from three commonly used file formats: tab- (or comma-) delimited variant annotation files (e.g., CSV files), VCF files, and compressed BAM files.

    SeqHBase is developed by Dr. Max M. He et al. This toolset is free to the academic community. Please feel free to contact Max with any questions, feedback, or bug reports at maxy dot he at gmail dot com.

    We are interested in hearing your comments and feedback, as features and improvements will be added in future releases. If you have ideas for improving SeqHBase or features you would like to see added, we will be glad to hear about them.

  2. Installation

    As SeqHBase builds on top of Apache Hadoop and HBase and relies on them for job execution, the installation requires a working Hadoop and HBase setup, such as the Elastic MapReduce service of Amazon Web Services (AWS) or an in-house Hadoop and HBase cluster. For more information about setting up an Apache Hadoop and HBase cluster, please refer to http://hadoop.apache.org/ and http://hbase.apache.org/.

    • Dependencies: We developed and tested SeqHBase with the following dependencies:
      • Hadoop version 2.6.0 + HBase version 0.98.11 -> SeqHBase1.10 (latest version)
      • Hadoop version 1.2.1 + HBase version 0.94.19 -> SeqHBase1.00
    • Availability: SeqHBase is a command-line based toolset and is distributed in executable format. SeqHBase-1.10 is the latest version. Users can choose to download either a virtual machine (VM) bundled with the SeqHBase package (vmSeqHBase1.10 / vmSeqHBase1.00) or a pre-compiled release (SeqHBase1.10 / SeqHBase1.00) for local/AWS use. The source code of SeqHBase can be downloaded after obtaining a license agreement with Marshfield Clinic Applied Sciences.
    • Installing the VM bundled with the SeqHBase package
      • Download the VM bundled with the SeqHBase package here
      • Install Oracle VM VirtualBox or a similar virtualization product
      • Create a VM by importing the downloaded .ova file
    • Installing the pre-compiled release
      • Download the latest SeqHBase release (seqhbase_1.10.tar.gz) here
      • Untar the release into an installation directory of your choice; e.g.,
      • $ tar zxvf seqhbase_1.10.tar.gz
      • Set Hadoop-related environment variables (e.g., HADOOP_HOME) for your installation (see the example below)
      • Set HBASE_HOME to point to your HBase installation
    • Modifying parameters in the configuration file (server_conf.xml) under the "conf" folder of SeqHBase as follows (a sketch of such a file appears below):
      • hadoop.home: location of your Hadoop installation
      • hadoop.namenode: domain name of your name node
      • hadoop.cluster: "multiple nodes" if your Hadoop cluster has multiple nodes, or "single node" if it runs on a single node
      • hbase.zookeeper.property.clientPort: property from ZooKeeper's config zoo.cfg; the port at which clients connect
      • hbase.zookeeper.quorum: comma-separated list of servers in the ZooKeeper quorum, for example "host1.mydomain.com,host2.mydomain.com". Note: no space after the comma
      • annotation.columns: the value can be "ProjectID" together with the column headers of the annotated variant information
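
    As an illustration, the Hadoop- and HBase-related environment variables might be set as follows; the paths are hypothetical placeholders for your own installation:

      $ export HADOOP_HOME=/usr/local/hadoop
      $ export HBASE_HOME=/usr/local/hbase
      $ export PATH=$PATH:$HADOOP_HOME/bin:$HBASE_HOME/bin

    Likewise, a minimal server_conf.xml might look like the sketch below. This assumes a Hadoop-style <configuration>/<property> layout; the actual schema shipped with SeqHBase may differ, and the host names and annotation headers are placeholders only:

      <configuration>
        <!-- placeholder values; adjust every value to your own cluster and annotation files -->
        <property>
          <name>hadoop.home</name>
          <value>/usr/local/hadoop</value>
        </property>
        <property>
          <name>hadoop.namenode</name>
          <value>namenode.mydomain.com</value>
        </property>
        <property>
          <name>hadoop.cluster</name>
          <value>multiple nodes</value>
        </property>
        <property>
          <name>hbase.zookeeper.property.clientPort</name>
          <value>2181</value>
        </property>
        <property>
          <name>hbase.zookeeper.quorum</name>
          <value>host1.mydomain.com,host2.mydomain.com</value>
        </property>
        <property>
          <name>annotation.columns</name>
          <value>ProjectID,Chr,Start,End,Ref,Alt,Func,Gene,ExonicFunc</value>
        </property>
      </configuration>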

  3. ETL sequencing data into your Hadoop and HBase cluster

    SeqHBase efficiently extracts, transforms, and loads (ETL) your variant, variation, and quality information from the three types of files described below; a worked loading example for a whole family is sketched at the end of this section.

    • Variant:

      Annotated variant files generated by ANNOVAR are used to extract variant information, including chromosome number, start position, end position, reference allele, alternative allele, frequency in the 1000 Genomes Project and/or the NHLBI ESP6500 project, ClinVar, CADD score, biological function, and multiple function-relevant scores, such as the PolyPhen-2 score, SIFT score, and others. When developing SeqHBase, we applied ANNOVAR to annotate the variants in those sequencing data sets. However, other annotation programs, such as snpEff (http://snpeff.sourceforge.net/), can also be used with SeqHBase by specifying "annotation.columns" with the annotation header line of your annotation files.

      Command line (note that the leading "$" symbol indicates the system prompt; do not type it in your SeqHBase command line):
      $ seqhbase.sh --memory 1024 --csv-file $ANNOTATED_FILE.csv --sample-id $FAM_ID:$IND_ID

      where $FAM_ID is the family ID and $IND_ID is the individual ID.

      [More Input Parameters]

    • Variation:

      VCF files are used to extract variation information, including sample family ID, individual ID, called variant genotypes, coverage (read-depth), and Phred quality scores.

      Type a command in a terminal as follows:
      $ seqhbase.sh --memory 1024 --vcf-file $VCF_FILE.vcf --sample-id $FAM_ID:$IND_ID

      [More Input Parameters]

    • Coverage: BAM files are used to extract the coverage of each site of every sequenced sample (~3 billion sites in a WGS). In downstream analyses, the read-depth information can identify whether no-call sites are reference-consistent with high quality or reference-inconsistent due to low quality. A specific function was developed to quickly generate the read depth of each site from BAM files, similar to the SAMtools (http://samtools.sourceforge.net/) pileup function. SeqHBase also supports loading coverage information into HBase from a pileup file generated by SAMtools.

      Type a command in a terminal as follows:
      $ seqhbase.sh --memory 4096 --pileup-file $BAM_FILE.bam --sample-id $FAM_ID:$IND_ID

      or

      $ seqhbase.sh --memory 1024 --pileup-file $PILEUP_FILE.gz --sample-id $FAM_ID:$IND_ID

      [More Input Parameters]
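
    As a worked illustration, the commands below sketch how the three file types might be loaded for one member of a hypothetical family FAM01 (individual FAM01-01); the file names, IDs, and the samtools invocation used to produce a compressed pileup file are assumptions for this example, not part of SeqHBase itself. The same commands would be repeated for every family member.

      $ seqhbase.sh --memory 1024 --csv-file FAM01-01.annovar.csv --sample-id FAM01:FAM01-01
      $ seqhbase.sh --memory 1024 --vcf-file FAM01-01.vcf --sample-id FAM01:FAM01-01
      $ seqhbase.sh --memory 4096 --pileup-file FAM01-01.bam --sample-id FAM01:FAM01-01

    Alternatively, coverage can be loaded from a SAMtools-generated pileup file. One possible way to produce such a file, assuming SAMtools is installed and hg19.fa is your reference (the exact pileup format expected may depend on your SAMtools version), is:

      $ samtools mpileup -f hg19.fa FAM01-01.bam | gzip > FAM01-01.pileup.gz
      $ seqhbase.sh --memory 1024 --pileup-file FAM01-01.pileup.gz --sample-id FAM01:FAM01-01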

  4. Detection of de novo, inherited homozygous or compound heterozygous mutations

    • De novo and autosomal recessive (or X-linked) screens - command line as follows:
      $ seqhbase.sh --memory 1024 --ped-file $PEDFILE.ped --list-denovo --min-coverage-screen 20 --maf 0.01 --func-list $FUNCLIST --exonic-func-list $EXONICFUNCLIST --query --out $OUTPUT

      [More Input Parameters]

    • Compound heterozygous screens - command line as follows:
      $ seqhbase.sh --memory 1024 --ped-file $PEDFILE.ped --list-comp-het --min-coverage-screen 20 --maf 0.01 --func-list $FUNCLIST --exonic-func-list $EXONICFUNCLIST --query --out $OUTPUT

      where $FUNCLIST can be "exonic" and/or "splicing", and $EXONICFUNCLIST can start with "nonsynonymous,stopgain,stoploss". An example pedigree file and a concrete invocation are sketched below.

      [More Input Parameters]
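
    For illustration, a minimal pedigree file for a hypothetical 4-member nuclear family (unaffected parents, one affected child) might look as follows; this assumes the standard 6-column PED layout (family ID, individual ID, father ID, mother ID, sex, affection status), and all IDs and file names are placeholders:

      $ cat FAM01.ped
      FAM01  FAM01-01  0         0         1  1
      FAM01  FAM01-02  0         0         2  1
      FAM01  FAM01-03  FAM01-01  FAM01-02  1  2
      FAM01  FAM01-04  FAM01-01  FAM01-02  2  1

    A de novo screen over this family could then be run with:

      $ seqhbase.sh --memory 1024 --ped-file FAM01.ped --list-denovo --min-coverage-screen 20 --maf 0.01 --func-list exonic,splicing --exonic-func-list nonsynonymous,stopgain,stoploss --query --out FAM01_denovo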


    By default, $FUNCLIST is "exonic,splicing" and $EXONICFUNCLIST is "nonsynonymous,stopgain,stoploss". According to ANNOVAR, $FUNCLIST can contain one or more of the following values:

    Value        Default precedence  Explanation
    exonic       1                   variant overlaps a coding exon
    splicing     1                   variant is within 2 bp of a splicing junction (use -splicing_threshold to change this)
    ncRNA        2                   variant overlaps a transcript without coding annotation in the gene definition
    UTR5         3                   variant overlaps a 5' untranslated region
    UTR3         3                   variant overlaps a 3' untranslated region
    intronic     4                   variant overlaps an intron
    upstream     5                   variant overlaps the 1-kb region upstream of the transcription start site
    downstream   5                   variant overlaps the 1-kb region downstream of the transcription end site (use -neargene to change this)
    intergenic   6                   variant is in an intergenic region


    According to ANNOVAR, $EXONICFUNCLIST can contain one or more of the following annotations:

    Annotation                        Precedence  Explanation
    frameshift insertion              1           an insertion of one or more nucleotides that causes frameshift changes in a protein coding sequence
    frameshift deletion               2           a deletion of one or more nucleotides that causes frameshift changes in a protein coding sequence
    frameshift block substitution     3           a block substitution of one or more nucleotides that causes frameshift changes in a protein coding sequence
    stopgain                          4           a nonsynonymous SNV, frameshift insertion/deletion, nonframeshift insertion/deletion or block substitution that leads to the immediate creation of a stop codon at the variant site. For frameshift mutations, the creation of a stop codon downstream of the variant will not be counted as "stopgain"!
    stoploss                          5           a nonsynonymous SNV, frameshift insertion/deletion, nonframeshift insertion/deletion or block substitution that leads to the immediate elimination of a stop codon at the variant site
    nonframeshift insertion           6           an insertion of 3 or a multiple of 3 nucleotides that does not cause frameshift changes in a protein coding sequence
    nonframeshift deletion            7           a deletion of 3 or a multiple of 3 nucleotides that does not cause frameshift changes in a protein coding sequence
    nonframeshift block substitution  8           a block substitution of one or more nucleotides that does not cause frameshift changes in a protein coding sequence
    nonsynonymous SNV                 9           a single nucleotide change that causes an amino acid change
    synonymous SNV                    10          a single nucleotide change that does not cause an amino acid change
    unknown                           11          unknown function (due to various errors in the gene structure definition in the database file)

  5. Command parameters for detecting mutations:
    • --hadoop-host: specify the domain name of the Hadoop name node
    • --hadoop-home: specify the location of your Hadoop installation
    • --project: specify a project name
    • --polyphen-no-B: require that the PolyPhen-2 prediction is not "B" (benign)
    • --sift-no-T: require that the SIFT prediction is not "T" (tolerated)
    • --ped-file: specify the pedigree file of a family
    • --query: analyze a data set with the specified pedigree information
    • --list-denovo: list de novo and autosomal recessive (or X-linked) mutations in nuclear families
    • --list-comp-het: list compound heterozygous mutations in nuclear families
    • --min-coverage-screen {20}: specify the minimum coverage (read depth) for screens; the default value is 20
    • --maf {0.01}: specify the maximum variant allele frequency in the 1000 Genomes Project and the NHLBI Exome Sequencing Project; the default value is 0.01
    • --func-list: by default, "exonic,splicing"
    • --exonic-func-list: by default, it starts with "nonsynonymous,stopgain,stoploss,frameshift"
    • --sample-id: specify a sample ID as the combination $FAM_ID + ":" + $IND_ID, where $FAM_ID is the family ID and $IND_ID is the individual ID
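
    As an illustration, several of these parameters can be combined in a single screen. One possible invocation might look like the following sketch; the host name, project name, and file names are placeholders for this example:

      $ seqhbase.sh --memory 1024 --hadoop-host namenode.mydomain.com --project MyProject --ped-file FAM01.ped --list-comp-het --min-coverage-screen 20 --maf 0.01 --func-list exonic,splicing --exonic-func-list nonsynonymous,stopgain,stoploss --polyphen-no-B --sift-no-T --query --out FAM01_comphet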

  6. Citations

    He et al., SeqHBase: a big data toolset for family-based sequencing data analysis. Journal of Medical Genetics, 2015, 52, 282-288. [View PubMed]
