SeqHBase

A big data toolset for family-based sequencing data analysis

  1. Introduction
  2. Installation
  3. ETL sequencing data
  4. Detection of mutations
  5. Command parameters
  6. Citations
  1. Introduction

    SeqHBase is a big data toolset developed on top of the Apache Hadoop and HBase infrastructure. It is designed for analyzing family-based sequencing data to detect de novo, inherited homozygous, or compound heterozygous mutations. SeqHBase takes as input BAM files (for coverage of the ~3 billion sites of a genome), VCF files (for variant calls), and functional annotations (for variant prioritization). SeqHBase works in a distributed and completely parallel manner over multiple data nodes. We applied SeqHBase to a 5-member nuclear family and a 10-member three-generation family with whole genome sequencing (WGS) data, as well as a 4-member nuclear family with whole exome sequencing (WES) data. Analysis times scaled linearly with the number of data nodes. With 20 data nodes, SeqHBase took about 5 seconds to analyze the WES familial data and approximately 1 minute to analyze the 10-member WGS familial data. These results demonstrate SeqHBase's high efficiency and scalability. In addition, the toolset is distributed, customizable, and scalable to the available data volume: as more data become available, additional data nodes can be added and seamlessly incorporated into the existing system, making it very nimble.

    SeqHBase provides a number of features and functionalities. It extracts variant, variation, and coverage (read-depth) information from three commonly used file formats: tab- (or comma-) delimited variant annotation files (e.g., CSV files), VCF files, and compressed BAM files.

    SeqHBase is developed by Dr. Max M. He et al. This toolset is free to the academic community. Please feel free to contact Max with any questions, feedback, or bug reports at maxy dot he at gmail dot com.

    We are interested in hearing your comments and feedback, as features and improvements will be added in future releases. If you have ideas for improving SeqHBase or features you would like to see added, we will be glad to hear about them.

  2. Installation

    As SeqHBase builds on top of Apache Hadoop and HBase and relies on them for job execution, the installation requires a working Hadoop and HBase setup, such as the Elastic MapReduce service of Amazon Web Services (AWS) or an in-house Hadoop and HBase cluster. For more information about setting up an Apache Hadoop and HBase cluster, please refer to http://hadoop.apache.org/ and http://hbase.apache.org/.

    • Dependencies: We developed and tested SeqHBase with the following dependencies:
      • Hadoop version 2.6.0 + HBase version 0.98.11 -> SeqHBase1.10 (latest version)
      • Hadoop version 1.2.1 + HBase version 0.94.19 -> SeqHBase1.00
    • Availability: SeqHBase is a command-line based toolset and is distributed in executable format. SeqHBase-1.10 is the latest version. Users can choose to download either a virtual machine (VM) bundled with the SeqHBase package (vmSeqHBase1.10 / vmSeqHBase1.00) or a pre-compiled release (SeqHBase1.10 / SeqHBase1.00) for local/AWS use. The source code of SeqHBase can be downloaded after obtaining a license agreement with Marshfield Clinic Applied Sciences.
    • Installing the VM bundled with the SeqHBase package
      • Download the VM bundled with the SeqHBase package here
      • Install Oracle VM VirtualBox or a similar virtualization product
      • Create a VM by importing the downloaded .ova file
    • Installing the pre-compiled release
      • Download the latest SeqHBase release (seqhbase_1.10.tar.gz) here
      • Untar the release into an installation directory of your choice; e.g.,
      • $ tar zxvf seqhbase_1.10.tar.gz
      • Set Hadoop-related environment variables (e.g., HADOOP_HOME) for your installation (see the example below)
      • Set HBASE_HOME to point to your HBase installation
    • Modifying parameters in the configuration file (server_conf.xml) under the "conf" folder of SeqHBase as follows (a sketch of such a file appears below):
      • hadoop.home: location of your Hadoop installation
      • hadoop.namenode: domain name of your name node
      • hadoop.cluster: "multiple nodes" if your Hadoop cluster has multiple nodes, or "single node" if it runs on a single node
      • hbase.zookeeper.property.clientPort: property from ZooKeeper's config zoo.cfg; the port at which clients connect
      • hbase.zookeeper.quorum: comma-separated list of servers in the ZooKeeper quorum, for example "host1.mydomain.com,host2.mydomain.com". Note: no space after the comma
      • annotation.columns: the value can be "ProjectID" together with the column headers of the annotated variant information
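
    As an illustration, the Hadoop- and HBase-related environment variables might be set as follows; the paths are hypothetical placeholders for your own installation:

      $ export HADOOP_HOME=/usr/local/hadoop
      $ export HBASE_HOME=/usr/local/hbase
      $ export PATH=$PATH:$HADOOP_HOME/bin:$HBASE_HOME/bin

    Likewise, a minimal server_conf.xml might look like the sketch below. This assumes a Hadoop-style <configuration>/<property> layout; the actual schema shipped with SeqHBase may differ, and the host names and annotation headers are placeholders only:

      <configuration>
        <!-- placeholder values; adjust every value to your own cluster and annotation files -->
        <property>
          <name>hadoop.home</name>
          <value>/usr/local/hadoop</value>
        </property>
        <property>
          <name>hadoop.namenode</name>
          <value>namenode.mydomain.com</value>
        </property>
        <property>
          <name>hadoop.cluster</name>
          <value>multiple nodes</value>
        </property>
        <property>
          <name>hbase.zookeeper.property.clientPort</name>
          <value>2181</value>
        </property>
        <property>
          <name>hbase.zookeeper.quorum</name>
          <value>host1.mydomain.com,host2.mydomain.com</value>
        </property>
        <property>
          <name>annotation.columns</name>
          <value>ProjectID,Chr,Start,End,Ref,Alt,Func,Gene,ExonicFunc</value>
        </property>
      </configuration>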

  3. ETL sequencing data into your Hadoop and HBase cluster

    SeqHBase efficiently extracts, transforms, and loads (ETL) your variant, variation, and quality information from the three types of files described below; a worked loading example for a whole family is sketched at the end of this section.

    • Variant:

      Annotated variant files generated by ANNOVAR are used to extract variant information, including chromosome number, start position, end position, reference allele, alternative allele, frequency in the 1000 Genomes Project and/or the NHLBI ESP6500 project, ClinVar, CADD score, biological function, and multiple function-relevant scores, such as the PolyPhen-2 score, SIFT score, and others. When developing SeqHBase, we applied ANNOVAR to annotate the variants in those sequencing data sets. However, other annotation programs, such as snpEff (http://snpeff.sourceforge.net/), can also be used with SeqHBase by specifying "annotation.columns" with the annotation header line of your annotation files.

      Command line (note that the leading "$" symbol indicates the system prompt; do not type it in your SeqHBase command line):
      $ seqhbase.sh --memory 1024 --csv-file $ANNOTATED_FILE.csv --sample-id $FAM_ID:$IND_ID

      where $FAM_ID is the family ID and $IND_ID is the individual ID.

      [More Input Parameters]

    • Variation:

      VCF files are used to extract variation information, including sample family ID, individual ID, called variant genotypes, coverage (read-depth), and Phred quality scores.

      Type a command in a terminal as follows:
      $ seqhbase.sh --memory 1024 --vcf-file $VCF_FILE.vcf --sample-id $FAM_ID:$IND_ID

      [More Input Parameters]

    • Coverage: BAM files are used to extract the coverage of each site of every sequenced sample (~3 billion sites in a WGS). In downstream analyses, the read-depth information can identify whether no-call sites are reference-consistent with high quality or reference-inconsistent due to low quality. A specific function was developed to quickly generate the read depth of each site from BAM files, similar to the SAMtools (http://samtools.sourceforge.net/) pileup function. SeqHBase also supports loading coverage information into HBase from a pileup file generated by SAMtools.

      Type a command in a terminal as follows:
      $ seqhbase.sh --memory 4096 --pileup-file $BAM_FILE.bam --sample-id $FAM_ID:$IND_ID

      or

      $ seqhbase.sh --memory 1024 --pileup-file $PILEUP_FILE.gz --sample-id $FAM_ID:$IND_ID

      [More Input Parameters]
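
    As a worked illustration, the commands below sketch how the three file types might be loaded for one member of a hypothetical family FAM01 (individual FAM01-01); the file names, IDs, and the samtools invocation used to produce a compressed pileup file are assumptions for this example, not part of SeqHBase itself. The same commands would be repeated for every family member.

      $ seqhbase.sh --memory 1024 --csv-file FAM01-01.annovar.csv --sample-id FAM01:FAM01-01
      $ seqhbase.sh --memory 1024 --vcf-file FAM01-01.vcf --sample-id FAM01:FAM01-01
      $ seqhbase.sh --memory 4096 --pileup-file FAM01-01.bam --sample-id FAM01:FAM01-01

    Alternatively, coverage can be loaded from a SAMtools-generated pileup file. One possible way to produce such a file, assuming SAMtools is installed and hg19.fa is your reference (the exact pileup format expected may depend on your SAMtools version), is:

      $ samtools mpileup -f hg19.fa FAM01-01.bam | gzip > FAM01-01.pileup.gz
      $ seqhbase.sh --memory 1024 --pileup-file FAM01-01.pileup.gz --sample-id FAM01:FAM01-01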

  4. Detection of de novo, inherited homozygous or compound heterozygous mutations

    • De novo and autosomal recessive (or X-linked) screens - command line as follows:
      $ seqhbase.sh --memory 1024 --ped-file $PEDFILE.ped --list-denovo --min-coverage-screen 20 --maf 0.01 --func-list $FUNCLIST --exonic-func-list $EXONICFUNCLIST --query --out $OUTPUT

      [More Input Parameters]

    • Compound heterozygous screens - command line as follows:
      $ seqhbase.sh --memory 1024 --ped-file $PEDFILE.ped --list-comp-het --min-coverage-screen 20 --maf 0.01 --func-list $FUNCLIST --exonic-func-list $EXONICFUNCLIST --query --out $OUTPUT

      where $FUNCLIST can be "exonic" and/or "splicing", and $EXONICFUNCLIST can start with "nonsynonymous,stopgain,stoploss". An example pedigree file and a concrete invocation are sketched below.

      [More Input Parameters]
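
    For illustration, a minimal pedigree file for a hypothetical 4-member nuclear family (unaffected parents, one affected child) might look as follows; this assumes the standard 6-column PED layout (family ID, individual ID, father ID, mother ID, sex, affection status), and all IDs and file names are placeholders:

      $ cat FAM01.ped
      FAM01  FAM01-01  0         0         1  1
      FAM01  FAM01-02  0         0         2  1
      FAM01  FAM01-03  FAM01-01  FAM01-02  1  2
      FAM01  FAM01-04  FAM01-01  FAM01-02  2  1

    A de novo screen over this family could then be run with:

      $ seqhbase.sh --memory 1024 --ped-file FAM01.ped --list-denovo --min-coverage-screen 20 --maf 0.01 --func-list exonic,splicing --exonic-func-list nonsynonymous,stopgain,stoploss --query --out FAM01_denovo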


    By default, $FUNCLIST is "exonic,splicing" and $EXONICFUNCLIST is "nonsynonymous,stopgain,stoploss". According to ANNOVAR, $FUNCLIST can contain one or more of the following values:

    Value        Default precedence  Explanation
    exonic       1                   variant overlaps a coding exon
    splicing     1                   variant is within 2 bp of a splicing junction (use -splicing_threshold to change this)
    ncRNA        2                   variant overlaps a transcript without coding annotation in the gene definition
    UTR5         3                   variant overlaps a 5' untranslated region
    UTR3         3                   variant overlaps a 3' untranslated region
    intronic     4                   variant overlaps an intron
    upstream     5                   variant overlaps the 1-kb region upstream of the transcription start site
    downstream   5                   variant overlaps the 1-kb region downstream of the transcription end site (use -neargene to change this)
    intergenic   6                   variant is in an intergenic region


    According to ANNOVAR, $EXONICFUNCLIST can contain one or more of the following annotations:

    Annotation                        Precedence  Explanation
    frameshift insertion              1           an insertion of one or more nucleotides that causes frameshift changes in a protein coding sequence
    frameshift deletion               2           a deletion of one or more nucleotides that causes frameshift changes in a protein coding sequence
    frameshift block substitution     3           a block substitution of one or more nucleotides that causes frameshift changes in a protein coding sequence
    stopgain                          4           a nonsynonymous SNV, frameshift insertion/deletion, nonframeshift insertion/deletion or block substitution that leads to the immediate creation of a stop codon at the variant site. For frameshift mutations, the creation of a stop codon downstream of the variant will not be counted as "stopgain"!
    stoploss                          5           a nonsynonymous SNV, frameshift insertion/deletion, nonframeshift insertion/deletion or block substitution that leads to the immediate elimination of a stop codon at the variant site
    nonframeshift insertion           6           an insertion of 3 or a multiple of 3 nucleotides that does not cause frameshift changes in a protein coding sequence
    nonframeshift deletion            7           a deletion of 3 or a multiple of 3 nucleotides that does not cause frameshift changes in a protein coding sequence
    nonframeshift block substitution  8           a block substitution of one or more nucleotides that does not cause frameshift changes in a protein coding sequence
    nonsynonymous SNV                 9           a single nucleotide change that causes an amino acid change
    synonymous SNV                    10          a single nucleotide change that does not cause an amino acid change
    unknown                           11          unknown function (due to various errors in the gene structure definition in the database file)

  5. Command parameters for detecting mutations:
    • --hadoop-host: specify the domain name of the Hadoop name node
    • --hadoop-home: specify the location of your Hadoop installation
    • --project: specify a project name
    • --polyphen-no-B: require that the PolyPhen-2 prediction is not "B" (benign)
    • --sift-no-T: require that the SIFT prediction is not "T" (tolerated)
    • --ped-file: specify the pedigree file of a family
    • --query: analyze a data set with the specified pedigree information
    • --list-denovo: list de novo and autosomal recessive (or X-linked) mutations in nuclear families
    • --list-comp-het: list compound heterozygous mutations in nuclear families
    • --min-coverage-screen {20}: specify the minimum coverage (read depth) for screens; the default value is 20
    • --maf {0.01}: specify the maximum variant allele frequency in the 1000 Genomes Project and the NHLBI Exome Sequencing Project; the default value is 0.01
    • --func-list: by default, "exonic,splicing"
    • --exonic-func-list: by default, it starts with "nonsynonymous,stopgain,stoploss,frameshift"
    • --sample-id: specify a sample ID as the combination $FAM_ID + ":" + $IND_ID, where $FAM_ID is the family ID and $IND_ID is the individual ID
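
    As an illustration, several of these parameters can be combined in a single screen. One possible invocation might look like the following sketch; the host name, project name, and file names are placeholders for this example:

      $ seqhbase.sh --memory 1024 --hadoop-host namenode.mydomain.com --project MyProject --ped-file FAM01.ped --list-comp-het --min-coverage-screen 20 --maf 0.01 --func-list exonic,splicing --exonic-func-list nonsynonymous,stopgain,stoploss --polyphen-no-B --sift-no-T --query --out FAM01_comphet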

  6. Citations

    He et al., SeqHBase: a big data toolset for family-based sequencing data analysis. Journal of Medical Genetics, 2015, 52, 282-288. [View PubMed]
