Germline genotyping and sequencing are used in disease gene discovery research and clinical genetic testing. Detection of germline variants typically requires sequencing of exonic regions through a commercial capture panel kit, either across all transcripts or a subset (small gene panel). However, sequencing costs are dropping enough that whole-genome sequencing is becoming more practical. If a capture panel is used, then a BED file of genomic coordinates for the capture regions is required. There are typically two files: a BED file of the exons targeted by the capture panel, and a padded BED file that extends the capture targets by an additional 20 bp or so. Sequencing depth should be at least 20X for viable calls; calls can be made with less depth but should be used with caution and independently verified.
For gene discovery, we perform linkage analysis, association tests, and gene network analysis, and we combine their results in a quantitative fashion for gene prioritization, variant prioritization, candidate gene selection, and statistical tests.
- Trim adapter sequences.
- Align with BWA.
- Mark duplicates with Picard MarkDuplicates.
- Re-calibrate base qualities.
- Re-align insertions and deletions. This is not absolutely required but is helpful when inspecting alignments at called variants.
- Generate raw genomic VCF files for joint genotyping.
High-Level Processing: Gene Discovery Research
The first step is quality control (QC) of data.
- At genotype level, bad genotypes are removed based on call quality score, read depth, and ploidy.
- At variant level, bad variants are removed by VQSLOD score or hard filters, the FILTER field, missing rate, missing rate discrepancy between cases and controls, Mendelian errors, Hardy-Weinberg disequilibrium in independent healthy controls in each population, genotype discordance rate, the mean read depth, and the proportion of heterozygous genotypes among non-missing haploidy individuals.
- At sample level, bad samples are removed by discrepancy between reported sex and genetic sex, pedigree structure error, Het/non-ref-hom ratio, Ti/Tv ratio, missing rate, Mean GQ and DP across all variants, cryptic relatedness, and population outlier in principle component analysis.
Subsequent analyses depend on the study design and hypothesis.
- If it is hypothesized that the disease is caused by de novo mutation, and the sequenced samples included both parents, we will perform variant filtering that focuses on de novo mutations only. The resulting variants will be prioritized based on the biological function of the gene and the functional consequence of the variants.
- If the disease of interest is Mendelian (either a Mendelian disease or a Mendelian form of a complex disease) and the sequenced samples involve a small number of pedigrees with distantly related cases, we will perform linkage analysis and association tests. Genes will be prioritized based on the test statistics combined with the biological relevance of the gene to the disease. If the cases are connected through a common ancestor from several generations before, we will also perform a share-genomic-segment analysis to see whether the cases carry a chromosome region that is longer than expected.
- If the disease of interest is a complex disease and the sequenced samples included sporadic cases, single index cases from each pedigree, or multiple closely related cases from each pedigree (such as siblings or parent-offspring pairs), then linkage analysis will not be helpful. In this case, we will perform association tests with individuals weighted by correlation. It is better to use all cases rather than just the index case, because it is likely that causal variants do not perfectly cosegregate with a complex disease; thus, using only an index case may lead to a false negative. Again, genes will be prioritized based on the test statistics combined with the biological relevance of the gene to the disease.
Normally, variants for these analyses will be first filtered by allele frequency, deleteriousness score, and ClinVar classification. Based on the hypothesis of the disease of interest, the filters can be adjusted to be more relaxed or more stringent. We can also perform anlaysis with loss-of-function variants only.
Variants from RNASeq
In theory, RNASeq should be ideal as only exonic regions of the genome are sequenced, skipping the requirement for exon capture. In practice, this is frought with numerous limitations:
- Genes are expressed in a highly development- and tissue-specific manner, necessitating matching affected tissue with available source tissue.
- Genes are expressed at widely varying levels: thousands of copies per cell versus extremely rare one or two copies per cell. Capturing sufficient read depth coverage over any individual base for calling variants is difficult.
- RNASeq alignments have notoriously high duplication levels, regardless whether due to PCR or biological duplication events. Removing duplicate alignments is essential in order to avoid artifacts, limiting coverage depth.
- RNASeq coverage over any given gene is expected to be non-uniform, even though in bulk "average gene" profiles may show consistent coverage depth.
- Reverse transcriptase used to generate cDNA copies during library construction has notoriously poor fidelity compared to DNA sequencing libraries, leading to higher rates of random noise. With sufficient depth, this can usually be safely ignored by the variant detection algorithms.
With the above caveats in mind, variants may be detected from RNASeq data. Ideally, long (100 bp or longer) paired-end reads should be used. The reads must be processed with an alternative alignment pipeline, which handles the spliced alignments properly before post-processing. See our RNASeq alignment pipeline as an example.