Detection and identification of methylated residues requires sequencing of bisulfite-treated DNA (or RNA), which converts unmethylated cytosine residues to uracil (and interpreted as a thymine in sequencing results), while methylated cytosine residues are unaffected. Sequence reads are therefore aligned to essentially two genomes in silico, converted and unconverted versions. Methylation occurs primarily at CpG contexts, but also less frequently at non-CpG contexts (mCHG or mCHH). Results are typically reported as percent methylation (number of unconverted C over total C) at any given position, but this should be thresholded to a minimum number of reads (usually 8 to 10 reads) before a call can be determined. Whole-genome bisulfite sequencing therefore requires at least 10X coverage (preferably much higher) at a significant cost. Commercial capture panels for known sites of methylation (CpG islands, known differentially methylated regions) can often yield much better coverage at these sites for lower cost and higher rate of results. One important control should be the inclusion of completely unmethylated DNA (typically lambda phage DNA), which can be used to measure the efficiency of bisulfite conversion: > 97% is good, while < 90% is very poor.
Bisulfite treatment is very damaging to DNA, so a lower rate of alignment is normally expected compared to traditional DNA sequencing. Furthermore, a higher percentage of Illumina PhiX control DNA is usually included to maintain reasonably expected levels of base composition, resulting in lower target genome yields.
We have had the best experience with Novocraft novoalign for aligning bisulfite in bisulfite mode "-b 2". We have pre-built indices available for the common genome versions through our pysano service. Note that novoalign, while yielding a high rate of alignment, is quite slow in bisulfite alignment. It’s recommend to split the Fastq files into five million reads apiece and run each as separate jobs; the resulting alignments may be merged afterwards.
Besides novoalign, other aligners exist for bisulfite-treated DNA, including the popular Bismark tool. We have tested this alignment wrapper with Bowtie2 (generally recommended due to better alignment rates and indel handling). While alignment speed is considerably better than novoalign, the alignment rates are not as high.
You may find alignment templates for both novoalign and Bismark in our methylseq workflow repository.
Methylseq analysis is typically done at two different levels: at base level (individual C nucleotide) and at regional level (a cluster of several C nucleotides within a defined window length, e.g. 5 Cs in 250 bp). While methylation at a specific residue may impact a specific binding site, changes in methylation over a region are thought to be more biologically significant.
- Run NovoalignBisulfiteParser to parse alignments into point data.
- Run BisStat to generate a number statistics and summary graph files. Statistics are printed to standard out.
- Filter CpG context point data with FilterPointData.
- Use ScoreMethylatedRegions and DefinedRegionBisSeq to identify particular regions of methylation or differential methylation.
- Use BisSeq to scan the genome for novel regions of differential methylation.
The Bismark package includes its own tools for processing alignments. It cannot use novoalign-generated alignment files.
Identified regions of interest may be analyzed in similar manner to ChIPSeq in terms of annotation, etc.