A Statistical Method on DNA Methylation Calling and its Application with Next generation sequencing technique



College of Natural Sciences, Department of Statistics
Seoul National University Graduate School
DNA Methylation; Next Generation Sequencing (NGS); Bisulfite treatment; Methylation binary calling; Differential Methylation Region (DMR) test
Thesis (Ph.D.) -- Seoul National University Graduate School : Department of Statistics, August 2015. 박태성.
Epigenomics is the study of biological factors that induce mitotically and/or meiotically heritable changes in gene function without changes in the deoxyribonucleic acid (DNA) sequence. Representative mechanisms that produce such changes are DNA methylation and histone modification. These two mechanisms control gene expression by changing the binding affinities of transcription factors and/or altering patterns of DNA packing, rather than by altering the underlying DNA sequence. These epigenetic processes play roles in imprinting, gene silencing, X chromosome inactivation, position effect, maternal effects, and the progression of carcinogenesis. The study of epigenetic processes is therefore rapidly growing in importance.
In particular, DNA methylation is one of the most interesting and most actively studied of these phenomena. Methylation is the addition of a methyl group to a substrate, or the substitution of an atom or molecular group by a methyl group. In DNA methylation, the methyl group is added to cytosine, one of the four nucleotide bases: guanine (G), adenine (A), thymine (T), and cytosine (C). A methylated cytosine becomes 5-methylcytosine. Methylation occurs most actively at cytosines that are immediately followed by a guanine in the DNA sequence; such positions are called CpG sites.
Many biochemical techniques have been developed to measure the level of methylation. The earliest techniques are based on immunoprecipitation: they use fluorescent antibodies that bind to 5-methylcytosine (methylated DNA immunoprecipitation, MeDIP). The whole genome is divided into many small regions, such as genes, and the corresponding DNA sequences are placed at their own locations on a microarray panel. After the sequences bound by fluorescent antibodies are hybridized to the reference sequences on the microarray panel, the light intensity of each microarray spot is measured and taken as the methylation intensity of the corresponding genomic region (the MeDIP-chip technique).
MeDIP-chip techniques are widely used to obtain the methylation levels of genomic regions. However, microarray-based approaches are limited because they rely on pre-defined sequences. To overcome this problem, next-generation sequencing (NGS) was combined with the MeDIP approach (MeDIP-seq). After the DNA fragments bound by fluorescent antibodies are sequenced and mapped to the reference genome, methylated CpG regions and their intensities are obtained from the sequencing read depth, which indicates whether a genomic region is methylated or non-methylated. However, MeDIP-seq also has low resolution (several dozen base pairs), because it only counts the number of mapped methylated DNA fragments as the methylation intensity.
To overcome these limits of MeDIP, a method based on NGS and bisulfite treatment has recently been developed. To discriminate between non-methylated and methylated cytosines, the technique uses bisulfite treatment, which converts non-methylated cytosines into uracil (read as thymine after amplification and sequencing) while leaving methylated cytosines unchanged. The methylation level at each CpG cytosine site is then estimated from the ratio between the numbers of cytosines and thymines observed at that site. Because this technique provides methylation information at base-pair resolution, new statistical methods are needed to handle this new type of data. The first issue is to develop a method that classifies binary methylation status (methylation calling), and the second is to develop a method that detects differentially methylated regions (DMRs).
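As a minimal illustration of the per-site estimate described above, the sketch below computes a methylation level as the fraction of reads that still show cytosine after bisulfite treatment, C / (C + T). The function name, the site labels, and the read counts are hypothetical and are not taken from the thesis.

```python
from typing import Optional


def methylation_level(c_count: int, t_count: int) -> Optional[float]:
    """Estimate the methylation level at one CpG site.

    c_count: reads showing C at the site (cytosine protected by methylation).
    t_count: reads showing T at the site (unmethylated cytosine converted).
    Returns None when the site has no coverage.
    """
    depth = c_count + t_count
    if depth == 0:
        return None
    return c_count / depth


# Hypothetical read counts per CpG site: (C reads, T reads).
sites = {"site_1": (9, 1), "site_2": (0, 7), "site_3": (0, 0)}
levels = {name: methylation_level(c, t) for name, (c, t) in sites.items()}
print(levels)
```

A site with zero coverage has no defined ratio, so the sketch returns `None` for it rather than a number.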
For these two issues, we propose two statistical approaches. For binary methylation calling, we propose a new classification tool that uses a Bayes classifier and local information (Bis-Class). This method exploits the biological phenomenon that methylation statuses are spatially correlated; as a result, it performs better than the binomial test with false discovery rate (FDR) control, especially under low coverage depth and low methylation level. We show the advantages of our method through simulation and real data analysis using a honeybee dataset. For differentially methylated region (DMR) detection, we propose a modified Cochran-Mantel-Haenszel (CMH) statistic.
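For orientation, the binomial-test baseline mentioned above can be sketched as follows. This is not Bis-Class and not the thesis's code. It assumes a null hypothesis that a site is unmethylated and that any observed C reads arise from errors at a fixed rate (`error_rate`, a hypothetical value of 0.01), and it applies the Benjamini-Hochberg procedure for FDR control. All names and counts are illustrative.

```python
from math import comb


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def benjamini_hochberg(pvalues, alpha=0.05):
    """Return booleans marking which hypotheses are rejected at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest 1-based rank r with p_(r) <= alpha * r / m.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            cutoff = rank
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


def call_methylation(counts, error_rate=0.01, alpha=0.05):
    """counts: list of (C reads, T reads) per site. True means 'called methylated'."""
    pvals = [
        binom_upper_tail(c, c + t, error_rate) if c + t > 0 else 1.0
        for c, t in counts
    ]
    return benjamini_hochberg(pvals, alpha)


# Hypothetical sites: deep and mostly C, deep and all T, shallow (1 C of 3), no coverage.
calls = call_methylation([(9, 1), (0, 7), (1, 2), (0, 0)])
print(calls)
```

In this toy input the shallow site with one C read out of three is not called, because its p-value (about 0.03) does not pass the Benjamini-Hochberg threshold. This is one way low coverage can limit a per-site test that ignores neighboring sites.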
The original CMH statistic was proposed to test conditional independence between two variables under stratification. However, because methylation levels at adjacent CpG cytosine sites are substantially spatially correlated, we additionally include a spatial correlation structure and impose biological importance weights on the binary-called base-pair resolution information. Moreover, the method can be applied to a wider variety of situations, such as the analysis of ordinal or multinomial responses. We compare our method with Fisher's exact test, which has been used for binary-called bisulfite sequencing data. With the modified CMH test, we can avoid type I error inflation and handle multiple biological replicates in each experimental group. We also conduct a simulation study and a real data analysis using a honeybee bisulfite sequencing dataset to detect differentially methylated regions.
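For reference, the original (unmodified) CMH statistic for a set of 2x2 tables can be sketched as below. This is the textbook form only: it contains neither the spatial correlation structure nor the biological importance weights of the modified statistic. Treating each CpG site as one stratum, with rows as experimental groups and columns as methylated/unmethylated calls, is an assumption made here for illustration, and all names and counts are hypothetical.

```python
from math import erfc, sqrt


def cmh_statistic(tables, continuity=True):
    """Classical Cochran-Mantel-Haenszel test for K 2x2 tables.

    tables: iterable of (a, b, c, d), one per stratum, laid out as
            [[a, b], [c, d]] with rows = groups, columns = outcomes.
    Returns (statistic, p_value); the statistic is referred to a
    chi-square distribution with 1 degree of freedom.
    """
    diff = 0.0  # sum over strata of a_k - E[a_k]
    var = 0.0   # sum over strata of Var(a_k) under the null
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small contributes no information
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        diff += a - row1 * col1 / n
        var += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    if var == 0:
        raise ValueError("all strata are degenerate; statistic is undefined")
    num = abs(diff)
    if continuity:
        num = max(num - 0.5, 0.0)
    stat = num * num / var
    # Survival function of chi-square(1): P(Z^2 > x) = erfc(sqrt(x / 2)).
    return stat, erfc(sqrt(stat / 2))


# Hypothetical strata: (group 1 methylated, group 1 unmethylated,
#                       group 2 methylated, group 2 unmethylated).
region = [(8, 2, 3, 7), (7, 3, 2, 8), (9, 1, 4, 6)]
stat, p_value = cmh_statistic(region)
print(stat, p_value)
```

Pooling evidence across strata in this way is what lets a CMH-type test use several sites and replicates at once, in contrast to applying Fisher's exact test to a single collapsed table.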
We expect our proposed methods for handling bisulfite sequencing data from NGS techniques to be widely used to elucidate the relationships between epigenetic data and many biological endpoints such as cancer, aging, and gene silencing.
Appears in Collections:
College of Natural Sciences (자연과학대학); Dept. of Statistics (통계학과); Theses (Ph.D. / Sc.D._통계학과)

Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.