Statistical analysis for large-scale sequencing dataset using pathway information

Sungyoung Lee (이성영)

Seoul National University Central Library

Statistical analysis for large-scale sequencing dataset using pathway information : 패스웨이 정보를 이용한 대용량 유전체 자료의 통계적 분석

DC Field	Value	Language
dc.contributor.advisor	Taesung Park (박태성)	-
dc.contributor.author	Sungyoung Lee (이성영)	-
dc.date.accessioned	2018-05-28T17:14:08Z	-
dc.date.available	2018-05-28T17:14:08Z	-
dc.date.issued	2018-02	-
dc.identifier.other	000000151138	-
dc.identifier.uri	https://hdl.handle.net/10371/141165	-
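dc.description	Thesis (Ph.D.) -- Seoul National University Graduate School : College of Natural Sciences, Interdisciplinary Program in Bioinformatics, 2018. 2. Taesung Park.	-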
dc.description.abstract	In the past two decades, rapid advances in DNA sequencing technology have enabled extensive investigations into human genetic architecture, especially the identification of genetic variants associated with complex traits. In particular, genome-wide association studies (GWAS) have played a key role in identifying genetic associations between single nucleotide variants (SNVs) and many complex biological pathologies. However, the genetic variants identified by many successful GWAS explain only a modest part of the heritability of most phenotypes, and many hypotheses have been proposed to address the so-called missing heritability issue, such as rare variant association, gene-gene interaction, and multi-omics integration. Methods for rare variant analysis arose from extending individual variant-level approaches to the gene level, and then extending gene-level approaches to multiple phenotypes. In line with this trend, and as the number of publicly available biological resources increases, recent methods for analyzing rare variants use pathway knowledge as a priori information. Many statistical methods for pathway-based analysis of rare variants have been proposed, but they analyze pathways individually. Neglecting correlations between multiple pathways can produce misleading results, and pathway-based analyses of large-scale genetic datasets impose a massive computational burden. Moreover, while a number of methods for pathway-based rare-variant analysis of multiple phenotypes have been proposed, none uses a unified model that incorporates multiple pathways. In this thesis, we propose novel statistical methods for analyzing large-scale genetic datasets using pathway information: Pathway-based approach using HierArchical components of collapsed RAre variants Of High-throughput sequencing data (PHARAOH) and PHARAOH-multi. PHARAOH extends generalized structured component analysis and is implemented within the framework of generalized linear models to accommodate phenotype data arising from a variety of exponential family distributions. PHARAOH constructs a single hierarchical model that consists of collapsed gene-level summaries and pathways, and analyzes all pathways simultaneously by imposing ridge-type penalties on both gene and pathway coefficient estimates; hence, our method accounts for the correlation between pathways and handles an entire dataset in a single model. PHARAOH-multi further extends the original model to multivariate analysis while keeping the advantages of PHARAOH, enabling the analysis of multiple traits using hierarchical components of genetic variants. With a single model, PHARAOH-multi can identify associations between multiple phenotypes and multiple pathways, with genes nested within pathways as a hierarchy. In simulation studies, PHARAOH showed higher statistical power than existing pathway-based methods. A detailed simulation study for PHARAOH-multi demonstrated the advantages of multivariate analysis over univariate analysis, and comparison studies showed that the proposed approach outperforms existing multivariate pathway-based methods. Finally, we analyzed whole-exome sequencing data from a Korean population study to compare the performance of the proposed methods with that of previous pathway-based methods, using validated pathway databases. As a result, PHARAOH successfully discovered 13 pathways for the liver enzymes, and PHARAOH-multi identified 8 pathways for multiple metabolic traits. Through a replication study using an independent, large-scale exome chip dataset, we replicated many of the pathways discovered by the proposed methods and showed their biological relationship to the target traits.	-
dc.description.tableofcontents	Introduction 1 1.1. The background on genetic association studies 1 1.1.1. Genome-wide association studies and the missing heritability 1 1.1.2. Rare variant analyses 6 1.2. The purpose of this study 10 1.3. Outline of the thesis 12 An overview of existing methods 13 2.1. Review of pathway-based methods 13 2.2.1. Competitive and self-contained tests: WKS and DRB 16 2.2.2. Self-contained test: aSPU 19 2.2.3. Self-contained test: MARV 21 2.3. Generalized structured component analysis 23 2.3.1. The model 23 2.3.2. Parameter estimation 25 Pathway-based approach using rare variants 27 3.1. Introduction 27 3.2. Methods 29 3.2.1. Notations and the model 29 3.2.2. An exemplary structure 32 3.3.3. Parameter estimation 33 3.4. Simulation study 37 3.4.1. The simulation dataset 37 3.4.2. Comparison of methods using simulation dataset 38 3.5. Application to analysis of liver enzymes 44 3.5.1. Whole exome sequencing dataset for pathway discovery 44 3.5.2. Replication study using exome chip dataset 53 3.6. Discussion 56 Multivariate pathway-based approach using rare variants 60 4.1. Introduction 60 4.2. Methods 61 4.2.1. Notations and the model 61 4.2.2. An exemplary structure 63 4.2.3. Parameter estimation 66 4.2.4. Significance testing 69 4.2.5. Multiple testing correction 75 4.3. Simulation study 77 4.3.1. The simulation model 74 4.3.2. Evaluation with simulated data 88 4.4. Application to the real datasets 88 4.4.1. Real data discovery from whole-exome sequencing dataset 95 4.4.2. Replication study using independent exome chip dataset 98 4.5. Discussion 99 Summary & Conclusions 104 Bibliography 108 Abstract (in Korean) 127	-
dc.format	application/pdf	-
dc.format.extent	3735320 bytes	-
dc.format.medium	application/pdf	-
dc.language.iso	en	-
dc.publisher	Seoul National University Graduate School	-
dc.subject	genome	-
dc.subject	pathway	-
dc.subject	big data	-
dc.subject	statistical analysis	-
dc.subject	GWAS	-
dc.subject	NGS	-
dc.subject	rare variants	-
dc.subject	multivariate analysis	-
dc.subject	generalized linear model	-
dc.subject.ddc	574.8732	-
dc.title	Statistical analysis for large-scale sequencing dataset using pathway information	-
dc.title.alternative	패스웨이 정보를 이용한 대용량 유전체 자료의 통계적 분석	-
dc.type	Thesis	-
dc.contributor.AlternativeAuthor	Sungyoung Lee	-
dc.description.degree	Doctor	-
dc.contributor.affiliation	College of Natural Sciences, Interdisciplinary Program in Bioinformatics	-
dc.date.awarded	2018-02	-

Appears in Collections:

College of Natural Sciences (자연과학대학)
- Program in Bioinformatics (협동과정-생물정보학전공)
  - Theses (Ph.D. / Sc.D._협동과정-생물정보학전공)

Files in This Item:

000000151138.pdf 3.56 MB
