Analysis strategy and method to improve accuracy of imputation on rare variants
희귀변이의 임퓨테이션 정확도 향상을 위한 분석 전략 및 방법 연구

dc.description: Thesis (Ph.D.) -- Seoul National University Graduate School: Interdisciplinary Program in Bioinformatics, 2015. 8. 박태성.
dc.description.abstract: Rare variants have attracted much attention as a possible source of missing heritability. Rapid development of high-throughput sequencing technology has enabled the discovery of a large number of rare variants. Although next-generation sequencing is becoming a powerful tool in genomics, large-scale population-based genome studies are not yet feasible because of its high cost and the computing power it requires. As alternatives, two approaches have been widely used in large-scale genome studies: imputation and customized chips such as the exome array and Metabochip. Imputation is a cost-effective approach that imputes rare variants into existing genotype data. Imputation analysis generally requires two panels as input: the reference panel is the template for predicting untyped markers, and the genotype panel is the target of the imputation. After imputation, the genotype panel contains both the experimentally obtained genotypes and the genotypes predicted from the reference panel. However, imputing rare variants is very challenging because imputed rare variants have low accuracy, and this low accuracy can distort the results of region-based association tests. Customized chips are designed to contain rare variants, but only for specific targets. Therefore, new analysis strategies and methods for obtaining rare variants are urgently needed.
In this study, we developed two novel rare variant imputation approaches: the combined approach and the pre-collapsing imputation approach. We also applied the two approaches to real data by performing an imputation-based association study of liver enzyme traits.
First, we proposed the combined approach, which imputes a genotype panel consisting of combined GWAS chip and exome array data. Its effectiveness and performance were demonstrated using a reference panel comprising exome sequencing, exome array, and GWAS chip data from the same 848 samples, and a genotype panel of 5,349 samples with exome array and GWAS chip data. Compared with imputation of a genotype panel containing GWAS chip data alone, the combined approach increased imputation accuracy by about 11% and roughly doubled genomic coverage for rare variants (MAF < 1%). The combined approach showed better imputation performance regardless of the sample size of the reference panel, and it also outperformed the previously reported two-step imputation approach.
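The rare-variant category used above (MAF < 1%) can be illustrated with a minimal sketch. It assumes genotypes coded as 0/1/2 copies of the alternate allele with `None` for missing calls; the function names and the dictionary layout of the panel are hypothetical and are not taken from the thesis.

```python
def minor_allele_frequency(genotypes):
    """Return the minor allele frequency of one variant.

    Assumes genotypes are coded 0/1/2 (copies of the alternate allele)
    and that None marks a missing call. Returns None if nothing is called.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    alt_freq = sum(called) / (2 * len(called))
    # The minor allele is whichever allele is less frequent.
    return min(alt_freq, 1 - alt_freq)


def rare_variants(panel, maf_threshold=0.01):
    """Return IDs of polymorphic variants whose MAF is below the threshold.

    `panel` is a hypothetical layout: a dict mapping variant ID to a list
    of per-sample genotypes. Monomorphic variants (MAF == 0) are skipped.
    """
    rare = []
    for variant_id, genotypes in panel.items():
        maf = minor_allele_frequency(genotypes)
        if maf is not None and 0 < maf < maf_threshold:
            rare.append(variant_id)
    return rare
```

With 100 samples, a variant carried by a single heterozygote has MAF 1/200 = 0.005 and would be classed as rare under this threshold.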
Second, we developed a new method, the pre-collapsing-based imputation approach (PreCimp), to increase the imputation accuracy of collapsed variables. Unlike previously introduced imputation approaches, PreCimp requires only computational cost. PreCimp consists of two steps. In the first step, collapsed variables are generated from rare variants in the reference panel, and a new reference panel is constructed by inserting these pre-collapsed variables (PCVs) into the reference panel. In the second step, a typical imputation analysis with the new reference panel provides the imputed genotypes of the collapsed variables. We demonstrated the performance of PreCimp on 5,349 genotyped samples using a Korean population-specific reference panel of 848 samples with exome sequencing, Affymetrix 5.0, and exome chip data. PreCimp outperformed the traditional post-collapsing method, which collapses imputed variants after single rare variant imputation. Although PreCimp performed poorly for genes larger than 200 kb (about 3% of all genes), splitting large genes into small sub-regions mitigated this problem. Compared with the post-collapsing method, PreCimp increased imputation accuracy by about 3.4 to 6.3% (dosage r² 0.6 to 0.8), 10.9 to 16.1% (dosage r² 0.4 to 0.6), and 21.4 to 129.4% (dosage r² below 0.4).
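The collapsing step and the accuracy measure can be sketched as follows. This is only an illustration under two assumptions that the abstract does not confirm: that a collapsed variable is a carrier indicator (1 if a sample carries any rare allele in the region, otherwise 0), and that dosage r² is the squared Pearson correlation between the imputed dosage and the observed value. The function names are hypothetical.

```python
def collapse_region(genotype_rows):
    """Collapse rare variants in one region into a per-sample indicator.

    `genotype_rows` holds one list per variant, each with one 0/1/2
    genotype per sample. Assumed rule (not confirmed by the source):
    a sample scores 1 if it carries at least one rare allele, else 0.
    """
    n_samples = len(genotype_rows[0])
    return [
        1 if any(row[i] > 0 for row in genotype_rows) else 0
        for i in range(n_samples)
    ]


def dosage_r2(observed, imputed):
    """Squared Pearson correlation between observed and imputed values.

    Returns 0.0 when either input has no variance, since the correlation
    is undefined in that case.
    """
    n = len(observed)
    mean_o = sum(observed) / n
    mean_i = sum(imputed) / n
    cov = sum((o - mean_o) * (d - mean_i) for o, d in zip(observed, imputed))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_i = sum((d - mean_i) ** 2 for d in imputed)
    if var_o == 0 or var_i == 0:
        return 0.0
    return cov * cov / (var_o * var_i)
```

In a pre-collapsing workflow, an indicator like this would be built in the reference panel before imputation. In a post-collapsing workflow, the same rule would be applied to imputed single-variant genotypes afterwards. In either case the resulting collapsed variable could be compared with the observed one using `dosage_r2`.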
The two imputation approaches were applied to real data in an imputation-based association analysis of liver enzymes. Using a whole-exome reference panel, imputation was performed on 8,529 samples with combined GWAS chip and exome chip data. Subsequent association analysis of about half a million imputed and genotyped variants revealed 20 loci associated with variation in liver enzymes (P < 5 × 10^-6). Among them, 7 novel loci, including two missense variants, were discovered.
Taken together, two novel rare variant imputation approaches were developed and applied to real data analysis. The imputation-based association analysis of liver enzymes yielded several novel findings. This study proposes efficient analysis approaches for enhancing the imputation accuracy of rare variants. In addition, the variants discovered in the real data analysis will be a valuable resource for understanding rare variants and their association with various phenotypes.
dc.description.tableofcontents:
Abstract i
List of Tables viii
List of Figures ix

Chapter 1 Introduction 1
1.1 Background and Motivation 1
1.1.1 Genome-wide association study 1
1.1.2 Genotype imputation 4
1.1.3 Missing heritability 9
1.1.4 Rare variant imputation 9
1.2 Objective of the research 10

Chapter 2 Imputation approach using combined data 12
2.1 Introduction 12
2.2 Materials and Methods 14
2.2.1 Overview of combined approach 14
2.2.2 Exome sequencing 18
2.2.3 GWAS and exome chip genotyping 18
2.2.4 Building reference panel 19
2.2.5 Building genotype panel 21
2.2.6 Two-step imputation approach 21
2.2.7 Statistical analysis 22
2.3 Results 22
2.3.1 Selecting MAF threshold for non-imputable variants 22
2.3.2 Comparison of imputation accuracy among genotype panels 25
2.3.3 Comparison of genomic coverage among genotype panels 28
2.3.4 Sample size effect of reference panel and comparison analysis 30
2.4 Discussion 35

Chapter 3 Pre-collapsing imputation approach 38
3.1 Introduction 38
3.2 Materials and Methods 46
3.2.1 Subjects 46
3.2.2 Exome sequencing 46
3.2.3 GWAS and exome chip genotyping 47
3.2.4 Building the population specific exome reference panel 48
3.2.5 Pre-collapsing and post-collapsing based imputation 48
3.2.6 Comparison of imputation performance 50
3.2.7 Statistical analysis 50
3.3 Results 51
3.3.1 PostC vs. PreCimp methods 51
3.3.2 PreCimp with additional information 60
3.3.3 Effect of PCV position on imputation performance 64
3.3.4 Example of PreCimp and PostC in association study 66
3.4 Discussion 70

Chapter 4 Imputation based association analysis on liver enzyme traits 73
4.1 Introduction 73
4.2 Materials and Methods 74
4.2.1 Subjects 74
4.2.2 GWAS and exome chip genotyping 75
4.2.3 Building the population specific exome reference panel 76
4.2.4 Statistical analysis 76
4.3 Results 77
4.4 Discussion 92

Chapter 5 Summary and Conclusion 93

References 96
Abstract (Korean) 109
dc.format.extent: 3577182 bytes
dc.publisher: Seoul National University Graduate School
dc.subject: rare variant
dc.subject: genome-wide association study
dc.title: Analysis strategy and method to improve accuracy of imputation on rare variants
dc.title.alternative: 희귀변이의 임퓨테이션 정확도 향상을 위한 분석 전략 및 방법 연구
dc.contributor.AlternativeAuthor: Young Jin Kim
dc.citation.pages: x, 112
dc.contributor.affiliation: College of Natural Sciences, Interdisciplinary Program in Bioinformatics
Appears in Collections:
College of Natural Sciences (자연과학대학) > Program in Bioinformatics (협동과정-생물정보학전공) > Theses (Ph.D. / Sc.D._협동과정-생물정보학전공)
Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.