Analysis strategy and method to improve accuracy of imputation on rare variants
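A study of analysis strategies and methods for improving the imputation accuracy of rare variants (희귀변이의 임퓨테이션 정확도 향상을 위한 분석 전략 및 방법 연구)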
- College of Natural Sciences, Interdisciplinary Program in Bioinformatics
- Issue Date
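- Graduate School, Seoul National University
- Thesis (Ph.D.) -- Graduate School, Seoul National University: Interdisciplinary Program in Bioinformatics, August 2015. Park Taesung (박태성).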
- Rare variants have attracted much attention as a possible source of missing heritability. Rapid development of high-throughput sequencing technology has made it possible to discover a large number of rare variants. Although next-generation sequencing is becoming a powerful tool in genomics, large-scale population-based genome studies are not yet feasible because of its high cost and heavy computational requirements. As alternatives, two approaches, imputation and customized chips such as the exome array and Metabochip, have been widely used in large-scale genome studies. Imputation is a cost-effective approach that imputes rare variants into existing genotype data. Imputation analysis generally requires two panels as input: the reference panel is the template for predicting untyped markers, and the genotype panel is the target of imputation. After imputation, the genotype panel contains both the experimentally genotyped markers and the genotypes predicted from the reference panel. However, imputing rare variants is very challenging because their imputation accuracy is low. Moreover, low accuracy of imputed rare variants can distort the results of region-based association tests. Customized chips are designed to contain rare variants, but only for specific targets. Therefore, new analysis strategies and methods for obtaining rare variants are urgently needed.
In this study, we developed two novel rare variant imputation approaches: a combined approach and a pre-collapsing imputation approach. We also applied both approaches to real data, performing an imputation-based association study of liver enzyme traits.
First, we proposed a combined approach that imputes a genotype panel consisting of combined GWAS chip and exome array data. Its effectiveness and performance were demonstrated using a reference panel comprising exome sequencing, exome array, and GWAS chip data from the same 848 samples, and a genotype panel of 5,349 samples with exome array and GWAS chip data. The combined approach increased imputation accuracy by about 11% and roughly doubled genomic coverage for rare variants (MAF < 1%) compared with imputing a genotype panel with GWAS chip data alone. The combined approach showed better imputation performance regardless of the sample size of the reference panel. It also outperformed the previously reported two-step imputation approach.
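The two quantities used above, minor allele frequency (MAF) and imputation accuracy, can be illustrated with a minimal sketch. It assumes genotypes are coded as 0/1/2 copies of the alternate allele and that accuracy is measured as the squared Pearson correlation between true genotypes and imputed dosages (the usual meaning of dosage r²). The function names and the toy data are hypothetical and are not taken from the thesis.

```python
def minor_allele_frequency(genotypes):
    """MAF for genotypes coded as 0/1/2 copies of the alternate allele."""
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1 - alt_freq)


def dosage_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true genotypes and imputed dosages."""
    n = len(true_genotypes)
    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(true_genotypes, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    if var_t == 0 or var_d == 0:
        # Correlation is undefined without variation; 0.0 is a convention here.
        return 0.0
    return cov * cov / (var_t * var_d)


# Toy data (assumed): 200 samples, two heterozygous carriers of a rare allele.
true_genotypes = [1, 1] + [0] * 198
# Hypothetical imputed dosages: one carrier recovered well, the other mostly missed.
imputed_dosages = [0.9, 0.1] + [0.0] * 198

maf = minor_allele_frequency(true_genotypes)
if maf < 0.01:  # the rare-variant threshold used in the text
    print("rare variant, dosage r2 =", dosage_r2(true_genotypes, imputed_dosages))
```

With so few carriers, missing a single one lowers the correlation sharply, which is one way to see why rare variants are harder to impute accurately than common ones.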
Second, we developed a new method, the pre-collapsing based imputation approach (PreCimp), to increase the imputation accuracy of collapsed variables. Unlike previously introduced imputation approaches, PreCimp incurs only computational cost. PreCimp consists of two steps. In the first step, collapsed variables are generated from the rare variants in the reference panel, and a new reference panel is constructed by inserting these pre-collapsed variables (PCVs) into the reference panel. In the second step, a typical imputation analysis with the new reference panel yields imputed genotypes of the collapsed variables. We demonstrated the performance of PreCimp on 5,349 genotyped samples using a Korean population-specific reference panel of 848 samples with exome sequencing, Affymetrix 5.0, and exome chip data. PreCimp outperformed a traditional post-collapsing method, which collapses imputed variants after imputing each rare variant individually. Although PreCimp performed poorly for genes larger than 200 kb (about 3% of all genes), splitting large genes into smaller sub-regions mitigated this problem. Compared with the post-collapsing method, PreCimp increased imputation accuracy by about 3.4 to 6.3% (dosage r² 0.6 to 0.8), 10.9 to 16.1% (dosage r² 0.4 to 0.6), and 21.4 to 129.4% (dosage r² below 0.4).
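The first PreCimp step can be sketched as follows. The abstract does not state the collapsing rule, so this example assumes a simple carrier indicator: for each gene, a reference haplotype gets allele 1 if it carries the minor allele at any rare variant in that gene. It also assumes alleles are coded so that 1 is the minor allele. The data layout, the `PCV_` naming, and the function name are hypothetical.

```python
def pre_collapse(reference, gene_to_variants, maf_threshold=0.01):
    """Return a copy of the reference panel with one pre-collapsed variable per gene.

    reference: dict mapping variant id -> list of 0/1 alleles, one per haplotype
               (assumption: 1 denotes the minor allele).
    gene_to_variants: dict mapping gene name -> list of variant ids in that gene.
    """
    augmented = dict(reference)
    for gene, variant_ids in gene_to_variants.items():
        # Keep only polymorphic variants below the rare-variant threshold.
        rare = []
        for vid in variant_ids:
            alleles = reference.get(vid)
            if alleles is None:
                continue
            freq = sum(alleles) / len(alleles)
            if 0 < freq < maf_threshold:
                rare.append(vid)
        if not rare:
            continue
        n_haplotypes = len(reference[rare[0]])
        # Collapsed allele: 1 if the haplotype carries any rare minor allele.
        augmented["PCV_" + gene] = [
            int(any(reference[vid][h] for vid in rare))
            for h in range(n_haplotypes)
        ]
    return augmented


# Toy reference panel (assumed): 400 haplotypes, two rare and one common variant.
n = 400
reference = {
    "rare_a": [1 if h == 0 else 0 for h in range(n)],
    "rare_b": [1 if h == 5 else 0 for h in range(n)],
    "common_c": [1 if h < 100 else 0 for h in range(n)],
}
genes = {"GENE1": ["rare_a", "rare_b", "common_c"]}

new_reference = pre_collapse(reference, genes)
print(sum(new_reference["PCV_GENE1"]), "haplotypes carry a rare allele in GENE1")
```

The augmented panel would then be passed to an ordinary imputation tool in the second step, so that the collapsed variable is imputed like any other marker. A post-collapsing method would instead apply the same kind of rule to the imputed genotypes of the individual rare variants.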
The two imputation approaches were applied to real data in an imputation-based association analysis of liver enzymes. Using a whole-exome reference panel, imputation was performed on 8,529 samples with combined GWAS chip and exome chip data. Subsequent association analysis of about half a million imputed and genotyped variants revealed 20 loci associated with variation in liver enzymes (P < 5 × 10⁻⁶). Among them, 7 novel loci, including two missense variants, were discovered.
Taken together, two novel rare variant imputation approaches were developed and applied to real data. The imputation-based association analysis of liver enzymes yielded several novel findings. This study proposes efficient analysis approaches for enhancing the imputation accuracy of rare variants. In addition, the variants discovered in the real data analysis will be a valuable resource for understanding rare variants and their associations with various phenotypes.