Gene set analysis for Genome-Wide Association Study and Next Generation Sequencing data
Korean title (translated): Gene set analysis of genome-wide association study data and next-generation sequencing data
- College of Natural Sciences, Department of Statistics
- Issue Date
- Seoul National University Graduate School
- Genome-wide association study (GWAS); Next Generation Sequencing (NGS); Gene-set analysis (GSA); rare variant association test; co-mutation
- Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Statistics, 2013. 2. 박태성.
- Genome-wide association studies (GWAS) have successfully identified thousands of common genetic variants, mainly single nucleotide polymorphisms (SNPs), associated with complex traits, including many common diseases. Because of the computational burden, most GWAS methods test one SNP at a time and report only a list of the most significant SNPs or related genes.
However, complex diseases often result from the combined action of multiple risk factors, so single-SNP analysis may miss genetic variants that affect risk jointly but have little individual effect. It has also been suggested that the associated variants explain only a small fraction of the heritability of most common traits.
To resolve these issues, it has been suggested to use prior biological knowledge or known pathway information to analyze sets of related SNPs together, which also reduces the number of tests. This approach was motivated by gene-set analysis (GSA), which is widely used in the analysis of microarray data. GSA focuses on gene-sets rather than individual genes and combines the weak signals from the genes in a set when each gene is only weakly associated with the trait. Considering multiple SNPs jointly within a gene-set can therefore increase power in GWAS.
When multiple SNPs are considered jointly, the corresponding SNP-level association measures are likely to be correlated because of linkage disequilibrium (LD) among SNPs. We propose the SNP-based parametric robust analysis of gene-set enrichment (SNP-PRAGE) method, which handles the correlation among SNP-level association measures adequately and minimizes computing effort through a parametric assumption. SNP-PRAGE first derives gene-level association measures from SNP-level association measures by incorporating the size of the corresponding (or nearby) genes and the LD structure among SNPs. It then computes a gene-set-level summary over genes that share the same biological knowledge (for example, membership in the same pathway). This two-step summarization makes the within-set association measures independent of each other, so the central limit theorem can be applied appropriately for the parametric model.
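The two-step idea (SNP level to gene level, then gene level to gene-set level) can be illustrated with a small sketch. This is not the SNP-PRAGE statistic itself: the gene-level step below is a plain average that ignores gene size and LD, and the set-level step is a generic parametric z-score that compares the set mean with the mean of all genes. The function names and the SNP and gene identifiers are hypothetical.

```python
import math
from statistics import mean, pstdev


def gene_level_scores(snp_scores, snp_to_gene):
    """Step 1: summarize SNP-level association scores per gene.

    Assumption: a simple average. The thesis additionally accounts for
    gene size and LD among SNPs, which this sketch does not attempt.
    """
    by_gene = {}
    for snp, score in snp_scores.items():
        by_gene.setdefault(snp_to_gene[snp], []).append(score)
    return {gene: mean(values) for gene, values in by_gene.items()}


def gene_set_z(gene_scores, gene_set):
    """Step 2: parametric z-score of a gene set against all genes.

    z = (set mean - overall mean) * sqrt(set size) / overall sd.
    Under independence of gene-level scores, the central limit theorem
    motivates comparing z with a standard normal distribution.
    """
    all_scores = list(gene_scores.values())
    mu = mean(all_scores)
    sigma = pstdev(all_scores)
    members = [gene_scores[g] for g in gene_set if g in gene_scores]
    return (mean(members) - mu) * math.sqrt(len(members)) / sigma
```

Averaging within genes first means each gene contributes one value to the set-level statistic, which is what makes a normal approximation at the set level plausible.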
In addition, the study of rare variants offers another way past the limitations of GWAS. Rare variants are defined as variants with a minor allele frequency below one percent. There is growing evidence that rare variants contribute to the etiology of complex diseases, and it has been argued that collections of rare variants could account for the missing heritability of common traits. Because the frequencies of rare variants are very low, an association with any single rare variant is difficult to detect, even when penetrance is high. Hence the most popular statistical tests for GWAS, which test single SNPs, are not expected to perform well.
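As a concrete reading of the one-percent definition, the following minimal sketch computes the minor allele frequency (MAF) of each variant and keeps those below the threshold. It assumes genotypes are coded 0/1/2 as copies of one allele and that each variant is given as a list over individuals. Excluding monomorphic sites (MAF of zero) is an added assumption, not something stated in the abstract.

```python
def minor_allele_frequency(genotypes):
    """MAF of one variant from genotypes coded 0/1/2 (copies of one allele)."""
    allele_freq = sum(genotypes) / (2 * len(genotypes))
    # The minor allele is whichever of the two alleles is less frequent.
    return min(allele_freq, 1.0 - allele_freq)


def rare_variant_indices(variant_columns, threshold=0.01):
    """Indices of variants with 0 < MAF < threshold.

    Assumption: monomorphic sites (MAF == 0) are dropped because they
    carry no information about association.
    """
    return [
        j
        for j, column in enumerate(variant_columns)
        if 0.0 < minor_allele_frequency(column) < threshold
    ]
```

With 100 individuals, a variant carried by a single heterozygote has a MAF of 1/200 and counts as rare, which shows why a single-variant test has so few informative observations.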
Recently, a growing number of methods have emerged to overcome this limitation. The three main strategies in this field are collapsing methods, weighting/prioritizing methods, and distribution-based methods. However, assessing the association between rare variants and complex disease is still a challenging task, and no single method gives consistently acceptable power across the range of these relationships, even with a large sample size.
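Of the three strategies, collapsing is the simplest to sketch. The example below is a generic burden-style test for a quantitative trait, not any of the specific published methods: each individual's rare minor alleles in a gene are summed into one score, and the score is tested against the trait with the usual correlation t-statistic. It assumes genotypes count copies of the minor allele, that each row is one individual, and that at least one individual carries a rare allele. All names are hypothetical.

```python
import math


def burden_scores(genotype_rows, rare_idx):
    """Collapse rare variants: count of rare minor alleles per individual."""
    return [sum(row[j] for j in rare_idx) for row in genotype_rows]


def pearson_r(x, y):
    """Sample Pearson correlation (requires non-constant x and y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def burden_t_statistic(scores, trait):
    """t-statistic for the slope of trait on burden score (n - 2 df)."""
    r = pearson_r(scores, trait)
    if abs(r) >= 1.0:
        # Perfect linear fit: the statistic is unbounded.
        return math.copysign(math.inf, r)
    n = len(scores)
    return r * math.sqrt((n - 2) / (1.0 - r * r))
```

A plain sum treats every rare allele as acting in the same direction with equal effect, so power drops when a gene contains both risk and protective variants. This is one reason no single strategy performs well in every scenario.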
To obtain an association method that is consistently powerful across the various scenarios of rare-variant action, we propose quadratic tests (QTests: Q1, Q2, and Q3) at the gene level for quantitative traits. These methods are computationally efficient and have relatively high power across various patterns of disease-associated rare variants. To further increase the power to detect genetic association with rare variants, we also propose the QTest for gene-sets (QGS), which extends the unit of analysis from genes to gene-sets. When combining the gene-level statistics, we use a co-mutation-based weight. The rationale is that genes that interact strongly with neighboring genes usually play an important role within the gene-set.
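The abstract does not define the co-mutation weight, so the following is only an illustration of the general idea of up-weighting well-connected genes. It is not the QGS statistic. It assumes two genes are co-mutated in a sample when that sample carries a rare allele in both, takes a gene's weight to be its share of all co-mutation counts in the set, and combines the gene-level statistics as a weighted sum. Every name and the weighting rule itself are hypothetical.

```python
from itertools import combinations


def comutation_weights(carriers_by_gene):
    """Weights from co-mutation counts (hypothetical weighting rule).

    carriers_by_gene maps each gene to the set of sample ids that carry
    at least one rare allele in that gene.
    """
    degree = {gene: 0 for gene in carriers_by_gene}
    for g1, g2 in combinations(carriers_by_gene, 2):
        shared = len(carriers_by_gene[g1] & carriers_by_gene[g2])
        degree[g1] += shared
        degree[g2] += shared
    total = sum(degree.values())
    if total == 0:
        # No co-mutation observed: fall back to equal weights.
        return {gene: 1.0 / len(degree) for gene in degree}
    return {gene: d / total for gene, d in degree.items()}


def weighted_gene_set_statistic(gene_stats, weights):
    """Weighted sum of gene-level statistics over the gene set."""
    return sum(weights[gene] * stat for gene, stat in gene_stats.items())
```

Under this rule a gene that is never co-mutated with any other gene in the set receives zero weight, which is a strong choice. A real implementation would follow the weight definition given in the thesis body.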
These association tests can cover a broad range of scenarios for the joint action of rare variants, including the presence of common disease variants. We demonstrate the performance of the proposed methods by comparing them with other gene-level and gene-set-level association methods in various simulation settings. The gene-level comparators include collapsing methods (GRANVIL, the Variable Threshold (VT) method, and the Weighted Sum Statistic (WSS)) and distribution-based approaches (SKAT, SKAT-O). GLOSSI and GlobalTest serve as traditional gene-set-level comparators.
We also applied our methods to real data. SNP-PRAGE was applied to two GWAS data sets: hypertension data on 8,842 samples from the Korean population and bipolar disorder data on 4,806 samples from the Wellcome Trust Case Control Consortium (WTCCC). The quadratic test (QTest) was applied to sequence data and the alanine aminotransferase (ALT) phenotype of 1,058 samples from the Korean population.