Dimension Reduction Methods for Multi-source and Manifold-valued Data

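Abstract: This thesis addresses dimension reduction methods in non-Euclidean spaces. In a non-Euclidean space, curvature makes it difficult to use tools that are standard in Euclidean space, such as the Pythagorean theorem. To understand the structure of data, it is important to understand the geometric properties of the given non-Euclidean space. To generalize widely used multivariate analysis methods, such as principal component analysis and factor analysis, to non-Euclidean spaces, this thesis proposes the following two methods.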

In Chapter 2, we propose the Principal Structure Identification (PSI) method for analyzing multi-source data.

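Multi-source data are data collected on several groups of variables from the same set of subjects. Multi-omics data, which have recently attracted much attention, are a good example. To analyze multi-source data from an integrative perspective, we check whether common scores exist among the data sources. A common score may be shared by all data sources, shared by only some of them, or specific to a single data source.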

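One of the main features of our method is that it uses the linear subspace of factor scores of each data source as the basic geometric element for revealing associations among the sources. To avoid the inaccuracy that noise introduces when identifying common score subspaces, we compute the one-dimensional flag mean of the source-wise factor score subspaces. We propose a sequential algorithm that selects the subspaces close to this one-dimensional flag mean and groups together the data sources that share a score.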

In Chapter 3, we propose the Penalized Principal Nested Spheres (PenPNS) method for analyzing data on the hypersphere.

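Analysis of Principal Nested Spheres (PNS) (Jung, 2012) is a dimension reduction method for data on the hypersphere. In PNS, dimension reduction proceeds sequentially, removing unnecessary dimensions one at a time. A particular strength is that it fits the data with small spheres, which makes non-geodesic estimation possible.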

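However, PNS has the drawback of overfitting. Overfitting here refers to the phenomenon in which data generated along a great sphere are fitted by an excessively small sphere. We consider two kinds of overfitting. (1) Because the radius, as a parameter, ranges from $0$ to $\pi/2$, when the estimated radius exceeds $\pi/2$ the estimated axis is replaced by its antipode and the radius estimate is adjusted accordingly. Consequently, the distribution of the estimated radius is the distribution of the angle between the true axis and the points, folded in half. For this reason, the radius estimate cannot exceed $\pi/2$. (2) If the data are generated along a great sphere but only over a very short interval, the point cloud becomes close to disc-shaped. In this case, the data are fitted by a small sphere of very small radius lying inside the disc-shaped cloud.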

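PenPNS is an improved version of PNS that corrects these overfitting phenomena. To correct the first, PenPNS regularizes the radius during estimation; the penalty term grows as the radius falls further below $\pi/2$. To correct the second, PenPNS adds a penalty term to the cross-validation error. This penalty term is derived from an index of dispersion, which takes large values when a disc-shaped data cloud is fitted by a sphere of small radius. We demonstrate the effectiveness of these remedies through simulation studies and real data analysis.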

In this thesis, we discuss dimension reduction methods in non-Euclidean spaces. Because of non-zero curvature, traditional tools such as the Pythagorean theorem cannot be used to build statistical methods in a non-Euclidean space. To capture the structure of a data set, it is necessary to understand the geometric nature of the given non-Euclidean space. We propose the following two dimension reduction methods, which generalize two popular multivariate data analysis methods, factor analysis and PCA, to non-Euclidean settings.

In Chapter 2, we propose Principal Structure Identification (PSI) for multi-source data sets.

The analysis of multi-source data sets, in which data on the same objects are collected from multiple sources, is of rising importance in many fields, most notably in multi-omics biology. We propose a novel framework and algorithms for the integrative decomposition of such multi-source data. The goal is to identify common factor scores and to sort them according to whether they are relevant to all data sources (fully joint), to some data sources (partially joint), or to a single data source.

The key difference between our proposal and existing approaches is that we use the raw source-wise factor score subspaces to identify the partially joint, block-wise association structure. To identify common score subspaces, which may be shared by only some of the data sources, from noisy observations, our proposed algorithm sequentially computes one-dimensional flag means among the source-wise score subspaces and then collects the subspaces that are close to each mean.
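
A minimal sketch of one such step, under assumptions not stated in this abstract: each source is represented by a matrix with orthonormal columns spanning its score subspace, the one-dimensional flag mean is taken as the leading left singular vector of the concatenated bases, and closeness is judged by a hypothetical angle threshold `max_angle`. The function names are illustrative and this is not the thesis's implementation.

```python
import numpy as np

def flag_mean_direction(bases):
    """Leading left singular vector of the column-concatenated bases.

    Each element of `bases` is an (n, k_i) matrix with orthonormal columns.
    """
    stacked = np.hstack(bases)
    u, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return u[:, 0]

def angles_to_direction(bases, direction):
    """Angle in radians between a unit vector and each subspace."""
    cosines = [np.linalg.norm(b.T @ direction) for b in bases]
    return np.arccos(np.clip(cosines, 0.0, 1.0))

def sources_sharing_direction(bases, max_angle):
    """Return the mean direction and the indices of sources close to it.

    `max_angle` is a hypothetical tuning threshold (an assumption).
    """
    direction = flag_mean_direction(bases)
    angles = angles_to_direction(bases, direction)
    return direction, [i for i, a in enumerate(angles) if a <= max_angle]

# Toy example in R^5: sources 0 and 1 share the first coordinate axis,
# source 2 is orthogonal to both.
e = np.eye(5)
bases = [e[:, [0, 1]], e[:, [0, 2]], e[:, [3, 4]]]
direction, shared = sources_sharing_direction(bases, max_angle=0.1)
```

In this toy example the mean direction aligns with the first coordinate axis (up to sign), and only the first two sources fall within the threshold. A sequential version would presumably remove the identified direction and repeat, but how that is done is not described here.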

In Chapter 3, we propose Penalized Principal Nested Spheres (PenPNS) for data sets on the hypersphere.

Analysis of Principal Nested Spheres (PNS) (Jung, 2012) is a flexible dimension reduction method for data on the hypersphere, e.g. directional data (Fisher, 1993; Fisher et al., 1993; Mardia and Jupp, 2000) and shape data (Kendall, 1984; Dryden and Mardia, 1998). In PNS, dimension reduction is an iterative procedure that discards unimportant dimensions one at a time. It is specifically designed to capture a certain type of non-geodesic variation by fitting a small sphere.
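
To make the small-sphere fit concrete, the following sketch computes, for a given axis and radius, the signed residual of each point (its angle to the axis minus the radius) and the projection of each point onto the subsphere. The axis, radius, and data are made-up values, the function names are illustrative, and the estimation of the axis and radius themselves is not shown.

```python
import numpy as np

def angle_to_axis(points, axis):
    """Geodesic distance in radians from each unit-vector row to the axis."""
    return np.arccos(np.clip(points @ axis, -1.0, 1.0))

def project_to_subsphere(points, axis, radius):
    """Move each point along the geodesic through the axis until its
    angle to the axis equals `radius`. Undefined for points at the axis
    or its antipode, where sin(rho) is zero."""
    rho = angle_to_axis(points, axis)[:, None]
    return (np.sin(radius) * points + np.sin(rho - radius) * axis) / np.sin(rho)

axis = np.array([0.0, 0.0, 1.0])
radius = np.pi / 3  # a small circle on the 2-sphere (radius below pi/2)
points = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 0.6, 0.8],
    [0.6, 0.0, -0.8],
])

residuals = angle_to_axis(points, axis) - radius
projected = project_to_subsphere(points, axis, radius)
```

Each projected point has unit norm and lies at angle `radius` from the axis, so the projected data live on a sphere of one lower dimension, to which the same step can be applied again.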

However, PNS suffers from overfitting, a phenomenon in which data points are fitted by a small sphere even though they were generated along a great sphere. We consider two types of overfitting. (1) Since the radius, as a parameter, ranges from $0$ to $\pi/2$, whenever the estimated radius $\widehat{r}$ exceeds $\pi/2$ the estimated axis $\widehat{v}$ is flipped to $-\widehat{v}$ and the radius is taken to be $\pi - \widehat{r}$. The distribution of the estimated radius then becomes a folded version of the distribution of the angle between the true axis and the data points, so the expectation of the estimated radius is less than $\pi/2$. (2) When data points are generated along a great sphere but only within a short interval, the point cloud has a disc shape and is usually fitted by a small sphere with a very small radius.
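
The folding in (1) can be illustrated numerically. In the sketch below, the normal distribution centred at $\pi/2$ is only an assumed stand-in for the sampling variability of an unconstrained radius estimate when the truth is a great sphere; the function name is illustrative.

```python
import numpy as np

def fold_small_sphere(axis, radius):
    """Return an equivalent (axis, radius) pair with radius in [0, pi/2]."""
    if radius > np.pi / 2:
        return -axis, np.pi - radius
    return axis, radius

# Assumed sampling distribution of the unconstrained radius estimate
# (illustrative only): symmetric around the true value pi/2.
rng = np.random.default_rng(0)
raw_radii = rng.normal(loc=np.pi / 2, scale=0.1, size=10_000)

# Folding maps every estimate above pi/2 to its mirror image below pi/2.
folded_radii = np.minimum(raw_radii, np.pi - raw_radii)

raw_mean = raw_radii.mean()
folded_mean = folded_radii.mean()
```

Under this assumption the raw estimates average close to $\pi/2$, while the folded estimates average strictly less, because every value above $\pi/2$ is reflected downward. For a normal distribution with standard deviation $\sigma$, the folded mean is $\pi/2 - \sigma\sqrt{2/\pi}$.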

PenPNS is an improvement of PNS that overcomes these overfitting phenomena in small sphere fitting. To deal with the first type, PenPNS regularizes the radius during estimation, with a penalty term that grows as the radius decreases away from $\pi/2$. For the second type, PenPNS adds a penalty term to the cross-validation error used to choose the tuning parameter. This penalty term, called the Index of Dispersion, takes a larger value when a disc-shaped distribution is fitted with a small radius. In simulation studies and real data analysis, we demonstrate that PenPNS successfully mitigates the overfitting phenomena.
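
The abstract does not give the form of the radius penalty. Purely as an illustration of how such a penalty pulls the estimate toward $\pi/2$, the sketch below assumes a quadratic penalty $\lambda(\pi/2 - r)^2$ added to the sum of squared residuals, with the axis held fixed. Both the penalty form and the function name are assumptions, not the thesis's definition.

```python
import numpy as np

def penalized_radius(angles, lam):
    """Minimize sum((angles - r)**2) + lam * (pi/2 - r)**2 over r.

    `angles` are the angles between the data points and a fixed axis.
    Setting the derivative to zero gives the closed form below.
    The quadratic penalty is an assumed form for illustration.
    """
    angles = np.asarray(angles, dtype=float)
    return (angles.sum() + lam * np.pi / 2) / (angles.size + lam)

# Made-up angles to the axis, all well below pi/2.
angles = np.array([1.20, 1.30, 1.25, 1.35, 1.15])

unpenalized = penalized_radius(angles, lam=0.0)
shrunk = [penalized_radius(angles, lam) for lam in (0.0, 1.0, 10.0, 100.0)]
```

With `lam=0` the estimate is the mean angle. As `lam` increases, the estimate moves monotonically toward $\pi/2$, which corresponds to a great sphere. In this simplified setting the tuning parameter plays the role that the cross-validation step in the text is meant to select.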

Language: eng

URI: https://hdl.handle.net/10371/194386

https://dcollection.snu.ac.kr/common/orgView/000000176064

Files in This Item:

000000176064.pdf 5.80 MB

Appears in Collections:

College of Natural Sciences (자연과학대학)
- Dept. of Statistics (통계학과)
  - Theses (Ph.D. / Sc.D._통계학과)
