Deep Learning from Structured Data

Abstract: When data possess a known structure, a framework that exploits that structure can alleviate prominent challenges of deep learning such as robustness, generalizability, and explainability. This dissertation proposes deep learning frameworks for structured data in two tasks. The first task is to develop a representation learning model that can generate nested data. For example, the VGGFace2 dataset consists of more than 300 portraits per person on average. Interpreting such nested data as i.i.d. observations of a random process provides a fruitful viewpoint on disentangling representations. From this point of view, this thesis proposes the Ornstein auto-encoder (OAE), a promising new family of models for representation learning when data have a nested structure. The key attraction of OAE is its ability to generate samples nested within an observational unit, even if the unit is unknown to the model. This feature distinguishes OAE from conditional models. Furthermore, when the data exhibit exchangeability, OAE's reparametrization of Ornstein's d-bar distance, the infinite-dimensional optimal transport distance on which the OAE framework is built, produces a tractable learning algorithm. OAE has demonstrated high performance in the three types of tasks that have been advocated for assessing the quality of generative models, namely exemplar generation, style transfer, and unit generation. This performance suggests that a framework exploiting the structure of data can help address the generalizability issues of deep learning.

The second part of this dissertation presents a study on learning a predictive model that captures the hierarchical correlation in microbiome taxonomic abundance data. Since bacteria are classified at a hierarchy of taxonomic levels, microbiome abundance data have a hierarchical correlation structure. DeepBome is a deep-neural-network-based predictive model for capturing microbiome signals at different phylogenetic depths. By leveraging the phylogenetic information, DeepBome relieves the heavy burden of tuning for the optimal deep learning architecture, avoids overfitting, and, most importantly, enables visualizing the path from microbiome counts to disease. This part of the dissertation contributes the development of the DeepBome software. Comprehensive simulation experiments demonstrate the capabilities of the software. The DeepBome model trained with the developed software shows better generalizability than other deep learning models. For both regression and classification tasks, DeepBome performs competitively against sparse regression and other deep learning models, particularly when the microbiome taxa associated with the outcome are clustered at different phylogenetic levels. In addition, DeepBome enables an explainable visualization of the microbiome-phenotype association network. In real-data analysis, the DeepBome software is able to train a high-performance predictive model and to select taxa that previous clinical research has related to the disease.
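
Abstract (translated from the Korean): When the structure of data is known, a framework that exploits this structure can help resolve important issues faced in deep learning, such as robustness, generalization, and explainability. This dissertation addresses deep learning methods for structured data in two problems. The first problem is the development of a representation learning model that can generate data with a nested structure. This study proposes the Ornstein auto-encoder (OAE), a representation learning model for correlated data. Many real-world datasets are obtained from grouped measurements and therefore have a nested structure. For example, the VGGFace2 data consist of 300 images per person on average. Such data can be viewed as i.i.d. samples of a stationary random process. Based on this view, the OAE method is proposed, which uses an optimal transport distance between two stationary random processes (Ornstein's d-bar distance). The OAE method has a unique feature that distinguishes it from existing conditional models: it can generate new images of an observational unit even when that unit was not used in training. This shows that a framework exploiting the structure of data can improve the generalization performance of deep neural network models. In addition, when the data form an exchangeable sequence, OAE provides a trainable algorithm. The OAE method shows high performance on exemplar generation, style transfer, and unit generation, the tasks used to indicate the performance of generative models. On imbalanced data, it also generates images of units belonging to minority groups more robustly than existing conditional methods.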

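This dissertation also covers the development of a predictive model that can capture the hierarchical correlation structure of microbiome taxonomic abundance data. Microbiome abundance data are an indicator that can predict many diseases, but from a phylogenetic perspective they have a hierarchical correlation structure, which the analysis needs to reflect. DeepBome is a deep-learning-based predictive model that uses phylogenetic information to prevent overfitting of the deep neural network and to explain the relationship between disease and microbiome abundance data. The trained model shows better generalization and explainability than existing deep neural networks. This dissertation describes the development of the DeepBome software within that study. The performance of the developed software is verified through simulation experiments. In both regression and classification problems, DeepBome is shown to outperform existing sparse regression and deep learning methods in both predictive performance and the selection of disease-related microbial taxa. The DeepBome software also provides an explainable visualization of the deep neural network for the relationship between disease and microbiome abundance data.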

Language: eng

URI: https://hdl.handle.net/10371/170742

http://dcollection.snu.ac.kr/common/orgView/000000162969

Files in This Item:

000000162969.pdf 11.83 MB

Appears in Collections:

College of Natural Sciences (자연과학대학)
- Dept. of Statistics (통계학과)
  - Theses (Ph.D. / Sc.D._통계학과)
