Learning from Limited Data: Deep Generative Models and Applications

Abstract: Deep learning has made remarkable progress in automatically extracting valuable insights from data. In real-world applications, however, training data are often scarce and suffer from missing values, class and attribute imbalance, and incomplete labels. These challenges introduce discrepancies between real-world data and the assumptions of deep learning models, necessitating labor-intensive preprocessing steps that rely heavily on domain expertise. It is therefore crucial to develop robust artificial intelligence systems that can effectively learn from limited data.

One promising way to address these challenges is to use deep generative models (DGMs). DGMs employ deep learning to estimate the underlying data distribution and have advanced significantly in recent years. Building on the potential of DGMs, particularly generative adversarial networks (GANs), this dissertation addresses the issues associated with imperfect datasets by devising novel methods and applications for learning from limited data.

Specifically, this dissertation focuses on three key research topics: (1) GANs for real-world classification, (2) GANs for unsupervised conditional generation, and (3) the applications of DGMs in the domain of electronic health records.

In the first research topic, we focus on real-world classification, which aims to develop robust classification models that can cope with missing values, class imbalance, and missing labels in datasets. Previous studies have addressed each of these problems separately by applying machine learning-based preprocessing methods before training classifiers. In contrast, we recast all three problems as "imputation" problems and propose a new GAN-based framework, HexaGAN, that accounts for the interconnections among them.
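As a rough illustration of the "imputation" framing only, the toy sketch below lists what would have to be filled in under each of the three problems: missing feature cells, missing labels, and the number of extra rows each class would need to match the largest class. It is not HexaGAN and is not taken from the dissertation; the function name `imputation_targets`, the use of `None` as the missing marker, and the toy data are all assumptions made for this example.

```python
from collections import Counter

def imputation_targets(rows, labels):
    """List what is 'missing' in a toy dataset under three views.

    Hypothetical helper for illustration; None marks a missing entry.
    """
    # Missing values: individual feature cells that would need filling in.
    missing_cells = [(i, j) for i, row in enumerate(rows)
                     for j, value in enumerate(row) if value is None]
    # Missing labels: rows whose class label would need filling in.
    missing_labels = [i for i, label in enumerate(labels) if label is None]
    # Class imbalance: whole rows that would need to be generated per class
    # so that every class reaches the size of the largest labeled class.
    counts = Counter(label for label in labels if label is not None)
    largest = max(counts.values(), default=0)
    rows_to_generate = {c: largest - n for c, n in counts.items()}
    return missing_cells, missing_labels, rows_to_generate

# Toy data: five rows, two features, binary labels with one label missing.
rows = [[0.2, None], [0.5, 0.1], [None, 0.9], [0.7, 0.3], [0.4, 0.8]]
labels = [0, 0, None, 0, 1]
cells, unlabeled, deficit = imputation_targets(rows, labels)
print(cells, unlabeled, deficit)
```

Viewed this way, the three problems differ mainly in the unit being filled in (a cell, a label, or a whole row), which is one way to read why a single framework might treat them jointly.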

In the second research topic, we focus on unsupervised conditional generation (UCG), which aims to perform conditional generation in a completely unsupervised manner. Despite significant advancements in DGMs, conditional generation still requires a large amount of labeled data, which is often not available in real-world datasets. To address this problem, UCG methods identify salient attributes of a dataset and generate data containing those attributes. However, existing UCG models assume that the attributes are balanced and fail to learn imbalanced attributes. To overcome this limitation, we propose Stein Latent Optimization for GAN (SLOGAN), which can robustly learn datasets with imbalanced attributes.

In the last research topic, we apply DGMs to biomedical data to address issues with imperfect datasets such as missing data, class imbalance, and missing labels. First, we present a DGM that predicts the amyloid positivity of cognitively normal individuals from proxy measures, including structural MRI scans, demographic variables, and cognitive scores, instead of invasive measurements. Our approach not only provides inexpensive, non-invasive, and accurate diagnostics for preclinical Alzheimer's disease, but also meets real-world requirements for the clinical translation of deep learning models, including transferability and interpretability. Second, we construct HexaGAN with a hint mechanism to predict survival and clinical interventions, such as intubation and supplemental oxygen, for COVID-19 patients. Our method outperforms combinations of existing techniques for limited data problems.

Throughout this dissertation, we aim to bridge the gap between deep learning models and real-world applications by focusing on learning from limited data and leveraging the potential of DGMs to address challenges in real-world scenarios. We expect this dissertation to provide valuable insights into DGMs and to contribute to future research on learning from limited data across various fields.
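
Deep learning has made remarkable progress in automatically extracting useful insights from data. In the real world, however, training data are often limited, and problems such as missing values, class and attribute imbalance, and missing labels can arise. These problems create a mismatch between real-world data and the assumptions of deep learning models, and they call for preprocessing steps that require domain knowledge and manual labor. It is therefore important to develop robust artificial intelligence systems that can learn effectively from limited data.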

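To address these problems, we make use of deep generative models (DGMs). DGMs, which estimate the data distribution through deep learning, have advanced considerably in recent years. Centering on generative adversarial networks (GANs), a type of DGM, this dissertation devises and applies new methods for solving problems associated with limited datasets. Its main research topics are GANs for real-world classification, GANs for unsupervised conditional generation, and applications of DGMs to biomedical data.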

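The first research topic is real-world classification, which aims to train robust classifiers that can effectively handle problems such as missing values, class imbalance, and missing labels. Previous studies applied machine learning-based preprocessing methods to resolve each problem before training a classifier. In contrast, we redefine these problems under the single keyword "imputation" and propose HexaGAN, a new GAN-based framework that takes the interconnections among the problems into account.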

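The second research topic focuses on unsupervised conditional generation (UCG), which performs conditional generation from unlabeled data. Despite advances in DGMs, conditional generation still requires a large amount of labeled data, yet real-world datasets are frequently unlabeled. To address this problem, UCG methods have been proposed that identify the salient attributes of the data and generate data containing those attributes. However, existing UCG models assume that attributes are evenly distributed and therefore fail to learn imbalanced attributes. To overcome this limitation, we propose SLOGAN, which can robustly learn imbalanced attributes.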

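In the last research topic, we apply DGMs to biomedical data to address problems such as missing data, class imbalance, and absent labels. First, we propose a DGM that predicts preclinical Alzheimer's disease from measurements such as MRI scans, demographic variables, and cognitive scores instead of invasive measurements. This approach enables inexpensive, non-invasive diagnosis while meeting the requirements for clinical application of deep learning models, including transferability across hospitals and interpretability. Second, we use HexaGAN with a hint mechanism to predict survival and clinical interventions for COVID-19 patients. These methods outperform combinations of existing techniques for limited data problems.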

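By leveraging the potential of DGMs to solve limited data problems in real-world scenarios, this dissertation seeks to narrow the gap between deep learning models and practical applications. We also expect it to provide insight for future research on learning from limited data in a variety of fields.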

Language: eng

URI: https://hdl.handle.net/10371/196437

https://dcollection.snu.ac.kr/common/orgView/000000178246

Files in This Item:

000000178246.pdf 17.06 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Electrical and Computer Engineering (전기·정보공학부)
  - Theses (Ph.D. / Sc.D._전기·정보공학부)
