Data Augmentation and Filtering for Supervised Learning using Splash Data Preprocessor

Abstract: Splash is a graphical user interface (GUI) programming framework designed to support artificial intelligence application development. Through Splash's programming abstraction, experts in various fields, including data, modeling, and control engineers, can develop artificial intelligence applications without profound programming knowledge. To further extend Splash's support for artificial intelligence application development, we add a language construct to Splash for data preprocessing. This language construct provides an easy-to-use data augmenter and data filter, which cover the main data preprocessing tasks of data engineers in supervised learning.
Data augmentation and filtering are particularly important tasks in supervised learning because the quality and quantity of the training dataset directly affect the accuracy of the model. Readily available datasets such as MNIST and datasets labeled by hand have accurate labels but contain a limited number of samples, so they need augmentation to increase their size. When a data labeling platform such as crowdsourcing or an automated labeling program is used to obtain large datasets for training, the datasets need filtering because they often include noisy labels. In this thesis, we implement basic data augmentation and filtering techniques as a Splash language construct, called the data preprocessor, to support data engineers.
The data augmentation function in the Splash data preprocessor increases dataset size using seven augmentation techniques: horizontal shift, vertical shift, horizontal flip, vertical flip, random rotation, random brightness, and random zoom. The data filtering function finds duplicated images, whether they carry the same label or different labels, and removes them to improve the quality of the training dataset. To demonstrate the feasibility of the Splash data preprocessor and to confirm the correctness of its implementation, we trained a model on the CIFAR-10 dataset using the Splash data preprocessor. This experiment shows that training data filtering and augmentation can be easily performed using the Splash data preprocessor.
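As an illustration of the kind of transformation the augmentation function performs, the following is a minimal pure-Python sketch of five of the seven techniques named above (the two shifts, the two flips, and random brightness). It is not the Splash implementation: the function names, the representation of an image as a list of rows of 0-255 grayscale values, and the parameters `max_shift` and `brightness_range` are all assumptions made for this example. Random rotation and random zoom are left out because they need pixel interpolation.

```python
import random


def horizontal_flip(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]


def vertical_flip(img):
    """Reverse the order of the rows."""
    return [row[:] for row in img[::-1]]


def horizontal_shift(img, dx, fill=0):
    """Shift right by dx pixels (left if dx is negative), padding with fill."""
    width = len(img[0])
    out = []
    for row in img:
        if dx >= 0:
            out.append([fill] * min(dx, width) + row[:max(width - dx, 0)])
        else:
            out.append(row[-dx:] + [fill] * min(-dx, width))
    return out


def vertical_shift(img, dy, fill=0):
    """Shift down by dy pixels (up if dy is negative), padding with fill."""
    height, width = len(img), len(img[0])
    if dy >= 0:
        kept = [row[:] for row in img[:max(height - dy, 0)]]
        return [[fill] * width for _ in range(min(dy, height))] + kept
    kept = [row[:] for row in img[-dy:]]
    return kept + [[fill] * width for _ in range(min(-dy, height))]


def adjust_brightness(img, factor):
    """Scale every pixel by factor and clamp the result to 0..255."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in img]


def augment(img, rng, max_shift=1, brightness_range=(0.5, 1.5)):
    """Return five augmented variants of img (hypothetical helper).

    The shift offsets and brightness factor are drawn from rng so that
    repeated calls produce different variants of the same image.
    """
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    factor = rng.uniform(*brightness_range)
    return [
        horizontal_shift(img, dx),
        vertical_shift(img, dy),
        horizontal_flip(img),
        vertical_flip(img),
        adjust_brightness(img, factor),
    ]
```

Calling `augment(image, random.Random(0))` on one labeled image yields five extra training samples that share the original label, which is how augmentation raises dataset size without new labeling effort.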
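Splash is a GUI programming framework created to support artificial intelligence application development. Through its programming abstraction, Splash is designed so that experts in many fields, including data, AI modeling, and control engineers, can use it easily without programming knowledge. To further improve Splash's support for artificial intelligence application development, we added data preprocessing to Splash as a language construct. This language construct supports data filtering and augmentation, two of the data preprocessing tasks that make up the main work of data engineers.
Data filtering and augmentation are particularly important tasks in supervised learning. Supervised learning requires labeled data, but readily available training datasets such as MNIST and datasets labeled by hand are limited in size. Data augmentation is therefore needed to increase the amount of data. When a data labeling platform such as crowdsourcing or an automated labeling program is used to obtain a large dataset, labels are often wrong, so the data must be filtered. In this thesis, we implement in Splash the basic data filtering and augmentation techniques needed for supervised learning so that data engineers can use them easily. The Splash data preprocessing operator filters images by determining whether they are duplicates, and augments images in seven ways. We showed that filtering and augmentation of supervised learning data can be performed easily using the Splash data preprocessing operator.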
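The duplicate-based filtering described in the abstract can be sketched as follows. This is a minimal example under stated assumptions, not the Splash implementation: it treats two images as duplicates only when their raw bytes are identical, it keeps one copy when all duplicates share a label, and it drops every copy when duplicates carry conflicting labels (since the correct label cannot be determined). The abstract does not specify these policies, and the function name `filter_duplicates` is hypothetical.

```python
import hashlib
from collections import defaultdict


def filter_duplicates(samples):
    """Filter a list of (image_bytes, label) pairs.

    Assumed policy for this sketch:
    - byte-identical images with the same label: keep the first copy only
    - byte-identical images with different labels: drop all copies
    """
    # Group sample indices by a hash of the image content.
    groups = defaultdict(list)
    for index, (image, _label) in enumerate(samples):
        groups[hashlib.sha256(image).hexdigest()].append(index)

    keep = set()
    for indices in groups.values():
        labels = {samples[i][1] for i in indices}
        if len(labels) == 1:
            # Unique image or consistent duplicates: keep one representative.
            keep.add(indices[0])
        # Conflicting labels: keep nothing from this group.

    # Preserve the original ordering of the surviving samples.
    return [sample for i, sample in enumerate(samples) if i in keep]
```

Exact byte matching misses near-duplicates such as re-encoded or resized copies; detecting those would require a perceptual similarity measure, which is outside the scope of this sketch.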

Language: eng

URI: https://hdl.handle.net/10371/177565

https://dcollection.snu.ac.kr/common/orgView/000000167685

Files in This Item:

000000167685.pdf 2.98 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Program in Bioengineering (협동과정-바이오엔지니어링전공)
  - Theses (Master's Degree_협동과정-바이오엔지니어링전공)
