Intelligent Data Selection and Semi-Supervised Learning for Support Vector Regression : Support Vector Regression을 위한 지능적 데이터 선택 및 Semi-Supervised Learning

Authors

김동일

Advisor
조성준
Major
College of Engineering, Department of Industrial Engineering
Issue Date
2013-02
Publisher
Seoul National University Graduate School
Keywords
Data Mining; Data Selection; Support Vector Machine; Semi-Supervised Learning
Description
Thesis (Ph.D.) -- Seoul National University Graduate School : Department of Industrial Engineering, 2013. 2. 조성준.
Abstract
Support Vector Regression (SVR), the regression version of Support Vector
Machines (SVM), employs the Structural Risk Minimization (SRM) principle
and has become one of the most prominent algorithms, owing to its capability
of solving nonlinear problems with the kernel trick. Despite its strong
generalization performance, open problems remain for SVR.
In this dissertation, two major open problems of SVR are studied:
(1) training complexity and (2) Semi–Supervised SVR (SS–SVR).
The training complexity of SVR depends strongly on the number
of training data n: the training time complexity is O(n^3) and the training
memory complexity is O(n^2), which makes SVR difficult to apply to large
real–world datasets. Reducing the number of training data is an effective
way to overcome this problem, so in this dissertation a data selection
method, Margin based Data Selection (MDS), was proposed in order to reduce
the training complexity. A data selection
approach is designed to select the important or informative data among all
training data. For SVR, the most important data are the support vectors. By
the ε–insensitive loss function and maximum margin learning, all support vectors of
SVR are located on or outside the ε–tube. Using multiple sample learning,
MDS efficiently estimated the margin for all training data and selected
a subset of the data by comparing the margin with ε. In experiments
conducted on 20 datasets, MDS performed better than the
benchmark methods. The training time of SVR, including the running time of
MDS, was 38% to 67% of the training time on the original datasets, while
the accuracy loss was 0% to 1% relative to the original SVR model.
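
The selection idea described above can be sketched roughly as follows. This is a minimal illustration, not the dissertation's implementation: a one-dimensional least-squares line stands in for the SVR models trained on small samples, the margin is approximated by the average absolute residual over those models, and the names `fit_line`, `estimate_margins`, and `select_by_margin`, as well as the sample count and sample size, are assumptions made for the example. Only the last step, keeping points whose estimated margin lies on or outside the ε–tube, follows the description in the abstract.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares line; a stand-in for an SVR trained on a small sample."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        return 0.0, mean_y  # degenerate sample: fall back to a constant model
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def estimate_margins(xs, ys, n_samples=20, sample_size=10, seed=0):
    """Average absolute residual of each point over models fit on random subsamples."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    totals = [0.0] * len(xs)
    for _ in range(n_samples):
        picked = rng.sample(indices, min(sample_size, len(indices)))
        slope, intercept = fit_line([xs[i] for i in picked], [ys[i] for i in picked])
        for i in indices:
            totals[i] += abs(ys[i] - (slope * xs[i] + intercept))
    return [t / n_samples for t in totals]


def select_by_margin(xs, ys, epsilon, **kwargs):
    """Keep indices whose estimated margin is on or outside the epsilon-tube."""
    margins = estimate_margins(xs, ys, **kwargs)
    return [i for i, m in enumerate(margins) if m >= epsilon]


if __name__ == "__main__":
    rng = random.Random(1)
    xs = [i / 10 for i in range(100)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.5) for x in xs]
    kept = select_by_margin(xs, ys, epsilon=0.5)
    print(f"kept {len(kept)} of {len(xs)} points")
```

The reduced subset would then be used to train the final model. Each cheap model sees only a small sample, so the cost of estimating margins stays well below that of one full fit on all n points.
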
Datasets are growing larger, and data are collected
from various applications. Since collecting labeled data is expensive
and time consuming, the ratio of unlabeled data to labeled
data is increasing. Conventional supervised learning
uses only labeled data for training. Semi–Supervised Learning (SSL)
has been proposed in order to improve on conventional supervised learning
by training on the unlabeled data along with the labeled data. In this dissertation,
a data generation and selection method for SS–SVR training is
proposed. In order to estimate the label distribution of the unlabeled data,
the Probabilistic Local Reconstruction (PLR) method was employed. For
robustness to noisy data, two PLRs (PLRlocal and PLRglobal) were
employed, and the final label distribution was obtained by the conjugation
of the two PLRs (2–PLR). Then, training data were generated from the unlabeled data
together with their estimated label distributions. The data generation rate
varied with the uncertainty of the labeling. After that, MDS was employed to
reduce the training complexity increased by the generated data. In
experiments conducted on 18 datasets, the proposed method improved
accuracy by about 10% over conventional supervised SVR,
and the training time of the proposed method, including the construction
of the final SVR, was less than 25% of that of the benchmark methods.
Two applications are analyzed. For response modeling, an SVR based
two–stage response model, which identifies respondents at the first stage and
then ranks them according to expected profit at the second stage, was
proposed, and MDS was employed in order to reduce its training
complexity. The experimental results showed
that the SVR based two–stage response model could increase profit
over the conventional response model. MDS reduced the training complexity
of SVR to about 60% of that of the original SVR with minimal profit loss.
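
The two–stage structure described above can be illustrated with a short sketch. It assumes the two models are already trained and only shows how their outputs are combined; the `Customer` fields, the function name `two_stage_ranking`, and the `response_threshold` cutoff are hypothetical and not taken from the dissertation.

```python
from typing import NamedTuple


class Customer(NamedTuple):
    customer_id: str
    response_score: float    # stage 1 output (hypothetical): likelihood of responding
    predicted_profit: float  # stage 2 output (hypothetical): e.g. an SVR profit estimate


def two_stage_ranking(customers, response_threshold=0.5):
    """Stage 1: keep likely respondents. Stage 2: rank them by predicted profit."""
    respondents = [c for c in customers if c.response_score >= response_threshold]
    return sorted(respondents, key=lambda c: c.predicted_profit, reverse=True)
```

Filtering before ranking means the second–stage regressor only needs to score the customers that pass the first stage, which is one reason a reduced training set matters for this application.
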
For Virtual Metrology (VM), the proposed SS–SVR method was applied
to a real–world VM dataset, using the unlabeled data together with the labeled
data for training. Data were collected from two pieces of equipment in the
photo process. The experimental results showed that the proposed SS–SVR
method improved accuracy by about 8% on average over the
conventional VM model. The accuracy of the proposed method was better
than that of the benchmark methods, while its training time
was smaller than that of the benchmark methods.
Language
English
URI
https://hdl.handle.net/10371/118235
Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.
