Estimating minority class proportion for class imbalance and class overlap

Authors

선미선

Advisor
조성준
Major
College of Engineering, Department of Industrial Engineering
Issue Date
2015-08
Publisher
Graduate School, Seoul National University
Keywords
Data Mining; Class Imbalance; Class Overlap; Minority Proportion; Calibrated Probability; Prevalence
Description
Thesis (Ph.D.) -- Graduate School, Seoul National University: Department of Industrial Engineering, Data Mining major, August 2015. 조성준.
Abstract
In problems with class imbalance and class overlap, directly applying the predictions of a classification model can cause substantial error. In class imbalance problems, most classification algorithms tend to focus on classifying the majority instances while ignoring or misclassifying the minority instances. Overlap between classes complicates the class imbalance problem even further. When both imbalance and overlap are present and an imperfect classifier can introduce great bias into the predicted classes, an analytical framework that estimates the minority proportion can be a more suitable tool, because it provides more reliable information about the real-world probability that an instance belongs to the minority. Such a method is feasible if, given a training set, it provides a good estimate of these proportions for the test set. Estimation of the minority proportion can be divided into two categories depending on whether the approach is detailed or aggregated: (1) obtaining the calibrated probability and (2) predicting the prevalence. First, this study proposes a robust calibration technique, Receiver Operating Characteristic (ROC) Binning, to obtain the calibrated probability accurately even if the prevalence of minorities differs between the test and training sets. The new technique uses the True Positive Rate (TPR) and False Positive Rate (FPR), which are insensitive to class skew, and directly reflects the prevalence of minorities in the test set. By using an ROC curve, this method separates the nature of the data distribution within a class from the effect of the class prevalence;
thus, it delivers robust performance under class skew. The effectiveness of ROC Binning was verified by evaluating it together with well-known calibration methods in terms of the Brier Score and Calibration Loss. Among the evaluated calibration methods, the proposed ROC Binning technique performed best across all binary-class datasets. As an application, ROC Binning and its modification, TPR Binning, are used to assess real-world Military Personality Inventory (MPI) data of the ROK in an attempt to identify maladjusted conscripts. This process is necessary to determine who among them would qualify for exemption from active military service or need special attention. The MPI presents a class imbalance and overlap problem, in which the majority fulfills active service and the minority is maladjusted, and a conventional classification model is likely to perform poorly. As an alternative, this study applies both ROC and TPR Binning to estimate the calibrated probability, that is, the maladjusted proportion of persons sharing similar MPI test results. The results obtained with the real-world MPI dataset confirmed that the suggested method performs well. Second, this study proposes a simple and effective prevalence estimation method, Similarity-Based Adjusted Count (SAC), to predict prevalence accurately even if the test and training sets have different class distributions. The SAC method is based on the Adjusted Count (AC) method and uses TPR and FPR values of the training instances that are similar to the test instances. The proposed SAC adaptively uses part of the training set to estimate TPR and FPR, whereas conventional AC uses the whole training set for this purpose.
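For reference, the conventional Adjusted Count (AC) correction mentioned above can be sketched as follows. This is a minimal illustration of plain AC only, not the proposed SAC. The function name `adjusted_count` and the example numbers are hypothetical and do not come from the thesis. AC takes the fraction of test instances that the classifier labels positive and corrects it with TPR and FPR estimated on labeled training data.

```python
def adjusted_count(predicted_positive_rate, tpr, fpr):
    """Plain Adjusted Count prevalence estimate (not the thesis's SAC variant).

    predicted_positive_rate: fraction of test instances classified as positive.
    tpr, fpr: true/false positive rates estimated on labeled training data.
    """
    if tpr <= fpr:
        raise ValueError("AC needs tpr > fpr (classifier better than chance)")
    estimate = (predicted_positive_rate - fpr) / (tpr - fpr)
    # Clip because sampling noise can push the raw estimate outside [0, 1].
    return min(1.0, max(0.0, estimate))


# Hypothetical numbers: 20% of test instances are predicted positive.
print(adjusted_count(0.20, tpr=0.80, fpr=0.10))
```

As the abstract describes it, SAC would differ in where `tpr` and `fpr` come from: they would be estimated from the training instances similar to the test instances rather than from the whole training set.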
The effectiveness of the SAC method was verified by evaluating it against other AC-based quantification methods in terms of Absolute Error, using a Support Vector Machine or proportionally weighted k-NN as the base classifier. Among the evaluated prevalence estimation methods, the proposed SAC performed best across all binary-class datasets. In addition, the prevalence estimate was applied as a class prior for obtaining calibrated probabilities and classifying instances with Bayes classifiers. In Bayesian classification algorithms, the prior probability of a class value directly affects the classification decision, and in this study the class prior is intrinsically identical to the class prevalence. The experimental results indicated that, when the positive proportion of the training set differs from that of the test set, using a prevalence estimate yields more accurate calibrated probabilities and class predictions than approaches that rely on the class prevalence of the training set. Additionally, we propose a correlation-based Gaussian Bayesian network (CGBN), a hybrid (filter-wrapper) Bayesian classification algorithm that considers both classification accuracy and the intrinsic dependence between attributes.
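To illustrate why the class prior matters for calibrated probabilities, the sketch below applies the standard Bayes-rule re-weighting of a posterior when the class prior changes. It is a generic prior-shift correction offered as an assumption about how a prevalence estimate could be plugged in, not the thesis's own procedure. The function name `adjust_posterior` and its parameters are hypothetical.

```python
def adjust_posterior(p_train, train_prevalence, test_prevalence):
    """Re-weight a positive-class posterior from the training prior to a new prior.

    p_train: posterior P(positive | x) produced under the training prevalence.
    train_prevalence: positive proportion in the training set (0 < value < 1).
    test_prevalence: estimated positive proportion in the test set (0 < value < 1).
    """
    for name, value in (("train_prevalence", train_prevalence),
                        ("test_prevalence", test_prevalence)):
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must lie strictly between 0 and 1")
    # Scale each class's posterior mass by the ratio of new prior to old prior.
    pos = p_train * test_prevalence / train_prevalence
    neg = (1.0 - p_train) * (1.0 - test_prevalence) / (1.0 - train_prevalence)
    return pos / (pos + neg)


# Hypothetical case: balanced training set, minority prevalence of 10% at test time.
print(adjust_posterior(0.5, train_prevalence=0.5, test_prevalence=0.1))
```

When the two prevalences are equal, the posterior is returned unchanged, which matches the abstract's point that the benefit appears only when the training and test proportions differ.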
Language
English
URI
https://hdl.handle.net/10371/118245