Detailed Information

Estimating minority class proportion for class imbalance and class overlap : 클래스 불균형과 오버랩에서의 소수 클래스 비율 추정

DC Field Value Language
dc.contributor.advisor 조성준 -
dc.contributor.author 선미선 -
dc.date.accessioned 2017-07-13T06:04:12Z -
dc.date.available 2017-07-13T06:04:12Z -
dc.date.issued 2015-08 -
dc.identifier.other 000000049829 -
dc.identifier.uri https://hdl.handle.net/10371/118245 -
dc.description Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Industrial Engineering, Data Mining major, 2015. 8. 조성준. -
dc.description.abstract In class imbalance and class overlap problems, directly applying the predictions of a classification model may cause substantial error. Most classification algorithms tend to focus on classifying the majority instances while ignoring or misclassifying the minority instances in class imbalance problems. Moreover, additional challenges, such as overlap between classes, complicate the class imbalance problem even further. When classes are both imbalanced and overlapping, so that an imperfect classification task can cause great bias in predicting classes, an analytical framework capable of estimating the minority proportion can serve as a more suitable tool, because it provides more reliable information about the real-world probability of an instance belonging to the minority. A method for estimating minority proportions is feasible if, given a training set, it provides a good estimate of these proportions for the test set. The estimation of the minority proportion can be divided into two categories depending on whether the approach is detailed or aggregated: (1) obtaining the calibrated probability and (2) predicting the prevalence.
First, this study proposes a robust calibration technique, Receiver Operating Characteristic (ROC) Binning, to obtain the calibrated probability accurately even if the prevalence of the minority class differs between the test and training sets. The new technique uses the True Positive Rate (TPR) and False Positive Rate (FPR), which are insensitive to class skew, and directly reflects the prevalence of the minority class in the test set. This method uses an ROC curve to separate the nature of the data distribution within a class from the effect of the class prevalence; thus, it delivers robust performance under class skew. The effectiveness of the ROC Binning technique was verified by evaluating it together with well-known calibration methods in terms of the Brier Score and Calibration Loss. Among the evaluated calibration methods, the proposed ROC Binning technique was the front-runner across all binary-class datasets. As an application, ROC Binning and its modification, TPR Binning, are used to assess real-world Military Personality Inventory (MPI) data of the ROK in an attempt to identify conscripts who are maladjusted. This process is necessary to determine who among them would qualify for exemption from active military service or need special attention. The MPI data present a class imbalance and overlap problem, in which the majority fulfills active service and the minority is maladjusted, so a conventional classification model is likely to perform poorly. As an alternative, this study applies both ROC and TPR Binning to estimate the calibrated probability, that is, the maladjusted proportion among persons sharing similar MPI test results. The results obtained with the real-world MPI dataset confirmed that the suggested method performs well.
Second, this study proposes a simple and effective prevalence estimation method, Similarity-Based Adjusted Count (SAC), to predict prevalence accurately even if the test and training sets have different class distributions. The SAC method is based on the Adjusted Count (AC) method and uses the TPR and FPR values of the training instances that are similar to the test instances. The proposed SAC adaptively uses part of the training set to estimate TPR and FPR, whereas conventional AC uses the whole training set for this purpose.
The effectiveness of the SAC method was verified by evaluating it against other AC-based quantification methods, using a Support Vector Machine or proportionally weighted k-NN as the base classifier, in terms of Absolute Error. Among the evaluated prevalence estimation methods, the proposed SAC stood out as the front-runner across all binary-class datasets. In addition, the prevalence estimate was applied as a class prior for obtaining calibrated probabilities and classifying instances with Bayes classifiers. In Bayesian classification algorithms, the prior probability of a class value directly affects the classification decision. The class prior is intrinsically identical to class prevalence in this study. The experimental results indicated that, when the positive proportion of the training set differs from that of the test set, using a prevalence estimate yields more accurate calibrated probabilities and class predictions than approaches that rely on the class prevalence of the training set. Additionally, we proposed a correlation-based Gaussian Bayesian network (CGBN), a hybrid (filter-wrapper) Bayesian classification algorithm that considers both classification accuracy and the intrinsic dependence between attributes. -
dc.description.tableofcontents Abstract i
Contents viii
List of Tables x
List of Figures xiii
Chapter 1 Introduction 1
1.1 Class imbalance and class overlap problem 4
1.2 Estimating minority class proportion 7
1.3 Overview of this dissertation 11
1.4 Structure of this dissertation 14
Chapter 2 ROC Binning method 17
2.1 Background 17
2.2 Related work 21
2.2.1 Calibration method based on the existing classifier 21
2.2.2 Evaluation measures for calibration methods 24
2.2.3 ROC curve 26
2.3 ROC Binning 30
2.4 Performance on benchmark datasets 37
2.4.1 Experiment settings 37
2.4.2 Experiment results 45
2.5 Summary 57
Chapter 3 Application of ROC-based Binning for predicting maladjusted soldiers 61
3.1 Background 61
3.2 Military Personality Inventory Data 63
3.3 Performance of classification algorithms using the MPI dataset 67
3.4 Use of ROC curve to obtain calibrated probability 69
3.4.1 Generation of ROC curve 70
3.4.2 Obtaining calibrated probability 72
3.5 Summary 84
Chapter 4 Similarity-based Adjusted Count (SAC) method 87
4.1 Background 87
4.2 Related work 90
4.3 Similarity-based adjusted count 95
4.4 Performance on benchmark datasets 101
4.4.1 Experiment settings 101
4.4.2 Experiment results 103
4.5 Summary 113
Chapter 5 Application of prevalence estimate into Bayesian classifier 117
5.1 Background 117
5.2 Related work 119
5.2.1 BN learning from the undirected graphical model 119
5.2.2 Filter and wrapper approaches 126
5.2.3 Covariance constraint for Gaussian distribution 128
5.3 Finding the best CGBN 130
5.3.1 Step 1: Calculate R2c to measure the dependence 131
5.3.2 Step 2: Generate CGBNs using SLHC based on R2c 134
5.3.3 Step 3: Search for bCGBN using z-SEI and FS 136
5.3.4 Applying BSE to bCGBN 139
5.4 Performance on benchmark datasets 141
5.4.1 Experimental settings 141
5.4.2 Experimental results 142
5.5 Summary 144
Chapter 6 Conclusion 147
6.1 Contributions 147
6.2 Future work 151
Bibliography 155
Appendix A: First and second PCs of the benchmark datasets 175
Appendix B: BS distribution per binary dataset by changing prevalence 183
Appendix C: AE distribution per binary dataset by changing prevalence 191
Appendix D: Application of SAC into MPI dataset 199
Abstract in Korean 203
Acknowledgments 207
-
dc.format application/pdf -
dc.format.extent 12305793 bytes -
dc.format.medium application/pdf -
dc.language.iso en -
dc.publisher Seoul National University Graduate School -
dc.subject Data Mining -
dc.subject Class Imbalance -
dc.subject Class Overlap -
dc.subject Minority Proportion -
dc.subject Calibrated Probability -
dc.subject Prevalence -
dc.subject.ddc 670 -
dc.title Estimating minority class proportion for class imbalance and class overlap -
dc.title.alternative 클래스 불균형과 오버랩에서의 소수 클래스 비율 추정 -
dc.type Thesis -
dc.contributor.AlternativeAuthor Sun, Meesun -
dc.description.degree Doctor -
dc.citation.pages xiii, 205 -
dc.contributor.affiliation College of Engineering, Department of Industrial Engineering -
dc.date.awarded 2015-08 -
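
Note on the abstract: two ideas it names, the conventional Adjusted Count (AC) estimate and the use of a prevalence estimate as a class prior, can be sketched roughly as below. This is only an illustration under stated assumptions, not the thesis's implementation. It uses the textbook AC correction (observed positive rate adjusted by TPR and FPR) and the standard prior-shift re-weighting of a posterior. The similarity-based selection of training instances that distinguishes SAC is not shown. The function names and the numbers are hypothetical.

```python
def adjusted_count(predicted_positive_rate, tpr, fpr):
    """Textbook Adjusted Count: correct the raw positive rate with TPR and FPR.

    predicted_positive_rate is the fraction of test instances the classifier
    labels positive; tpr and fpr are estimated on labeled training data.
    """
    if tpr <= fpr:
        raise ValueError("the correction needs a classifier with TPR > FPR")
    estimate = (predicted_positive_rate - fpr) / (tpr - fpr)
    # Clip to [0, 1], since noisy TPR/FPR estimates can push the value outside.
    return min(1.0, max(0.0, estimate))


def shift_prior(posterior, train_prevalence, test_prevalence):
    """Re-weight P(minority | x) from the training prior to a new class prior."""
    pos = posterior * test_prevalence / train_prevalence
    neg = (1.0 - posterior) * (1.0 - test_prevalence) / (1.0 - train_prevalence)
    return pos / (pos + neg)


if __name__ == "__main__":
    # Hypothetical values: TPR and FPR from training data, raw rate from the test set.
    tpr, fpr = 0.80, 0.10
    raw_positive_rate = 0.17
    prevalence = adjusted_count(raw_positive_rate, tpr, fpr)
    print("estimated test prevalence:", round(prevalence, 3))

    # A score of 0.5 from a model trained on balanced data, moved to the new prior.
    print("re-weighted posterior:", round(shift_prior(0.5, 0.5, prevalence), 3))
```

With these made-up inputs the raw positive rate of 0.17 is consistent with a true minority prevalence near 0.10, because 0.10 × 0.80 + 0.90 × 0.10 = 0.17. The AC correction inverts that relationship.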
Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.