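A Study on Predicting Students Below Basic Academic Achievement Using a Deep Neural Network Model (심층인공신경망 모형을 활용한 기초학력 미달 학생 예측연구)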

Abstract: Low academic achievers are students who fall below the basic level of academic ability, and their learning deficits tend to accumulate and deepen over time. Accumulated learning deficits can cause negative emotions toward studying or make it difficult to adapt to school life. Because the causes of learning deficits are diverse and complex, both preventive and individualized support is needed to give low academic achievers practical help. If a system that predicts the future academic performance of individual students can be established, it would help provide educational support before problems arise or at an early stage.
However, previous studies that predicted the academic achievement of low academic achievers have focused on examining the relationship between educational context variables and academic achievement rather than on predicting the achievement of individual students. This is because educational research has aimed to diagnose the current situation by identifying the major variables that affect academic achievement.
In addition, few of the studies that predicted academic achievement focused on low academic achievers. Many studies instead investigated variables affecting the rate of low academic achievement at the school or office of education level, or analyzed below-basic students together with other low-achieving students who were not below basic. One reason is that low academic achievers make up a very small proportion of all students, so the data are statistically imbalanced. As a result, it is not easy to meet basic statistical assumptions, and it is difficult to secure enough cases for analysis.
A Deep Neural Network (DNN) model can be applied to predict low academic achievers. A DNN does not assume a specific distribution for the data; it predicts each case by learning the patterns in the analyzed data. In addition, sampling or cost-sensitive techniques can be applied while training the predictive model so that information on the minority group is not lost.
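As a minimal sketch of what a cost-sensitive technique can look like, the following weights each case's binary cross-entropy loss by the inverse frequency of its class. The abstract does not state which weighting scheme was used, so the "balanced" heuristic n_samples / (n_classes * class_count), the function names, and the toy data are all assumptions made for illustration.

```python
import math

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count) (assumed heuristic)."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return {c: n / (len(counts) * k) for c, k in counts.items()}

def weighted_bce(labels, probs, weights, eps=1e-12):
    """Class-weighted binary cross-entropy, averaged over all cases."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += weights[y] * loss
    return total / len(labels)

# Hypothetical toy data: 1 = low academic achiever (minority), 0 = everyone else.
labels = [0] * 9 + [1]
weights = balanced_class_weights(labels)

# A model that scores every student as very unlikely to be a low achiever.
probs = [0.05] * 10
unweighted = weighted_bce(labels, probs, {0: 1.0, 1: 1.0})
weighted = weighted_bce(labels, probs, weights)
print(weights, unweighted, weighted)
```

With these weights the single missed minority case dominates the loss, so a model that ignores the minority group is penalized much more heavily than under the unweighted loss.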
Therefore, this study aimed to explore a DNN model that predicts low academic achievers, who belong to a minority group, using education panel data that have been systematically collected over several years. Specifically, the Korean Education Longitudinal Study 2013 was used, and models were built to predict low academic achievers in Korean, English, and mathematics in the third year of middle school from student information collected in the second year of middle school.
Although a DNN can use many variables as predictors, including variables unrelated to academic achievement can lower the performance of the predictive model. Considering this, before constructing the DNN predictive model, we identified the variables most strongly related to academic achievement among the 361 available variables. Based on these results, we examined the predictive models' performance while increasing the number of predictors.
Next, to construct the predictive model, prediction performance was compared across methods of reflecting minority group information as much as majority group information during training, and across DNN types. Random oversampling, Borderline-Synthetic Minority Over-sampling Technique (B-SMOTE), Adaptive Synthetic sampling approach (ADASYN), and a cost-sensitive method were applied to reflect minority group information. Multi-Layer Perceptron (MLP) and Convolutional Neural Network (CNN) were considered as DNN types. Based on the comparison results, the final predictive model for each subject was identified, and ways to improve the predictive models' performance were sought.
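Of the sampling methods listed, random oversampling is the simplest to illustrate: minority-class cases are duplicated at random until the classes are the same size. The sketch below is a standard-library illustration only; the function name, seed, and toy rows are assumptions, and it is not the implementation used in the thesis.

```python
import random

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for y, group in by_class.items():
        # Draw with replacement from the class's own rows to fill the gap.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(y)
    return out_rows, out_labels

# Hypothetical toy data: five majority cases (0) and one minority case (1).
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
labels = [0, 0, 0, 0, 0, 1]
new_rows, new_labels = random_oversample(rows, labels)
print(new_labels.count(0), new_labels.count(1))
```

Oversampling of this kind is normally applied to the training split only, so that evaluation data keep the original class ratio. B-SMOTE and ADASYN differ in that they synthesize new minority cases by interpolating between neighbors rather than copying existing ones.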
The main research results are as follows.
First, significant explanatory variables were identified using the Least Absolute Shrinkage and Selection Operator (LASSO), which can identify the variables related to the dependent variable among many candidates while estimating the coefficient of each predictor. The analysis found that, out of a total of 361 variables, 91 variables in Korean, 84 in English, and 83 in mathematics were related to academic achievement in the respective subject. However, an inspection of the coefficients of the selected variables confirmed that variables judged to be only weakly related to the dependent variable were still included. Therefore, the predictive models were constructed by increasing the number of predictors, starting from the variables with the largest absolute coefficients.
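The selection step described here, keeping the variables LASSO did not shrink to zero and ranking them by absolute coefficient, can be sketched as follows. The variable names and coefficient values are hypothetical and stand in for the output of a fitted LASSO model; only the ranking logic is illustrated.

```python
def top_k_by_abs_coef(coefs, k):
    """Return up to k variable names with nonzero coefficients, largest |coefficient| first."""
    selected = [(name, c) for name, c in coefs.items() if c != 0.0]
    selected.sort(key=lambda item: abs(item[1]), reverse=True)
    return [name for name, _ in selected[:k]]

# Hypothetical LASSO coefficients (a zero means the variable was dropped by LASSO).
coefs = {
    "prior_score": 0.62,
    "study_hours": 0.08,
    "self_efficacy": -0.31,
    "commute_time": 0.0,
    "class_attitude": 0.15,
}

# Nested predictor sets of increasing size, one candidate model per set.
nested_sets = {k: top_k_by_abs_coef(coefs, k) for k in (1, 2, 4)}
print(nested_sets)
```

Each larger set contains the smaller ones, so comparing model performance across the sets shows how adding weaker predictors changes the results.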
Next, to predict low academic achievers in Korean, English, and mathematics, DNN prediction models were implemented according to the number of predictors, the method of reflecting minority group information, and the type of DNN. The results are as follows. First, in all subjects, when the number of predictors increased beyond a certain level, the performance of the predictive model tended to decrease. The decrease was relatively large in Korean, where the degree of imbalance is large. This demonstrates the need to select appropriate predictors carefully when constructing a DNN prediction model.
Second, the Original model, which did not adjust the degree of imbalance, was not suitable for predicting low academic achievers. In Korean, almost all students were predicted to be non-low academic achievers. By comparison, when the DNN model was trained so that minority group information was reflected as much as majority group information, AUC, F2-score, G-mean, and sensitivity tended to increase. In addition, when performance was compared across the methods of reflecting minority group information, models using the cost-sensitive method performed better overall than models using the sampling methods. In particular, as the number of predictors increased to 49 or more, the difference between the sampling and cost-sensitive methods grew. This confirms that a method of adjusting the degree of imbalance is needed when predicting imbalanced data, and suggests that a cost-sensitive method can be applied when implementing a prediction model on imbalanced data.
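To make the evaluation measures concrete, the sketch below computes accuracy, sensitivity, specificity, F2-score, and G-mean from hard 0/1 predictions, using their standard definitions (F-beta with beta = 2, and the geometric mean of sensitivity and specificity). AUC is omitted because it requires predicted scores rather than labels. The class ratio and predictions are hypothetical and are not the thesis data.

```python
import math

def imbalance_metrics(y_true, y_pred, beta=2.0):
    """Confusion-matrix metrics for a binary task where 1 is the minority class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    b2 = beta * beta
    denom = b2 * precision + sensitivity
    f_beta = (1 + b2) * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_beta": f_beta,
        "g_mean": math.sqrt(sensitivity * specificity),
    }

# Hypothetical data: 95 majority cases (0) and 5 minority cases (1).
y_true = [0] * 95 + [1] * 5

# A model that predicts the majority class for everyone.
all_majority = imbalance_metrics(y_true, [0] * 100)

# A model that finds 4 of 5 minority cases at the cost of 19 false alarms.
balanced_pred = [1] * 19 + [0] * 76 + [1] * 4 + [0] * 1
balanced = imbalance_metrics(y_true, balanced_pred)
print(all_majority)
print(balanced)
```

The all-majority model reaches an accuracy of 0.95 while its sensitivity, F2-score, and G-mean are all zero, which is why accuracy alone is a poor guide on imbalanced data.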
Third, the prediction performance of the MLP model was better than that of the CNN model, although the difference could not be judged to be considerable. In the CNN model, however, performance degraded less as the number of predictors increased, regardless of whether a sampling or cost-sensitive method was used. The CNN model can therefore be considered when many predictors are used, but given the characteristics of panel data and the difficulty of configuring a CNN model, the study argued that the MLP model could be applied first.
Finally, in all subjects, the MLP model that used 25 predictors and applied the cost-sensitive method showed the best performance. However, with an accuracy of 0.704 to 0.768 and a sensitivity of 0.703 to 0.762, the prediction performance could not be judged sufficient for practical use of the model. On this basis, we made suggestions for the future use of and research on DNN models, along with matters to consider when applying DNN models to panel data.
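The learning deficits of students below the basic level of academic achievement (hereafter, below-basic students) tend to deepen with each school year, and accumulated learning deficits can lead to negative emotions toward studying or make it difficult to adapt to school life. In addition, because the causes of learning deficits are highly diverse and complex, giving below-basic students practical help requires individualized support together with preventive support. Therefore, if a system that can predict the future academic achievement of individual students can be built, it can be expected to help provide educational support before a problem arises or at an early stage.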
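However, previous studies that predicted the academic achievement of underachieving students tended to specify a model that predicts academic achievement from educational context variables and to focus on the relationships between the predictors and academic achievement in that model, rather than on predicting the achievement of individual students. This is because educational research has regarded diagnosing the current situation and providing educational support as more important than predicting individual students, and because the purpose of such research was to draw implications for curriculum operation and policy by identifying the major variables that affect academic achievement.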
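In addition, among studies that predicted students' academic achievement, relatively few focused on below-basic students. Studies either examined variables affecting the below-basic rate at the school or office of education level, or analyzed low-achieving students together with below-basic students. This is because below-basic students make up a very small proportion of all students and, statistically, constitute imbalanced data. As a result, basic statistical assumptions are not easy to meet and it is difficult to secure enough cases for analysis, so research focused on below-basic students has not been actively conducted.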
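A deep neural network model can therefore be considered as a research method for building a prediction model for below-basic students. A deep neural network model is a method whose purpose is to specify a model that accurately predicts individual students; it does not assume a particular distribution for the data and builds the prediction model by learning the patterns found in the given data. It is also a method to which various techniques can be applied during training so that minority group information is not lost because of the degree of imbalance.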
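Meanwhile, education panel data, which are widely used in research on academic achievement, contain information on a variety of variables related to students' learning environments, and because the same students are followed every year, they also contain information on change. In education, it is not easy to use assessment results, counseling records, or voice and text data from classroom situations, for reasons such as the protection of personal information. Panel data, by comparison, have been built up over several years at the national or provincial level to track students' developmental information and to use it in educational administration. It is therefore worth exploring whether a prediction model for below-basic students can be built from education panel data.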
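The purpose of this study was to explore a deep neural network model that predicts below-basic students, who belong to a minority group, using education panel data that have been systematically collected over several years. The Korean Education Longitudinal Study 2013 (한국교육종단연구2013) was used for the analysis, and the study sought to implement models that use information from the second year of middle school to predict below-basic status in Korean, English, and mathematics in the third year of middle school.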
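Before building the actual deep neural network prediction model, the study took into account that, although a deep neural network can use a large number of variables in a single model, using variables unrelated to academic achievement can instead lower the performance of the prediction model. Accordingly, the variables strongly related to academic achievement, and the strength of those relationships, were identified among the 361 available variables, and based on these results model performance was examined while increasing the number of predictors.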
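Next, in building the prediction model for below-basic students, prediction performance was compared across methods that reflect minority group information as much as majority group information during training and across types of deep neural network. Specifically, random oversampling, Borderline-SMOTE, ADASYN, and a cost-sensitive method were applied as methods of reflecting minority group information, and the prediction models were implemented as a multi-layer perceptron (hereafter, MLP) and a convolutional neural network (hereafter, CNN). Based on the comparison, the final prediction model for each subject was identified, and ways to improve the performance of the below-basic prediction model were sought.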
The main results are summarized as follows.
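First, the main explanatory variables were identified with LASSO (Least Absolute Shrinkage and Selection Operator), an analysis method that can identify the variables related to the dependent variable among many variables while also estimating the coefficient of each variable. Of the 361 variables in total, 91 in Korean, 84 in English, and 83 in mathematics were found to be explanatory variables related to academic achievement in the respective subject. However, an inspection of the coefficients of the selected variables showed that variables judged to be only weakly related to the dependent variable were still included. The prediction models were therefore built by increasing the number of variables used for prediction, starting from the variables with the largest absolute coefficients.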
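Next, to predict below-basic students in Korean, English, and mathematics, which differ in their below-basic rates, deep neural network prediction models were specified according to the number of predictors, the method of reflecting minority group information, and the type of deep neural network, and their performance was examined. The results are as follows. First, in all subjects, the performance of the prediction model tended to decline once the number of predictors increased beyond a certain level, and the decline was relatively large in Korean. This shows the need to consider the selection of appropriate predictors when building a deep neural network prediction model.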
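Second, the Original model, in which the degree of imbalance was not adjusted, was better suited to predicting students at or above the basic level than to predicting below-basic students. In Korean in particular, where the imbalance between groups is large, most students were predicted to be at or above the basic level. By comparison, when minority group information was reflected as much as majority group information during training, AUC, F2-score, G-mean, and sensitivity tended to rise. When performance was compared across the methods of reflecting minority group information, models using the cost-sensitive method tended to perform better than those using the sampling methods, and the gap widened as the number of predictors grew to 49 or more, a pattern that was similar in all subjects. The study thus confirmed that a method of adjusting the degree of imbalance needs to be applied when predicting imbalanced data, and proposed applying the cost-sensitive method in such prediction.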
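Third, the MLP model predicted better than the CNN model, although the difference could not be judged to be large. In the CNN model, however, the decline in performance as the number of predictors increased tended to be small regardless of the method of reflecting minority group information. On this basis, the study argued that the CNN model can be considered when many predictors are used, but that the MLP model can be applied first given the characteristics of panel data, the limits on model specification that depend on the type of deep neural network, and the data transformation involved.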
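Finally, in all subjects, the MLP model that used 25 predictors and applied the cost-sensitive method performed best. However, with an accuracy of 0.704 to 0.768 and a sensitivity of 0.703 to 0.762, it did not show prediction performance sufficient for practical use. On this basis, the study presented considerations for deep neural network models using panel data, along with suggestions for the future use of and research on deep neural network models.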

Language: kor

URI: https://hdl.handle.net/10371/188073

https://dcollection.snu.ac.kr/common/orgView/000000173338

Files in This Item:

000000173338.pdf 2.54 MB

Appears in Collections:

College of Education (사범대학)
- Dept. of Education (교육학과)
  - Theses (Ph.D. / Sc.D._교육학과)
