A Study on Automated Scoring of English Essays Using Machine Learning Techniques (머신 러닝 기법을 활용한 영어 에세이 자동채점 방안 연구)

Abstract: Machine learning is a technique for making predictions about new data by extracting hidden rules or patterns from existing data, and it has received increasing attention in many areas. In education, it has recently been applied to education-related data in areas such as education policy, educational outcomes, student motivation, and the prediction of academic achievement.
With advances in data collection, computing, and information processing technology, the development of automated scoring models for English essays using machine learning techniques has been discussed continuously. English essays have the advantage of allowing assessors to evaluate a variety of factors, such as learners' ability to use language, cognitive ability, and argumentation ability, through a single piece of writing. However, compared with short-answer questions, essays cause greater assessor fatigue, and training scorers takes considerable cost and time. Accordingly, research on automating the scoring of English essays has been conducted, but mainly by assessment research institutes such as ETS (Educational Testing Service). As a result, information on automated English essay scoring is limited, and systematic, specific research is needed before it can be used more widely, for example in schools and college admissions.
This study explores methods for automatically scoring English essays with machine learning techniques. The main research questions are as follows.

First, what feature components make up a model for automated English essay scoring?
Second, how does the predictive power of the automated scoring model for grades (or scores) vary with the characteristics of the English essay data?
Third, how does the predictive power of the automated scoring model vary with the number of classes of English essay grades (or scores)?
Fourth, how does the predictive power of the automated scoring model vary with the scoring features entered into the model?

The data used in this study are YELC (Yonsei English Learners' Corpus) and ASAP (Automated Student Assessment Project). YELC consists of English essays written by Korean college freshmen, each on one of six randomly assigned topics (3,236 essays in total, including final scores). The ASAP essays were written by 8th-grade students in the United States, all on the same topic (1,783 essays, including scoring results). The main findings are as follows.
First, the feature components of the automated scoring model for English essays were classified into two levels: the word-sentence level and the paragraph-or-higher level. The word-sentence level comprises lexical diversity, lexical sophistication, and grammatical and mechanical errors. The paragraph-or-higher level comprises essay length, similarity to high-scoring answers, and readability. For each of these sub-areas, specific textual features of the essay were selected on the basis of a literature review of the textual qualities that affect an essay's final grade (score). The individual features were then extracted using natural language processing and text analysis techniques.
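As a rough illustration of what extracting such textual features can look like, the sketch below computes three simple surface measures for one essay: word count (essay length), type-token ratio (a common lexical diversity measure), and mean sentence length (a crude stand-in for readability). The abstract does not list the thesis's actual features, so the function name `extract_features`, the tokenization rules, and the choice of measures are all assumptions made for this example.

```python
import re


def extract_features(essay: str) -> dict:
    """Compute a few illustrative surface features for one essay.

    These are hypothetical examples, not the features used in the thesis.
    """
    # Naive tokenization: lowercase runs of letters (apostrophes kept).
    words = re.findall(r"[a-z']+", essay.lower())
    # Naive sentence split on terminal punctuation; drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    n_words = len(words)
    return {
        # Paragraph-or-higher level: essay length in words.
        "essay_length": n_words,
        # Word-sentence level: distinct words divided by total words.
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
        # Crude readability proxy: average number of words per sentence.
        "mean_sentence_length": n_words / len(sentences) if sentences else 0.0,
    }


sample = "The cat sat. The cat ran!"
print(extract_features(sample))
```

A real pipeline would likely rely on a proper tokenizer, a grammar checker for error counts, and an established readability index, but the overall shape (one essay in, one row of numeric features out) would be similar.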
Second, the predictive power of the automated scoring model differed according to the characteristics of the essay data, as did the importance of the scoring features. In the random forest analysis, when neither adjustment of the number of scoring categories nor sampling techniques were applied, accuracy was 0.41 for YELC and 0.53 for ASAP. In the feature importance analysis, features related to grammatical and mechanical errors, readability, and lexical sophistication showed high importance for YELC, whereas features related to readability, essay length, and lexical diversity showed high importance for ASAP.
Third, the analysis of predictive power by number of classes showed that the model's predictive power increased as the number of classes decreased. For YELC, the value was 0.41 with 7 scoring categories; it rose as the number of categories decreased, reaching 0.75 with 2 categories. For ASAP, the value was 0.53 with 10 scoring categories; it likewise rose as the number of categories decreased, reaching 0.82 with 2 categories.
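One way to see why fewer categories tend to give higher accuracy is to merge adjacent score bands and re-score the same predictions: near-misses on the fine scale often fall into the same coarse band. The sketch below is only an illustration. The abstract does not say how the thesis reduced the number of categories, so the cut-point approach, the function names, and the score lists are all hypothetical and are not data from the study.

```python
from bisect import bisect_right
from typing import Sequence


def merge_classes(score: int, cut_points: Sequence[int]) -> int:
    """Map a fine-grained score to a coarser band index.

    cut_points holds, in ascending order, the lowest score of every band
    after the first; e.g. [4] splits scores into "below 4" and "4 or above".
    """
    return bisect_right(cut_points, score)


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of predictions that exactly match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical 7-point scores and model predictions (invented for this sketch).
y_true = [1, 2, 3, 4, 5, 6, 7, 4]
y_pred = [2, 2, 4, 4, 6, 6, 7, 5]

# Collapse both lists to two bands using a single assumed cut point at 4.
binary_true = [merge_classes(s, [4]) for s in y_true]
binary_pred = [merge_classes(s, [4]) for s in y_pred]

print("7 classes:", accuracy(y_true, y_pred))
print("2 classes:", accuracy(binary_true, binary_pred))
```

In this toy data most errors are off by one point, so merging bands turns many of them into correct predictions; the size of the gain in practice depends on where the cut points fall relative to the model's errors.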
Fourth, the analysis of predictive power by scoring features showed little difference in overall predictive power. However, the imbalance in predictive power across classes was somewhat reduced when both word-sentence-level features and paragraph-or-higher-level features were entered. For example, a scoring class that the model failed to predict when only word-sentence-level features were entered showed improved predictive power once paragraph-or-higher-level features were added. This suggests that adding the higher-level features helps reduce the imbalance in predictive power between scoring categories.
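An imbalance of this kind is usually invisible in overall accuracy and only shows up when performance is broken down by class. The sketch below computes per-class recall (the share of essays in each true class that the model labeled correctly). The abstract does not state which per-class measure the thesis used, so recall is an assumption here, and the labels and predictions are invented for illustration.

```python
from collections import Counter
from typing import Dict, Sequence


def per_class_recall(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """For each true class, the fraction of its items predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    # Counter returns 0 for classes that were never predicted correctly.
    return {label: hits[label] / totals[label] for label in sorted(totals)}


# Hypothetical labels: the model never predicts the "low" class.
y_true = ["low", "low", "mid", "mid", "mid", "mid", "high", "high"]
y_pred = ["mid", "mid", "mid", "mid", "mid", "mid", "high", "mid"]

print(per_class_recall(y_true, y_pred))
```

Here overall accuracy is 5 out of 8, yet recall for the "low" class is zero. This is the kind of failed class that the finding above describes, and comparing such a breakdown before and after adding features is one way to check whether the imbalance has narrowed.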
To summarize, this study proposes specific feature components for an automated scoring model for English essays and analyzes the predictive power of a machine-learning-based scoring model according to item characteristics, the number of scoring classes, and scoring features. It is meaningful in that it suggests a direction for the development of automated English essay scoring. The features of the scoring model and the machine learning analysis techniques presented here can be used in further related studies. However, because this study primarily proposes an approach to automated scoring of English essays, it needs to be refined through improvements to the features and machine learning techniques and through expansion of the target data.
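Abstract (translated from Korean): Machine learning is a technique that extracts hidden rules or patterns from data in order to make predictions about new data, and interest in it continues to grow across fields. In education as well, it has recently been applied to education-related data in areas such as education policy, educational outcomes, student motivation, and the prediction of academic achievement.
As data collection, computing, and information processing technologies have advanced, the development of automated scoring models for English essays using machine learning techniques has been discussed continuously. English essays have the advantage that a single piece of writing allows the evaluation of various factors, including the learner's integrated language ability as well as thinking and argumentation skills. However, compared with short-answer items, they cause greater scorer fatigue, and scorer training requires considerable cost and time. Research on automating English essay scoring has therefore been conducted, but mainly at the level of assessment research institutes such as ETS (Educational Testing Service). Consequently, information on automated English essay scoring is limited, and systematic, specific research is needed for it to be used in a wider range of settings such as schools and college admissions.
This study explores methods for automated scoring of English essays using machine learning techniques, and the main research questions are as follows.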

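First, what are the feature components of a model for automated English essay scoring?
Second, how does the automated scoring model's predictive power for examinees' grades (or scores) vary with the characteristics of the English essay data?
Third, how does the automated scoring model's predictive power for examinees' grades (or scores) vary with the number of categories of English essay grades (or scores)?
Fourth, how does predictive power for examinees' grades (or scores) vary with the scoring features entered into the automated English essay scoring model?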

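The data used in this study are YELC (Yonsei English Learners Corpus) and ASAP (Automated Student Assessment Project). YELC consists of English essays written by first-year students at Korean universities (3,236 essays in total, including scoring results), each written on a topic randomly assigned from six topics. ASAP consists of essays written by 8th-grade students in the United States (1,783 essays in total, including scoring results), all written on the same topic with reference to a provided passage. The main findings are as follows.
First, the feature components of the automated English essay scoring model were classified into the word-sentence level and the paragraph-or-higher level. The word-sentence level was organized into the sub-areas of lexical diversity, lexical sophistication, and grammatical and mechanical errors, and the paragraph-or-higher level into essay length, similarity to high-scoring answers, and readability; textual features of essays belonging to each sub-area were then selected. To this end, the characteristics of the target data were examined, the levels and sub-areas of essay text were defined and classified, and, for each sub-area, features that could affect essay scoring results were selected on the basis of prior research. The selected features were extracted from the essay data using natural language processing and text analysis techniques and were finally assembled into a dataset that could be entered into the analysis.
Second, the analysis of the automated English essay scoring model showed that predictive power differed according to the characteristics of the essay data, and that the importance of the scoring features differed as well. In the random forest analysis, when neither adjustment of the number of scoring categories nor sampling techniques were applied, accuracy was 0.41 for YELC and 0.53 for ASAP. In the analysis of the importance of the scoring features entered into the model, features related to grammatical and mechanical errors, readability, and lexical sophistication showed high importance for YELC, whereas features related to readability, essay length, and lexical diversity showed high importance for ASAP.
Third, the analysis of predictive power by number of scoring categories showed that the model's predictive power increased as the number of categories decreased. For YELC, the value was 0.41 with 7 scoring categories and improved as the number of categories decreased, reaching 0.75 with 2 categories. For ASAP, the value was 0.53 with 8 scoring categories and improved as the number of categories decreased, reaching 0.82 with 2 categories.
Fourth, the analysis of predictive power by scoring features showed that entering both word-sentence-level and paragraph-or-higher-level features made almost no difference to overall predictive power, but the imbalance in predictive power across categories tended to be somewhat reduced. For example, a scoring category that the model failed to predict when only word-sentence-level features were entered showed improved predictive power when paragraph-or-higher-level features were added, indicating an effect on reducing the imbalance in predictive power between scoring categories.
In sum, this study is significant in that it proposes in concrete terms what the feature components of an automated English essay scoring model are and, by analyzing specifically and systematically the predictive power of a machine-learning-based automated scoring model according to item characteristics, the number of scoring categories, and scoring features, suggests a direction for the future development of automated English essay scoring. The feature components and the machine learning analysis techniques presented in this study can be used in subsequent related research. However, as a study that proposes an approach to automated English essay scoring, it needs to be developed further through improvement and supplementation of the feature components and machine learning techniques and through expansion of the target data.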

Language: kor

URI: https://hdl.handle.net/10371/176658

https://dcollection.snu.ac.kr/common/orgView/000000165418

Appears in Collections:

College of Education (사범대학)
- Dept. of Education (교육학과)
  - Theses (Ph.D. / Sc.D._교육학과)
