Improving Multiple-Choice Distractor Generation via Enhanced Reading Comprehension and Human Feedback

Abstract: Despite persistent criticism that multiple-choice assessments fail to fully capture students' academic performance, they remain the most widely used form of assessment owing to their practicality, reproducibility, and ease of scoring. Distractors (incorrect options) are both the most time-consuming part of designing multiple-choice question (MCQ) items and the component that most affects item quality: devising them demands the greatest investment of time and mental effort. Automatic distractor generation (ADG) could therefore substantially reduce the resources required to produce MCQs and supply teachers and students with a rich pool of questions. Yet ADG remains underexplored in natural language processing (NLP), in contrast to the closely related field of question generation, which has received considerable attention while largely neglecting distractors. The scarcity of relevant datasets has further hindered progress, since it rules out supervised language model training at scale.
However, the in-context learning (ICL) capabilities of large language models (LLMs) have paved the way for unsupervised distractor generation. Although LLMs never see distractor generation during training, feeding them only a few sample multiple-choice questions enables them to produce acceptable distractors. Generating optimal distractors, however, hinges on the LLM's deeper comprehension of the provided context. This study proposes a novel ICL method to address this: ReCP (Resolving Reading Comprehension through Prediction). Inspired by pedagogical approaches that help students read beyond the literal meaning through inference, ReCP inserts predictive questions into the context, asking the LLM to predict upcoming content from what precedes it while it interprets the passage. Because these questions cannot be answered extractively from the given context, they encourage the model to reason inferentially and to engage more actively with the text while reading. Models given the context in this way generate distractors more firmly rooted in it, whereas models given the context in the conventional way often fall back on the general background knowledge acquired during training.
Furthermore, reinforcement learning from human feedback (RLHF) is employed to align the generated distractors with human preferences, improving their quality and their agreement with the characteristics of good distractors. Where labeled data is scarce, RLHF serves as a valuable alternative to large supervised datasets, and it indirectly encodes the implicit properties of the distractors teachers prefer. Accordingly, a human-annotated dataset was collected in which annotators chose the better of two distractor sets, one generated with ReCP and one by the baseline, and a policy was trained on it to identify human-preferred distractors.
Three experiments were conducted to confirm the efficacy of ReCP and RLHF. Experiments 1 and 2 investigate how well the LLM adapts to the novel task of distractor generation as the number of sample questions varies, and Experiment 3 tests the influence of RLHF on the model's performance. This work marks a pivotal step toward realizing the potential of LLMs in ADG. Beyond ADG, the proposed prompting method and the use of RLHF hold promise for any domain requiring in-depth textual understanding.
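
To make the ReCP idea concrete, the sketch below interleaves a passage with prediction questions before asking for distractors. It is a minimal illustration, not the thesis's implementation. The following are all hypothetical choices, since the abstract does not give the actual templates: the segmentation rule (blank-line paragraphs), the wording of PREDICTION_QUESTION, the prompt layout, and the names build_recp_context and build_prompt. The `examples` argument stands in for the few-shot sample questions, whose number Experiments 1 and 2 vary.

```python
from typing import List, Sequence

# Hypothetical wording: the thesis's actual prediction questions are not given in the abstract.
PREDICTION_QUESTION = (
    "Prediction: Based on what you have read so far, "
    "what do you expect to come next, and why?"
)


def split_segments(passage: str) -> List[str]:
    """Split a passage into segments at blank lines (an assumed segmentation unit)."""
    return [seg.strip() for seg in passage.split("\n\n") if seg.strip()]


def build_recp_context(passage: str, question: str = PREDICTION_QUESTION) -> str:
    """Interleave passage segments with prediction questions.

    A question follows every segment except the last, so each one asks the
    model to anticipate content it has not read yet rather than to extract
    an answer from text it has already seen.
    """
    segments = split_segments(passage)
    parts: List[str] = []
    for i, seg in enumerate(segments):
        parts.append(seg)
        if i < len(segments) - 1:
            parts.append(question)
    return "\n\n".join(parts)


def build_prompt(
    passage: str,
    stem: str,
    answer: str,
    examples: Sequence[str] = (),
    n_distractors: int = 3,
) -> str:
    """Assemble a distractor-generation prompt around a ReCP-style context.

    `examples` holds pre-formatted sample MCQs for in-context learning;
    its length is the number of sample questions shown to the model.
    """
    shots = "".join(f"{ex}\n\n" for ex in examples)
    return (
        f"{shots}"
        "Read the passage carefully.\n\n"
        f"{build_recp_context(passage)}\n\n"
        f"Question: {stem}\n"
        f"Correct answer: {answer}\n"
        f"Write {n_distractors} plausible but incorrect options grounded in the passage:"
    )


if __name__ == "__main__":
    passage = "Mina found an old key.\n\nShe tried it on every door.\n\nThe attic door opened."
    print(build_prompt(passage, "Which door did the key open?", "The attic door"))
```

A three-segment passage yields two prediction questions. Each question sits at a point where the following segment is still unseen, which is what makes the question non-extractive.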
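
The abstract states that a model was trained on human comparisons between a ReCP distractor set and a baseline set to identify the preferred one. In standard RLHF pipelines, this step fits a reward model with the pairwise Bradley-Terry loss, -log sigmoid(r(chosen) - r(rejected)). The sketch below assumes that formulation. The linear reward over small feature vectors, the meaning of the features, and the function names are hypothetical stand-ins for an LLM-based scorer. The subsequent reinforcement-learning step that fine-tunes the generator is not shown.

```python
import math
from typing import List, Sequence, Tuple

# One comparison: (features of the preferred distractor set, features of the rejected one).
Pair = Tuple[Sequence[float], Sequence[float]]


def reward(w: Sequence[float], x: Sequence[float]) -> float:
    """Linear reward r(x) = w . x (a hypothetical stand-in for an LLM reward head)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sigmoid(z: float) -> float:
    """Logistic function, computed without overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def neg_log_sigmoid(z: float) -> float:
    """Numerically stable -log(sigmoid(z)) = log(1 + exp(-z))."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def pairwise_loss(w: Sequence[float], pairs: Sequence[Pair]) -> float:
    """Mean Bradley-Terry loss: -log sigmoid(r(chosen) - r(rejected))."""
    return sum(
        neg_log_sigmoid(reward(w, chosen) - reward(w, rejected))
        for chosen, rejected in pairs
    ) / len(pairs)


def train(pairs: Sequence[Pair], dim: int, lr: float = 0.5, epochs: int = 200) -> List[float]:
    """Fit w by full-batch gradient descent on the pairwise loss."""
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for chosen, rejected in pairs:
            margin = reward(w, chosen) - reward(w, rejected)
            # Gradient of -log sigmoid(margin) w.r.t. w is -(1 - sigmoid(margin)) * (chosen - rejected).
            coef = 1.0 - sigmoid(margin)
            for k in range(dim):
                grad[k] -= coef * (chosen[k] - rejected[k])
        w = [wk - lr * gk / len(pairs) for wk, gk in zip(w, grad)]
    return w


if __name__ == "__main__":
    # Hypothetical features per distractor set: [context grounding, overlap with the answer].
    pairs = [([0.9, 0.2], [0.3, 0.2]), ([0.8, 0.5], [0.1, 0.6]), ([0.7, 0.1], [0.2, 0.4])]
    w = train(pairs, dim=2)
    print(w, pairwise_loss(w, pairs))
```

At w = 0 every margin is zero and the loss equals log 2. Training lowers it by rewarding the features that separate preferred sets from rejected ones. In a full RLHF loop, the learned reward would then score new distractor sets during fine-tuning.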

Language: eng

URI: https://hdl.handle.net/10371/209995

https://dcollection.snu.ac.kr/common/orgView/000000182307

Files in This Item:

000000182307.pdf 3.68 MB

Appears in Collections:

College of Humanities (인문대학)
- Linguistics (언어학과)
  - Theses (Master's Degree_언어학과)
