Extension and Application of Hierarchical Rater Model Using Bayesian Estimation (베이지안 추정을 활용한 위계적 채점자 모형의 확장 및 적용 연구)

박민호


베이지안 추정을 활용한 위계적 채점자 모형의 확장 및 적용 연구 : Extension and Application of Hierarchical Rater Model Using Bayesian Estimation


Authors: 박민호

Advisor: 박현정

Issue Date: 2021

Publisher: Seoul National University Graduate School

Keywords: hierarchical rater model ; Bayesian estimation ; MCMC ; Gibbs sampling ; rater effect

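Abstract: The purpose of this study is to propose a model that extends the existing hierarchical rater model (HRM) so that rater effects that differ across items in the scoring of constructed-response items in large-scale tests can be identified accurately, and to verify the accuracy and efficiency of this model. Constructed-response items, a simple form of the performance assessment now widely used in evaluation, are scored through raters' judgments, so scoring reliability must be secured from a measurement standpoint. This requires raters who are expert, appropriately severe, and consistent. Considerable effort therefore goes into developing scoring rubrics, and raters are trained before scoring, but complete control is difficult in practice. For this reason, item response theory models emerged that use statistical analysis to remove the influence of rater characteristics and estimate examinee ability accurately, among them the HRM treated in this study.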
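Item response theory models, from the traditional ones through the HRM, have mostly modeled the influence of rater characteristics on scoring as constant regardless of the item. However, the rater-item independence assumption, under which a rater's scoring characteristics are unrelated to the item, is hard to sustain in practice, and generalizability studies based on analysis of variance frequently report violations of it. Against this background, research has appeared that removes the constraint imposed by the rater-item independence assumption from the hierarchical rater model with signal detection theory (HRM-SDT), which is widely used for scoring essay items. Because of the model's complexity, however, it has been used mainly for scoring a small number of essay items and has not been applied to tests made up of many constructed-response items.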
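Accordingly, this study set out to explore models that reparameterize the HRM, which has the advantage of parsimony because it models rater consistency as a random effect, by relaxing its rater-item independence constraint, and to carry out both an application study on real data and a simulation study verifying the model's accuracy and efficiency. Unlike typical model development research, which proceeds through development, verification, and application, this study builds on an existing model. It therefore first applied every model made possible by relaxing the rater-item independence constraint to real data to confirm applicability, and then verified accuracy and efficiency.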
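In the application study, a total of four models were analyzed, defined by whether the rater-item independence constraint was relaxed for the rater severity parameter and for the rater variability parameter in the HRM. These models were applied with a Bayesian MCMC method to data from PIRLS 2006, a large-scale international comparative achievement assessment, to explore whether the analysis is feasible in practice, and a simulation study was conducted to verify the accuracy and efficiency of the proposed model. The results of the application study and the simulation study are as follows.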
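In the application study, the E-HRM (Extended HRM), which assumes that rater consistency differs across items and relaxes the rater-item independence constraint only for the variability parameter, gave the best results in MCMC chain convergence and interpretability, confirming that it is appropriate as the proposed model. The E-HRM analysis also showed that even when rater effects appear consistent on average across all items in an HRM analysis, they may be inconsistent on particular items, and demonstrated that detailed and meaningful additional rater information can be obtained. Furthermore, when scoring data mixing items with heterogeneous rating distributions were analyzed, the HRM showed weaknesses, with unstable estimation or missed heterogeneity, whereas the E-HRM analyzed the heterogeneity of scoring characteristics with stable estimation. On the other hand, even with the MCMC method, under which estimation is in theory possible with a single observation, the E-HRM did not converge properly unless a certain minimum number of ratings per rater-item was secured.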
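In the simulation study, 12 conditions were constructed from two factors, the proportion of raters in the normal range and the number of ratings per rater-item, to evaluate the accuracy and efficiency of the E-HRM. First, the fewer the ratings, the lower the accuracy and efficiency of E-HRM parameter estimation and the larger the overall error tended to be. The proportion of rater severity parameters whose 95% posterior interval contained the true value was at the 95% level in all but one of the 12 design conditions, indicating quite high accuracy. The proportion of variability parameters whose 95% posterior interval contained the true value was .953, .948, and .944 for the conditions with 100%, 70%, and 40% normal-range raters, respectively, at 200 ratings per rater-item, close to 95%. It declined as the number of ratings decreased, however, and at 25 ratings it was somewhat low at .881, .894, and .922 for the 100%, 70%, and 40% conditions, respectively.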
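Second, in identifying rater parameters outside the normal-range criterion, the E-HRM showed high discrimination accuracy for out-of-range rater variability parameters. When the sensitivity and specificity of the E-HRM were analyzed by number of ratings, with 200 ratings the sensitivity was .953 and the specificity .988, so normal-range and out-of-range scoring characteristics were distinguished with very high probability. With 100 ratings, sensitivity was .903 and specificity .985, still above 90% discrimination accuracy, but sensitivity fell as the number of ratings decreased, and at 50 and 25 ratings it was somewhat low at .864 and .794, respectively.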
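Third, the E-HRM was found to identify rater effects accurately and adjust for them appropriately, thereby controlling the impact of severe (or lenient) or unreliable raters on the accuracy and efficiency of examinee and item parameters. In addition, in situations where the posterior distributions of rater parameter estimates were frequently and clearly multimodal, lowering the HRM's convergence rate and accuracy, the E-HRM had no major estimation problems and showed better parameter recovery than the HRM.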
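This study has the limitations that the E-HRM with Bayesian estimation was applied to a single real dataset and that its performance was evaluated only under restricted conditions that do not cover the full variety of scoring designs. Nevertheless, it is significant in that data from 24 countries in PIRLS 2006, which have varied scoring designs, were analyzed with the Bayesian E-HRM to confirm its estimability and interpretability, and in that its accuracy and efficiency were verified under specific conditions through simulation. Having confirmed the usefulness of the E-HRM, the study expects that the detailed rater information it provides at the rater-item level can serve as important diagnostic information for improving rater expertise and can be very useful in substantially reducing the effort, time, and cost of rater retraining and rescoring in large-scale tests.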
The purpose of this study was to propose an extended hierarchical rater model (HRM) for identifying differential rater effects across the constructed-response items of a large-scale test, and to evaluate its accuracy and efficiency. Constructed-response items, a simple form of performance assessment that has been widely used in recent years, are scored by raters, so rating reliability is crucial to their effectiveness. To ensure rating reliability, the rubric must be written clearly during test development so that it is not interpreted differently by different raters. Because raters must be expert, appropriately severe, and consistent, they must also receive sufficient training before rating begins. Despite these efforts in test development and rating, it is difficult to control rater effects completely. Rater effect models based on item response theory, including the HRM, have emerged to remove rater effects and estimate examinee ability accurately.
Most item response theory models, from the traditional models to the HRM, assume that the influence of rater effects is constant regardless of the item. In practice, however, the rater-item independence assumption is difficult to maintain, and generalizability studies based on ANOVA frequently report violations of it. In this context, studies have removed the constraint imposed by the rater-item independence assumption from the hierarchical rater signal detection model (HRM-SDT), a method commonly used to analyze rating data from essay-type items. Because of its complexity, however, that approach has been applied only to small numbers of essay-type items, not to tests comprising a large number of constructed-response items.
Thus, this study explored models that reparameterize the HRM by relaxing its rater-item independence constraints. The HRM has the advantage of parsimony because it models rater consistency as a random effect. An application study and a simulation study were then conducted to verify the accuracy and efficiency of the model. Unlike typical research that develops, verifies, and then applies a model, this study builds on an existing model: it first applied all models made possible by relaxing the rater-item independence constraints to real data, and then verified accuracy and efficiency.
A total of four models were analyzed, defined by whether the rater-item independence constraint was relaxed for the rater severity parameters and for the rater variability parameters of the HRM. These models were applied to data from PIRLS 2006, a large-scale international comparative achievement assessment, using a Bayesian MCMC method to explore their applicability. In addition, a simulation study was conducted to verify the accuracy and efficiency of the proposed model.
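To make the difference between a per-rater and a per-rater-item variability parameter concrete, here is a minimal Python sketch. It is not the thesis's code. It assumes the rating stage usually given for the HRM in the literature, where the probability of an observed category is proportional to a normal-shaped kernel centred at the ideal rating plus a rater shift. The function names, the sign convention of the shift, and all numeric values are hypothetical, and the exact parameterization used in the thesis may differ.

```python
import math


def rating_probs(ideal, shift, spread, n_categories):
    """Return P(observed = k | ideal) for k = 0 .. n_categories - 1.

    Each category gets a weight from a normal-shaped kernel centred at
    ideal + shift with width `spread`; weights are normalised to sum to 1.
    """
    if spread <= 0:
        raise ValueError("spread must be positive")
    center = ideal + shift
    weights = [
        math.exp(-((k - center) ** 2) / (2.0 * spread ** 2))
        for k in range(n_categories)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def spread_for(model, psi_by_rater, psi_by_rater_item, rater, item):
    """Pick the variability parameter: per rater (HRM) or per rater-item (E-HRM)."""
    if model == "HRM":
        return psi_by_rater[rater]
    if model == "E-HRM":
        return psi_by_rater_item[(rater, item)]
    raise ValueError(f"unknown model: {model}")


# Hypothetical values: rater r1 is consistent on item1 but erratic on item2.
psi_by_rater = {"r1": 0.5}
psi_by_rater_item = {("r1", "item1"): 0.4, ("r1", "item2"): 1.5}

probs_item1 = rating_probs(
    2, 0.0, spread_for("E-HRM", psi_by_rater, psi_by_rater_item, "r1", "item1"), 4
)
probs_item2 = rating_probs(
    2, 0.0, spread_for("E-HRM", psi_by_rater, psi_by_rater_item, "r1", "item2"), 4
)
```

With a single per-rater spread, the two items would share one probability vector. With a per-rater-item spread, the same rater can be sharply peaked at the ideal rating on one item and diffuse on another, which is the kind of item-specific inconsistency the abstract describes.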
In the application study, the E-HRM (Extended HRM), which assumes that rater consistency varies across items and relaxes the rater-item independence constraint only for the variability parameters, showed the best MCMC chain convergence and interpretability. The E-HRM analysis revealed that rater effects can be inconsistent on particular items even when they appear consistent on average across all items, and that further meaningful information about differential rater effects on each item can be obtained. In addition, the E-HRM was better than the HRM at analyzing data consisting of items with heterogeneous rating distributions. However, even with MCMC, the E-HRM did not converge adequately without a sufficient number of ratings per rater-item.
In the simulation study, 12 conditions were constructed from two factors (the proportion of normal-range raters and the number of ratings per rater-item) to evaluate the accuracy and efficiency of the E-HRM. The simulation study yielded the following results. First, parameter estimation was less accurate and less efficient, and errors were larger, when there were fewer ratings per rater-item. The proportion of rater severity parameters whose 95% posterior interval contained the true value was at about the 95% level, indicating high accuracy, in all but one simulation design condition. For rater variability parameters, with 200 ratings per rater-item this proportion was .953, .948, and .944 under the conditions with 100%, 70%, and 40% normal-range raters, respectively, close to 95%. As the number of ratings per rater-item decreased, however, the proportion fell, reaching .881, .894, and .922 for the 100%, 70%, and 40% conditions, respectively, at 25 ratings.
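The coverage proportions reported above can be illustrated with a short sketch. It assumes an equal-tailed 95% interval computed from posterior draws by linear-interpolated quantiles. The abstract does not state how the intervals were constructed, so the interval type, the function names, and the toy data are all assumptions made for illustration.

```python
def equal_tailed_interval(draws, level=0.95):
    """Equal-tailed interval from posterior draws (linear-interpolated quantiles)."""
    xs = sorted(draws)

    def quantile(q):
        pos = q * (len(xs) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    tail = (1.0 - level) / 2.0
    return quantile(tail), quantile(1.0 - tail)


def coverage_rate(true_values, draws_per_parameter, level=0.95):
    """Share of parameters whose interval contains the generating (true) value."""
    hits = 0
    for truth, draws in zip(true_values, draws_per_parameter):
        low, high = equal_tailed_interval(draws, level)
        if low <= truth <= high:
            hits += 1
    return hits / len(true_values)


# Toy data (hypothetical): the same 101 evenly spaced "draws" for three parameters.
toy_draws = [float(v) for v in range(101)]
toy_low, toy_high = equal_tailed_interval(toy_draws)
toy_rate = coverage_rate([50.0, 99.0, 1.0], [toy_draws, toy_draws, toy_draws])
```

In a simulation such as the one described, `true_values` would be the generating rater parameters and `draws_per_parameter` the retained MCMC draws for each of them. A well-calibrated 95% interval should give a rate near .95.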
Second, in discriminating aberrant ratings, the E-HRM identified out-of-range rater variability parameters with high accuracy. The sensitivity and specificity of the E-HRM were analyzed by the number of ratings per rater-item. With 200 ratings, sensitivity and specificity were .953 and .988, respectively, so normal and aberrant rating characteristics could be distinguished with high probability. With 100 ratings, sensitivity and specificity were .903 and .985, still above 90%. However, sensitivity decreased as the number of ratings per rater-item decreased, and was somewhat lower at 50 and 25 ratings, at .864 and .794, respectively.
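For readers less familiar with the two indices, the following sketch shows how sensitivity and specificity are computed from true and flagged labels. The rule the thesis uses to flag a rater-item parameter as out of range is not given in the abstract, so the code takes the flags as given. The function name and example labels are hypothetical.

```python
def sensitivity_specificity(truly_aberrant, flagged):
    """Return (sensitivity, specificity) for paired boolean labels.

    sensitivity = flagged aberrant cases / all truly aberrant cases
    specificity = unflagged normal cases / all truly normal cases
    """
    tp = fn = tn = fp = 0
    for truth, flag in zip(truly_aberrant, flagged):
        if truth and flag:
            tp += 1
        elif truth and not flag:
            fn += 1
        elif not truth and not flag:
            tn += 1
        else:
            fp += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one aberrant and one normal case")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels for seven rater-item parameters.
example_truth = [True, True, True, False, False, False, False]
example_flags = [True, True, False, False, False, False, True]
example_sens, example_spec = sensitivity_specificity(example_truth, example_flags)
```

Read this way, a sensitivity of .794 at 25 ratings would mean that roughly one in five truly out-of-range parameters went unflagged, while specificity stayed high at the rating counts for which it is reported.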
Last, the E-HRM was able to identify rater effects clearly and to adjust appropriately for the effects of rater severity or inconsistency on the accuracy and efficiency of examinee and item parameter estimates. Further, where the HRM showed poor convergence rates and accuracy because the posterior distributions of the rater parameter estimates were multimodal, the E-HRM showed stable estimation and better parameter recovery than the HRM.
This study is not without limitations. The E-HRM with Bayesian estimation was applied to only one real dataset, and the simulation covered a limited set of designs. However, data from 24 countries in PIRLS 2006, with varying rating designs, were analyzed with the E-HRM to confirm its estimability and interpretability. Moreover, the simulation study evaluated the accuracy and efficiency of the E-HRM under specific conditions. With the usefulness of the E-HRM confirmed, the detailed rater-effect information it provides for each rater-item combination is expected to help improve rater expertise and to significantly reduce the cost, effort, and time of retraining and rescoring in large-scale tests.

Language: kor

URI: https://hdl.handle.net/10371/178529

https://dcollection.snu.ac.kr/common/orgView/000000168587

Files in This Item:

000000168587.pdf 7.62 MB

Appears in Collections:

College of Education
- Dept. of Education
  - Theses (Ph.D. / Sc.D., Dept. of Education)
