A Study of Rater Factors in Korean Writing Assessment (한국어 쓰기 평가의 채점자 요인 연구)

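Abstract: The purpose of this study is to examine the rater effects that appear in the scoring results of a Korean writing assessment, to analyze the verbal protocols that raters produced while scoring in order to identify scoring patterns and the rater factors behind them, and to draw implications for rater training.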
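Previous rater studies have mostly formed groups according to rater background factors and then compared the logit values from MFRM analysis for rater severity, a tendency inherent to each rater, and for rater fit. However, even when rater background factors are held constant, statistically significant differences in severity between raters are still reported (김지영·원미진, 2018), and the rater factors that actually operate in scoring situations include not only background factors but also social and cultural factors that require individual consideration (Barkaoui, 2007). This means that, beyond background factors, there are additional rater factors that affect scoring performance. Recently, the argument has gained ground that rater background factors can be interpreted only when raters' scoring performance is examined in detail and considered together with the factors operating in the scoring process (이성준, 2019). As one concrete way of putting this into practice, this study proposes that the rater cognitive and psychological factors mentioned in studies such as Shaw & Weir (2007) need to be considered. To this end, rater verbal protocols were collected to examine the scoring patterns that underlie the rater effects reported by the MFRM analysis, and the cognitive and psychological factors behind those patterns were analyzed. The results were then discussed together with rater background factors, and implications for rater training were derived.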
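First, a theoretical foundation was established. Chapter II identifies the characteristics of second language writing assessment and the factors involved in it, and then discusses the mediating nature of rater factors in writing assessment. Building on the existing literature, a hypothetical model of the rater mediation process in writing assessment was constructed, and it is argued that the rater is the agent of scoring who mediates between the written response and the scoring criteria while interacting with other scoring factors. Next, a theoretical model was constructed to explain the scoring process of writing assessment raters. This is a hypothetical model of how the factors involved in scoring affect writing assessment along the rater's cognitive process, and it provides the theoretical basis for coding and analyzing the rater verbal protocols. In this study, raters are regarded as moving back and forth between two stages while scoring: forming an image and representation of the examinee's written response, and deciding the score.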
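Next, to examine raters' scoring performance in a concrete Korean writing assessment setting, a free-composition writing task for intermediate and advanced learners was first developed. Five Korean learners each were then recruited from Korean language education institutions and from undergraduate programs of universities in Korea, and their written responses were collected. To have the raters score these responses, assessment constructs were selected and a scoring rubric was created using an analytic scoring method and a 5-point Likert scale. Descriptors were also provided, which completed the scoring criteria given to the raters in this study.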
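To collect scoring data, six Korean language teachers working at university institutions were recruited as raters. Three had 1 to 2 years of experience, two had 4 to 5 years, and one had 10 years or more. They carried out fully crossed scoring of the 10 written responses from the examinees who took the writing task developed in this study. Because the scoring rubric provided to the raters consisted of 11 scoring criteria in a total of five assessment categories, 660 scoring results were collected in total (6 raters × 10 responses × 11 criteria). For an integrated analysis of the data, the raters were asked to report the cognitive processes occurring during scoring, using the verbal protocol report method (Ericsson & Simon, 1993) while they scored, and about 10 hours and 45 minutes of recordings were collected. All recordings were transcribed and coded in a form suitable for the analysis of rater factors.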
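The size of the fully crossed design described above can be checked with a short enumeration. The following is a minimal Python sketch that uses only the counts given in the abstract; the examinee labels (`S01` to `S10`) are hypothetical placeholders, and the rater and criterion labels simply follow the `R..` and `C..` pattern used in the English summary.

```python
from itertools import product

# Counts come from the abstract; the label strings are illustrative only.
raters = [f"R{i:02d}" for i in range(1, 7)]       # 6 raters
responses = [f"S{i:02d}" for i in range(1, 11)]   # 10 written responses (hypothetical labels)
criteria = [f"C{i:02d}" for i in range(1, 12)]    # 11 scoring criteria

# Fully crossed design: every rater scores every response on every criterion.
observations = list(product(raters, responses, criteria))

print(len(observations))  # 6 * 10 * 11 = 660
```

In a fully crossed design no rater-by-response cell is missing, which is why the total is a plain product of the three counts.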
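Before analyzing rater factors in the scoring process of the writing assessment, rater effects were first identified through MFRM analysis in order to interpret the scoring results. The scoring situations in the MFRM results that required additional qualitative analysis were then specified, which fixed the targets of analysis.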
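For orientation, a commonly used rating scale form of the many-facet Rasch model with examinee, rater, and criterion facets is written out below. The abstract does not state the exact model specification used in the thesis, so the choice of facets and all symbols here are assumptions made for illustration.

```latex
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \alpha_j - \delta_i - \tau_k
```

Here $P_{nijk}$ is the probability that examinee $n$ receives score $k$ from rater $j$ on criterion $i$, $\theta_n$ is the examinee's ability, $\alpha_j$ is the rater's severity, $\delta_i$ is the difficulty of the criterion, and $\tau_k$ is the difficulty of score $k$ relative to score $k-1$. All parameters are expressed on a common logit scale, which is what allows rater severity to be compared across raters.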
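Next, a qualitative analysis was conducted of the rater effects reported by the MFRM analysis. Verbal protocol analysis (VPA) was carried out for each situation to examine the scoring patterns associated with each rater effect and to analyze the rater factors that produced those patterns. To this end, think-aloud data on the scoring process were collected together with the scoring results at the time of scoring, and were transcribed into text as verbal protocols. The transcribed text was divided into segments and coded according to a coding scheme organized by stage of the scoring process. Rater factors were then analyzed through additional protocol analysis according to the patterns of scoring performance.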
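The rater verbal protocol analysis reported the following rater cognitive style factors as affecting raters' lack of scoring consistency: a score decision style that takes the examinee's feelings into account, a style of applying scoring evidence that relies on the scoring criteria, and a strict scoring style toward high-level examinees. For central tendency, the rater factors reported were a score decision style that avoids the middle score and a score decision style that avoids the lowest score. For randomness, the cognitive and psychological factors reported were a fixation on compensatory scoring and impression-based scoring caused by the absence of scoring evidence. For the halo effect, the cognitive and psychological factors reported were sensitivity to linguistic errors and arbitrary interpretation and application of the scoring criteria. For biased scoring, the rater factors reported were rater concentration and preference for particular written responses.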
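Based on the rater effects identified through MFRM analysis and the verbal protocol analysis of the corresponding cognitive and psychological factors, Chapter V discusses rater training for Korean writing assessment. For this discussion, the rater background factors and the rater cognitive and psychological factors found in the analysis of this study's scoring data were reviewed together. The first background factor discussed is experience. Raters with five or more years of experience showed central tendency on the scoring criteria for text organization, while raters with less than five years showed central tendency on the scoring criteria for language appropriateness. Analysis of the related verbal protocols showed that raters with five or more years of experience approached the written responses mainly through top-down reading, whereas raters with less than five years approached them mainly through bottom-up reading. As a result, raters with five or more years scored the criteria related to text organization more attentively, and raters with less than five years scored the criteria related to language use more attentively, and each group was relatively more severe on those criteria. Accordingly, the study proposes rater training that enables less experienced raters to use top-down reading in scoring as well, so that they do not miss the overall characteristics of a response while focusing on its micro-level features when building an image of the text.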
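Next, failure to maintain severity as the level of the written responses changed was most pronounced among the less experienced raters with 1 to 2 years of experience, while the mid-career raters with 4 to 5 years and the highly experienced rater with 10 years or more were relatively less affected. To discuss ways of reducing this failure to maintain severity, the study confirmed that the scoring criteria that matter most differ by the level of the written response, and accordingly proposed that less experienced raters can be taught which scoring criteria to focus on at each response level.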
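The study also discusses how differences in the construction of frameworks of scoring that arise from rater experience can be examined concretely by analyzing scoring performance in terms of rater cognitive and psychological factors. The rater factors reported by the verbal protocol analysis, namely impression-based scoring without scoring evidence, application of scoring evidence that relies excessively on the scoring criteria, and fixation on compensatory scoring, all appeared among less experienced raters, and this study takes this to show differences in the construction of scoring frameworks caused by raters' scoring-related schema. It is therefore argued that concrete rater training content can be prepared by referring to how highly experienced raters score in the same scoring situations.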
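Next, in the discussion of rater cognitive and psychological factors, the study reports the need for training in balanced use of the scale scores. Analysis of the scoring data showed unbalanced use of scale scores among the raters in this study, caused by a score decision style that avoids the middle score. Only one rater used the midpoint score of 3 most often. Instead, because of a cognitive and psychological tendency not to assign the midpoint, overfitting and misfitting use of the scale was reported at 2 points and 4 points, a result that differs from the central tendency reported by Myford & Wolfe (2004). A score decision style that does not assign the lowest score was also reported. Because such unbalanced scale use leads to misfitting and overfitting use of the other scale scores, the study proposes that education and training in balanced use of the scoring scale are needed.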
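The kind of scale usage pattern described above can be inspected with a simple tally of how often a rater assigns each score. The following is a minimal Python sketch, assuming a 5-point scale as in the study; the function name `score_usage` and the sample ratings are hypothetical and are not data from the thesis.

```python
from collections import Counter

def score_usage(ratings, scale=(1, 2, 3, 4, 5)):
    """Return the share of ratings assigned to each point of the scale."""
    counts = Counter(ratings)
    total = len(ratings)
    # Points that were never used get a share of 0.0 (Counter returns 0 for missing keys).
    return {point: counts[point] / total for point in scale}

# Hypothetical ratings from one rater, for illustration only.
ratings = [4, 3, 4, 2, 4, 3, 5, 4, 3, 1]
usage = score_usage(ratings)

for point, share in usage.items():
    print(f"{point} points: {share:.0%}")
```

A rater who avoids the midpoint would show a low share at 3 points and inflated shares at 2 and 4 points; usage shares alone do not establish misfit, so they are best read alongside the fit statistics for each scale score.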
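Next, differences in scoring-related schema are discussed. This study found that the assessment constructs that raters regard as important can differ from rater to rater, and that scale scores can be used subjectively according to a rater's personal preference for particular written responses. To prevent this, the study proposes rater moderation in which the use of the scoring criteria is discussed concretely through sufficient prior consultation among raters before scoring begins. In particular, because the scoring criteria related to sociolinguistic function leave more room for variation in application and interpretation, the study argues that clear standards are needed for them.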
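Finally, the rater concentration factor is discussed. The rater reported to have had the largest rater effect in this study's Korean writing assessment was the highly experienced rater with 10 years or more of experience. Although this rater had received official rater training and showed good overall fit and inter-rater reliability, the MFRM analysis showed the largest residual between observed and adjusted scores, indicating the largest rater effect. Tracing the cause showed that, in a particular scoring situation, the rater scored with bias because of the concentration factor, and that this changed the overall ranking of the examinees. On this basis, the study suggests that a critical error can occur because of the concentration factor despite a rater's long experience and training. To prevent this, it suggests drawing raters' attention to the risk and considering the contextual factors of the scoring situation so that raters can score with concentration.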
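This study identified the rater effects that arise in scoring performance because of the rater-mediated nature of writing assessment, and examined the rater factors behind them. Its significance lies in using rater verbal protocol data, on the basis of MFRM analysis of the raters' scoring results, to reveal these patterns and to identify the influence of rater cognitive and psychological factors.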
The purpose of this paper is to examine the rater effects shown in the scoring results of a Korean writing assessment and to analyze the raters' verbal protocols during scoring performance in order to identify rater factors.
Existing findings on rater background factors can be explained validly only when the specific scoring situation is considered through the raters' scoring process. In other words, background factors and their effects can be properly interpreted when it can be explained under what circumstances, how, and on what basis a rater applied a scoring method and decided on a score. In this process, apart from background factors, there are cognitive and psychological factors that affect raters during scoring performance. This study distinguished these from rater background factors and sought to determine how they affect the scoring results by analyzing the raters' verbal protocols. To this end, we designed an integrated (mixed-methods) study that combines the results of many-facet Rasch measurement (MFRM), which can be used to interpret the raters' scoring results, with verbal protocol analysis (VPA) of the raters' scoring process.
Chapter II identifies the characteristics of second language writing assessment and the factors involved in it, and then discusses the mediating nature of rater factors in writing assessment. Based on the existing discussion, we constructed a hypothetical model of the rater mediation process in writing assessment and argued that the rater is the agent who mediates between the written response and the scoring criteria. Next, we constructed a theoretical model to explain the scoring process of writing assessment raters. This is a hypothetical model of how the multiple variables involved in scoring affect writing assessment along the rater's cognitive process, and it provides a theoretical basis for coding and analyzing rater verbal protocols. In this regard, we presented a cognitive process model of the writing assessment rater based on the concept of a phenotype in cognitive linguistics, in which the rater performs scoring in two phases: forming an image and representation of the examinee's written response, and determining the score.
Next, to examine the raters' scoring process in the context of a specific Korean writing assessment, a free-writing assessment task for intermediate and advanced learners was first devised. Five Korean learners each were then recruited from Korean language education institutions and from undergraduate programs of Korean universities, and their written responses were collected. In addition, assessment constructs were selected so that the examinees' written responses could be scored, and a scoring rubric was produced using an analytic scoring method and a 5-point Likert scale. Descriptors were also provided, which completed the scoring criteria for the raters of this study.
To collect scoring data, six Korean language teachers working at university institutions were recruited as raters. Three raters had one to two years of experience, two had four to five years, and one had ten years or more, and they carried out fully crossed scoring of the written responses of the 10 test takers who took the writing task devised in this study. A total of 660 scoring results were collected, as the scoring rubric provided to the raters consisted of 11 scoring criteria corresponding to a total of four evaluation categories. For an integrated analysis of the data, the raters produced verbal protocol reports (VPR; Ericsson & Simon, 1993) while scoring, which yielded approximately 10 hours and 45 minutes of recordings. The recordings were all transcribed and coded in a form suitable for the analysis of rater factors.
Prior to the analysis of rater factors in the scoring process of the writing assessment, rater effects were first identified through MFRM analysis in order to interpret the scoring results. The results of the MFRM analysis, and the points at which additional qualitative analysis is required, are summarized as follows:
First, the severity of all raters was reported to be within the normal range. MFRM analysis of overall scoring at the group level and at the individual level did not report anything notable about severity. Therefore, rater severity is expected to require additional analysis for each more specific scoring situation, rather than at the level of overall scoring.
Second, regarding rater consistency, R04 had an infit mean square (Infit MnSq) of .74 and an infit standardized value (Infit ZStd) of -2.3, and was classified as a rater who tends to overfit. Because a rater's tendency to overfit can have a variety of causes, further qualitative discussion is needed of what in this rater's scoring led to the overfit result.
Third, according to the Rasch measurement information from the MFRM analysis, R06 had a residual of .1, the largest difference between observed and adjusted scores. A large residual suggests a large rater effect, while the overall scoring consistency of R06 was 1.12, so it can be inferred that R06 made a large error due to a rater factor in a particular scoring situation.
Fourth, similarly, according to the Rasch measurement information from the MFRM analysis, the residual between R03's observed score and MFRM-adjusted score was reported as .08, which also points to a considerable rater effect. R03's overall scoring was the closest to the average among the raters, and its overall scoring consistency was .92. Therefore, it is inferred that R03 also made errors due to rater factors in particular scoring situations.
Fifth, to analyze the possibility of subjective use of scale scores, such as randomness and the halo effect, the outfit mean square values for each scale score were examined for each rater. For R01, values of 1.5, 1.4 (3 points), and 1.5 (4 points) were reported. In addition, 4 points (1.3) for R02 and 2 points (1.3) for R03 were also reported to show poor fit in scale score use. It is therefore necessary to verify this further through qualitative discussion that examines the raters' verbal protocols during scoring, focusing on the situations in which these scale scores were used.
Sixth, to analyze central tendency at the individual rater level, the proportion of each scale score used by each rater and the fit mean square value for each scale score were examined. R01 used 3 points (31%) and 4 points (40%) at high rates, which suggests possible central tendency and can be interpreted as confusion in R01's use of the scale scores. R04 may show central tendency at 2 points (.6) and 4 points (.7); its usage proportions were 17% for 2 points and 31% for 4 points, so its use is expected to be concentrated on 4 points. For R03, possible central tendency was reported at 4 points (7.7), with a usage proportion of 17%. A qualitative analysis of the raters' verbal protocols in these scoring situations is required to determine specifically what contributed to this use of the scale scores.
Seventh, checking the fit of the scoring criteria used by the raters in order to identify central tendency at the group level showed overfitting trends in C03 (.81), C08 (.89), C06 (.88), C05 (.94), and C10 (.95). Among these, group-level central tendency appeared in the use of 2 points (6.6) and 4 points (4.4) of C03 and of 5 points (6.6) of C10.
Eighth, the MFRM analysis results for biased scoring by raters are presented. These are cases in which a rater scored with bias with respect to test takers, scoring criteria, or both, with five, three, and two cases reported across these types.
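The infit and outfit mean squares cited in the summary above are both built from the residuals between observed and model-expected scores. The following is a minimal Python sketch of the standard definitions, assuming the expected score and model variance for each observation are already available from a fitted Rasch model; the function name `fit_mean_squares` and the sample values are hypothetical and are not data from the thesis.

```python
def fit_mean_squares(observed, expected, variance):
    """Return (infit, outfit) mean squares for one rater's observations."""
    sq_resid = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Outfit: unweighted mean of squared standardized residuals,
    # so it is sensitive to isolated unexpected scores.
    outfit = sum(r / w for r, w in zip(sq_resid, variance)) / len(sq_resid)
    # Infit: information-weighted version, i.e. the sum of squared
    # residuals divided by the sum of the model variances.
    infit = sum(sq_resid) / sum(variance)
    return infit, outfit

# Hypothetical observed scores, expected scores, and model variances.
observed = [3, 4, 2, 5]
expected = [3.2, 3.6, 2.5, 4.4]
variance = [0.8, 0.7, 0.9, 0.6]

infit, outfit = fit_mean_squares(observed, expected, variance)
print(round(infit, 3), round(outfit, 3))
```

Both statistics have an expected value of 1 under the model. Values well below 1 indicate less variation than the model predicts (overfit), which matches the reading of R04's infit of .74 above, and values well above 1 indicate more variation than predicted (misfit).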
Next, a qualitative analysis of the reported rater effects was conducted based on the MFRM results for the scoring data above. Verbal protocol analysis (VPA) was conducted for each situation to examine the scoring patterns associated with each rater effect and to analyze the rater factors that caused those patterns. To this end, think-aloud data on the scoring process were first collected along with the scoring results at the time of scoring, in the form of verbal protocol reports, and were transcribed into text. The transcribed text was segmented and coded according to a coding scheme organized by stage of the scoring process. Additional protocol analysis was then used to analyze the rater factors according to the patterns of scoring performance.
The rater verbal protocol analysis reported rater cognitive style factors that affect raters' lack of scoring consistency: a style tied to recognition, a mechanical style of applying scoring evidence, and a strict scoring style toward high-level examinees. Next, a cognitive style of avoiding the middle score was reported as a rater factor affecting central tendency, and persistence in compensatory scoring and poor application of the scoring criteria were reported as cognitive and psychological factors affecting randomness. For the halo effect, a scoring-schema factor of sensitivity to language-use categories was reported, and for biased scoring, a rater concentration factor and a scoring-related schema factor of preference for particular written responses were reported.
Based on the results of the verbal protocol analysis of these rater cognitive and psychological factors, Chapter V discusses rater training for Korean writing assessment. The discussion of rater training is divided into a discussion of rater background factors and a discussion of rater cognitive and psychological factors.

Language: kor

URI: https://hdl.handle.net/10371/178408

https://dcollection.snu.ac.kr/common/orgView/000000168144

Files in This Item:

000000168144.pdf 3.05 MB

Appears in Collections:

College of Education (사범대학)
- Dept. of Korean Language Education (국어교육과)
  - Theses (Master's Degree_ 국어교육과)
