Unsupervised Discovery of Non-Categorical L2 Error Patterns Using Wav2Vec2.0 Code Vector Features

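Abstract: L2 pronunciation is realized under the interaction of two sound systems and therefore has a more complex identity than a single phoneme category. This non-categorical property calls for assessment at a finer grain than the phoneme. Because sub-segmental inspection requires considerable expert labor, however, automated studies that discover error characteristics directly from data have emerged. Yet these studies share a contradictory limitation: to find variation patterns beyond the phoneme in an unsupervised manner, they rely on phonetic posteriorgrams, a supervised feature that is phonemically constrained. This study therefore adopts, as an alternative analysis feature, the Wav2Vec2.0 code vector, a speech unit acquired through representation learning that requires only unsupervised training and no external constraint. By keeping the main framework of the earlier automated studies, it explores whether code vectors can explain pronunciation variation that cannot be defined by segmental errors alone.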

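The feature's suitability for representing pronunciation errors is examined in two stages: verifying its L2 discriminability through usage-frequency counts, and analyzing types (patterns) through sequence analysis of segment-level error samples. From content-controlled L1 (CMU ARCTIC) and L2 (L2 ARCTIC) single-speaker corpora that share the same utterance list, the frequencies of each speaker's code vector inventory were built into vectors, clustered, and compared. In addition, a model fine-tuned on L1 TIMIT was used to detect segment-level errors in L2 NIA037 and thereby select the samples for analysis. Forced alignment locates the speech interval corresponding to each erroneous phoneme, and the code vector sequences of the frames in that interval are extracted from the original pre-trained model. Internal analysis then proceeds through representative-index summarization, frequency counting, and sequence merging to derive the dominant patterns. The patterns are finally interpreted against reference material built from the same L1 TIMIT. Specifically, phonetic characteristics were inferred from L1 phoneme-code vector co-occurrence probabilities, and all code vectors present in the L1 data were clustered to judge the relationships among patterns and the distinctiveness of each type.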

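The experiments first confirmed the feature's L2 discriminability: the frequency vectors of L1 and L2 speakers separated into distinct clusters. The difference in inventory between the two speaker groups appeared in particular as an inventory size that shrinks with lower L2 proficiency, because speech at that level is insufficient to make full use of speech units learned on an L1 basis. The sub-segmental patterns also showed three common characteristics. 1) The error types formed a continuum according to how strongly they reflected the changed articulatory feature, and 2) the intermediate type in each continuum was two-sided, realizing opposite values of that feature in the two codebooks. 3) The distribution of types was skewed toward the closest sound in the learner's L1, and target sounds with less familiar articulatory features caused greater dispersion. Each finding has a linguistic interpretation. The gradual ordering along the changed feature shows that pronunciation variation cannot be defined within a segmental frame, and the conflicting combination in the intermediate type in particular epitomizes non-categoricity, since it cannot be assigned to a clear phoneme category. Because non-categoricity is an L2 property, these patterns also occurred only sparsely in the L1 data. The asymmetry tied to the L1 sound system reflects that pronunciation variation arises fundamentally from the influence of the learner's native language on the target language. We therefore argue that code vectors, as a means of quantifying the continuum-like character of L2 pronunciation, offer an alternative way to assess the gradience of pronunciation errors.

L2 pronunciation is shaped by the interaction of two sound systems, which makes its identity more complex than a single phoneme category. This non-categorical nature demands assessment at a level finer than the phoneme. Such sub-segmental inspection, however, is highly labor-intensive, which has prompted a line of work on unsupervised error pattern discovery. Previous studies nevertheless have a contradictory limitation: to discover variation patterns beyond phonemes without supervision, they rely on the phonetic posteriorgram (PPG), a supervised and phoneme-prescribed feature. This work instead adopts the Wav2Vec2.0 code vector, a self-supervised learning (SSL) representation acquired through an unsupervised and non-prescriptive process. While maintaining the previous workflow, we aim to understand how well this feature explains the sub-segmental variations present within a single segmental error.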

To explore the range of variation that code vectors capture, we first verify that they discriminate L1 from L2 speech through a frequency-based usage comparison. From the L1 (CMU ARCTIC) and L2 (L2 ARCTIC) single-speaker corpora, which share the same reading prompts, the occurrence probabilities of the code vectors were assembled into one vector per speaker. These vectors were then clustered to confirm that L1 and L2 speakers fall into different clusters. We then look within L2 speech by analyzing patterns among segmentally identical examples. A model fine-tuned on L1 TIMIT was run on L2 NIA037 for segmental error detection to select the error samples for sub-segmental analysis. For each error type, the code vector sequences of the corresponding frames were extracted from the pre-trained model using the forced-alignment time stamps. Dominant patterns among the sequences were then derived through pruning, abstraction, and counting. Finally, these patterns were compared against L1 reference material, likewise built from TIMIT, to interpret their phonetic attributes and their relationship to other sub-segmental patterns: the conditional probabilities of phonemes for each code vector served the first purpose, and clustering of all raw code vectors observed in the L1 data served the second.
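The pruning, abstraction, and counting steps can be pictured with a minimal Python sketch. It assumes that each frame is represented by a pair of indices, one per Wav2Vec2.0 codebook, that pruning drops runs of identical codes shorter than a minimum length, and that abstraction collapses each remaining run to a single symbol. The function names, the `min_run` threshold, and the toy sequences are illustrative assumptions, not the procedure used in the thesis.

```python
from collections import Counter
from itertools import groupby
from typing import List, Tuple

# Assumption: a frame's code is a pair of indices, one per codebook.
Code = Tuple[int, int]


def prune_and_abstract(frames: List[Code], min_run: int = 2) -> Tuple[Code, ...]:
    """Drop runs shorter than min_run, then collapse each run to one symbol."""
    runs = [(code, len(list(group))) for code, group in groupby(frames)]
    kept = [code for code, length in runs if length >= min_run]
    # Pruning can make identical codes adjacent again, so merge them once more.
    return tuple(code for code, _ in groupby(kept))


def dominant_patterns(samples: List[List[Code]], min_run: int = 2, top: int = 3):
    """Count abstracted sequences across error samples, most frequent first."""
    counts = Counter(prune_and_abstract(s, min_run) for s in samples)
    return counts.most_common(top)


# Toy error samples with hypothetical indices, not taken from any corpus.
samples = [
    [(5, 9), (5, 9), (5, 9), (7, 9), (7, 2), (7, 2)],
    [(5, 9), (5, 9), (7, 2), (7, 2), (7, 2)],
    [(5, 9), (5, 9), (3, 3), (5, 9), (5, 9)],
]
patterns = dominant_patterns(samples)
```

In this toy data the first two samples reduce to the same two-symbol pattern, so it ranks first, while the third sample reduces to a single symbol once its one-frame interruption is pruned.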

The comparative analysis showed that L1 and L2 speakers use code vectors differently: their frequency vectors separated cleanly into two clusters according to nativeness. The difference is marked by an inventory that shrinks as L2 proficiency decreases, which reflects difficulty in articulating sound units defined by L1 standards. Moreover, the sub-segmental patterns shared three traits of linguistic relevance. 1) The patterns formed an error continuum along the assumed degree of change in an articulatory feature, and 2) the intermediate type was ambivalent, taking opposite values in the two codebooks. The gradient positioning highlights that the variation extends beyond the segmental level, while the conflicting combination is a literal instance of non-categoricity. In line with this trait being an L2 attribute, intermediate patterns were also the least observed in the L1 reference data. Lastly, 3) the pattern distribution was skewed towards the closest sound in the learner's L1, with more foreign targets incurring greater dispersion. This asymmetry shows that variation at its core arises from L1 phonetic transfer. We therefore claim that code vectors offer an alternative means of evaluating pronunciation gradience, one that can quantify the between-category position of errors.
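The usage comparison and inventory size discussed above can be illustrated with a small Python sketch. It assumes each speaker is given as a flat list of frame-level code indices, builds a relative-frequency vector per speaker, and compares speakers with the Jensen-Shannon divergence. The divergence is a stand-in chosen for this illustration in place of the clustering described in the abstract, and the speaker names and index lists are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List


def usage_vector(codes: List[Hashable]) -> Dict[Hashable, float]:
    """Relative frequency of each code index in one speaker's frames."""
    counts = Counter(codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: n / total for code, n in counts.items()}


def js_divergence(p: Dict[Hashable, float], q: Dict[Hashable, float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical usage, 1 for disjoint."""
    def kl(a, m):
        return sum(pa * math.log2(pa / m[k]) for k, pa in a.items() if pa > 0)

    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


# Hypothetical frame-level code indices for three speakers.
speakers = {
    "l1_a": [1, 2, 3, 4, 1, 2, 3, 4],
    "l1_b": [1, 2, 3, 4, 4, 3, 2, 1],
    "l2_a": [1, 1, 1, 2, 1, 1, 2, 2],
}
vectors = {name: usage_vector(codes) for name, codes in speakers.items()}

# Inventory size: number of distinct codes a speaker actually uses.
inventory = {name: len(vec) for name, vec in vectors.items()}

within_l1 = js_divergence(vectors["l1_a"], vectors["l1_b"])
across = js_divergence(vectors["l1_a"], vectors["l2_a"])
```

Here the two toy L1 speakers use the same four codes equally often, so their divergence is zero, while the toy L2 speaker uses only two codes and sits farther away. Any distance or clustering method could be substituted for the divergence.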

Keywords: Pattern Discovery, SSL (self-supervised learning), Code Vector, L2 pronunciation, Non-Categorical, Sub-Segmental, Error Continuum

Language: eng

URI: https://hdl.handle.net/10371/215767

https://dcollection.snu.ac.kr/common/orgView/000000185558

Files in This Item:

000000185558.pdf 3.46 MB

Appears in Collections:

College of Humanities (인문대학)
- Linguistics (언어학과)
  - Theses (Master's Degree_언어학과)
