End-to-End Multi-Channel Speech Enhancement Using Inter-Channel Time-Restricted Attention on Raw Waveform

Cited 2 times in Web of Science; cited 4 times in Scopus
Authors
Lee, Hyeonseung; Kim, Hyung Yong; Kang, Woo Hyun; Kim, Jeunghun; Kim, Nam Soo

Issue Date
2019-09
Publisher
ISCA (International Speech Communication Association)
Citation
Interspeech 2019, pp. 4285-4289
Abstract
This paper describes a novel waveform-level end-to-end model for multi-channel speech enhancement. The model first extracts sample-level speech embeddings using a channel-wise convolutional neural network (CNN) and compensates the time delays between channels based on these embeddings, yielding time-aligned multi-channel signals. The aligned signals are then given as input to a multi-channel enhancement extension of Wave-U-Net, which directly outputs the estimated clean speech waveform. The whole model is trained to simultaneously minimize a modified mean squared error (MSE), a signal-to-distortion ratio (SDR) cost, and the senone cross-entropy of a back-end acoustic model. Evaluated on the CHiME-4 simulated set, the proposed system outperformed a state-of-the-art generalized eigenvalue (GEV) beamformer in terms of perceptual evaluation of speech quality (PESQ) and SDR, and showed competitive results in short-time objective intelligibility (STOI). Word error rates (WERs) of the system's output on the simulated sets were comparable to those of a bidirectional long short-term memory (BLSTM) GEV beamformer. However, the system showed relatively high WERs on the real sets, achieving a relative error rate reduction (RERR) of 14.3% over the noisy signal on the real evaluation set.
ISSN
2308-457X
URI
https://hdl.handle.net/10371/186227
DOI
https://doi.org/10.21437/Interspeech.2019-2397