Learning-based sound source classification and localization with limited data

Abstract: Identifying the type and position of a sound source is one of the most important problems in acoustics. Inside a real-world building, visual information is blocked, so acoustic information is the only means of identifying sources that can cause critical problems, such as mechanical defects. Traditional approaches to sound source classification and localization rely on classical array processing techniques, which are not applicable in complex real-world structures, where sound does not strictly follow the theory outside carefully designed experiments. We therefore propose a learning-based approach that identifies the type and position of sounds using a single microphone in a real-world building. We treat the task as a joint classification problem in which we predict the exact position of a sound while classifying its type, which is assumed to be one of a set of pre-defined types. The main difficulty is that, while types are readily classified in a supervised learning framework with one-hot encoded labels, it is difficult to predict the exact position of a sound that comes from a position unseen during training. To address this potential discrepancy, we formulate position identification as a zero-shot learning problem, inspired by the human ability to perceive new concepts from previously learned ones. We extract feature representations from the audio data and vectorize the type and position of the sound source as 'type/position-aware attributes', instead of labeling each class with a simple one-hot vector. We then train a promising generative model to bridge the extracted features and the attributes by learning a class-invariant mapping that transfers knowledge from seen to unseen classes through their attributes; specifically, generative adversarial networks are conditioned on the class embeddings. The proposed methods are evaluated on SNU-B36-EX, a real-world indoor noise dataset collected inside a building.
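Korean abstract (translated): Identifying the type and position of a sound is one of the most important problems in acoustics. In particular, when a noise source caused by, for example, a mechanical defect must be identified in a complex building structure, visual information is strictly blocked, so there is no choice but to rely on acoustic information. However, because a real complex structure is not a carefully planned experiment, sound propagation does not follow the theory. Existing approaches that use classical array processing techniques to estimate the type and position of a sound source are therefore necessarily limited in this setting. We thus propose a learning-based approach that estimates the type and position of a sound source using a single microphone in a real building. We treat this as a joint classification problem in which we predict the exact position of the sound while classifying it as one of the pre-defined types. The type of the source is easily classified in a supervised learning framework with one-hot encoded labels; the most problematic part is predicting the exact position of a source located at a position not seen during training. To resolve this potential mismatch between the training and validation sets, we treat source position estimation as a zero-shot learning problem, inspired by the human ability to perceive new concepts from previously learned ones. Instead of labeling each class with a simple one-hot vector, we extract feature representations from the audio data and vectorize the type and position of the source as 'attributes representing type/position'. We then use a proven generative model to learn a class-invariant function that links the extracted audio features to the attributes, transferring information from classes seen during training to unseen classes through those attributes. A class-conditional generative adversarial network is used as the generative model. The proposed method is evaluated on SNU-B36-EX, a real-world noise dataset collected inside a building.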
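
The attribute idea in the abstract can be illustrated with a small sketch. It is not the thesis implementation: the abstract does not specify how the attributes are built, so the encoding below (a one-hot type code concatenated with a normalized position) is an assumption, as are the type names, the position extent, and the helper names `class_attribute` and `nearest_class`. The conditional GAN is also replaced here by a plain nearest-attribute lookup, purely to show why an attribute vector, unlike a one-hot label, can describe a position that never appeared in training.

```python
from math import dist

# Hypothetical label space: these type names and the position extent are
# illustrative assumptions, not values taken from SNU-B36-EX.
SOUND_TYPES = ["type_a", "type_b", "type_c"]
MAX_POSITION = (10.0, 4.0)  # assumed extent: (x in metres, floor index)


def class_attribute(sound_type, position):
    """Concatenate a one-hot type code with the position scaled to [0, 1]."""
    type_part = [1.0 if t == sound_type else 0.0 for t in SOUND_TYPES]
    pos_part = [p / m for p, m in zip(position, MAX_POSITION)]
    return type_part + pos_part


def nearest_class(predicted_attribute, candidate_classes):
    """Return the candidate whose attribute vector is closest (Euclidean)."""
    return min(
        candidate_classes,
        key=lambda c: dist(predicted_attribute, class_attribute(*c)),
    )


# Classes available during training, and one class at an unseen position.
seen = [("type_a", (0.0, 1.0)), ("type_a", (10.0, 1.0))]
unseen = [("type_a", (5.0, 1.0))]

# Stand-in for a model output: an attribute estimate near the unseen class.
estimate = [0.9, 0.1, 0.0, 0.45, 0.25]
prediction = nearest_class(estimate, seen + unseen)
```

Because the unseen position maps to a point between the two seen positions in attribute space, the lookup can select it even though no training example carried that label. A one-hot label for the same class would have no such relationship to the seen classes.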

Language: eng

URI: https://hdl.handle.net/10371/187757

https://dcollection.snu.ac.kr/common/orgView/000000172067

Files in This Item:

000000172067.pdf 3.58 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Naval Architecture and Ocean Engineering (조선해양공학과)
  - Theses (Ph.D. / Sc.D._조선해양공학과)
