Natural Audio Captioning

Chris Dongjoo Kim

Natural Audio Captioning : Audio Captioning (오디오 캡셔닝)

DC Field	Value	Language
dc.contributor.advisor	Gunhee Kim	-
dc.contributor.author	Chris Dongjoo Kim	-
dc.date.accessioned	2021-11-30T02:40:15Z	-
dc.date.available	2021-11-30T02:40:15Z	-
dc.date.issued	2021-02	-
dc.identifier.other	000000163979	-
dc.identifier.uri	https://hdl.handle.net/10371/175417	-
dc.identifier.uri	https://dcollection.snu.ac.kr/common/orgView/000000163979	ko_KR
dc.description	Thesis (Master's) -- Seoul National University Graduate School : College of Engineering, Department of Computer Science and Engineering, 2021. 2. Gunhee Kim.	-
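dc.description.abstract	This thesis is the first to address the topic of audio captioning. Describing sounds that occur in the wild in human language is an important problem that has not yet been tackled. Understanding and solving a new problem requires data, the most important ingredient for learning. We construct a new dataset, AudioCaps, by crowdsourcing about 93 thousand descriptions of natural sounds. Through thorough experiments, we demonstrate the high quality of the AudioCaps dataset and carefully examine which input representations suit the new problem of audio captioning. In addition, building on characteristics we observe in audio occurring in the wild, we introduce two new techniques. First, the top-down multi-scale encoder lets the model take the fine-grained details of the audio into account during learning; second, aligned semantic attention encourages the audio and the given semantic cues to be well aligned. Using these two techniques, we achieve new state-of-the-art performance.	-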
dc.description.abstract	We explore the problem of audio captioning: generating natural language descriptions for any kind of audio in the wild, which has been surprisingly unexplored in previous research. We contribute a large-scale dataset of 93K audio clips with human-written text pairs collected via crowdsourcing on the AudioSet dataset. Our thorough empirical studies not only show that our collected captions are indeed faithful to audio inputs but also discover what forms of audio representation and captioning models are effective for audio captioning. From extensive experiments, we also propose two novel components that help improve audio captioning performance: the top-down multi-scale encoder and aligned semantic attention.	-
dc.description.tableofcontents	Abstract i; Contents ii; List of Figures iv; List of Tables vi; Chapter 1 Introduction 1; Chapter 2 Related Works 4; Chapter 3 The Audio Captioning Dataset 7; 3.0.1 AudioSet Tailoring 7; 3.0.2 Audio Annotation 9; 3.0.3 Post-processing 11; 3.0.4 Dataset Comparison 11; Chapter 4 Approach 18; 4.0.1 Top-down Multi-scale Encoder 19; 4.0.2 Aligned Semantic Attention 20; Chapter 5 Experiments 23; 5.0.1 Experimental Setting 23; 5.0.2 Baselines 25; 5.0.3 Results 26; Chapter 6 Conclusion 34; Chapter 7 Appendix 42; Korean Abstract (요약) 45; Acknowledgements 46	-
dc.format.extent	vi, 46	-
dc.language.iso	eng	-
dc.publisher	Seoul National University Graduate School	-
dc.subject	Neural Network	-
dc.subject	Deep Learning	-
dc.subject	Natural Language Processing	-
dc.subject	Audio in-the-wild	-
dc.subject	Captioning	-
dc.subject	Attention Module	-
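dc.subject	Neural Network (신경망)	-
dc.subject	Deep Learning (딥러닝)	-
dc.subject	Natural Language Processing (자연어 처리)	-
dc.subject	Natural Sound Modeling (자연소리 모델링)	-
dc.subject	Captioning (캡셔닝)	-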
dc.subject.ddc	621.39	-
dc.title	Natural Audio Captioning	-
dc.title.alternative	Audio Captioning (오디오 캡셔닝)	-
dc.type	Thesis	-
dc.type	Dissertation	-
dc.contributor.AlternativeAuthor	김동주	-
dc.contributor.department	College of Engineering, Department of Computer Science and Engineering	-
dc.description.degree	Master	-
dc.date.awarded	2021-02	-
dc.identifier.uci	I804:11032-000000163979	-
dc.identifier.holdings	000000000044▲000000000050▲000000163979▲	-

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Computer Science and Engineering (컴퓨터공학부)
  - Theses (Master's Degree, Dept. of Computer Science and Engineering)

Files in This Item:

000000163979.pdf 18.31 MB