Spatial Perception of Complex Indoor Scenes using Neural Representation Learning

Abstract: Spatial perception is a fundamental ability necessary for autonomous mobile agents to move robustly and safely in the real world.

While spatial perception has been studied extensively in AI fields such as computer vision and robotics, developing a generic spatial representation capable of learning diverse spatial properties remains a significant challenge.
Conventional spatial perception methods in mobile robots and autonomous vehicles rely on a direct comparison between sensor inputs and a metric map, represented by topographical structures and sparse visual cues, such as landmarks and obstacles.
However, these metric map-based representations have limited scalability for large-scale and dynamic environments, are inefficient in mapping processes, and lack expressiveness.
Thus, more flexible, efficient, and informative spatial representations are required to enhance the spatial ability of mobile cognitive agents.

Recently, progress in self-supervised learning with deep generative models has given rise to a new approach, namely neural scene representation, that discovers rich representations of 3D scenes by converting visual sensory data into abstract descriptions.
Additionally, emerging neural rendering technology has demonstrated that neural scene representation can learn to perceive, interpret, and represent various characteristics of synthetic 3D scenes without any explicit supervision.

However, this approach has focused mainly on high-resolution rendering of a specific single scene or on generalizing across small-scale synthetic environments with only a few objects.
Thus, extending existing neural representations to learn more diverse environments is crucial for developing a generic spatial representation for autonomous mobile agents that perform services in the real world.

This dissertation investigates spatial representations capable of learning a myriad of environmental features in complex indoor scenes.

First, a semantic-augmented scene representation is proposed that fuses geometric features with the corresponding semantic features extracted from objects in the environment.
The proposed method utilizes semantic features to augment sparse geometric visual features, as opposed to previous approaches that describe the environment using only metric representations.

Next, PlaceNet is introduced: a deep generative model that jointly learns, in a self-supervised manner based on neural rendering, the visual information observed in complex indoor scenes and the corresponding position data.
The model is shown to learn a large-scale 3D house dataset successfully and to generate plausible scenes given random observations.

Lastly, TAGS, a multimodal attention mechanism, is presented, which enhances PlaceNet to learn additional relevant spatial contexts by leveraging visual, topographic, and semantic spatial features.
The experimental results show that PlaceNet with TAGS is capable of learning complex indoor scenes, with significant improvements in neural scene rendering performance when compared to a baseline model that disregards multimodality.
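
Spatial perception is a fundamental ability required for robust and safe autonomous movement and operation in the real world.
Although extensive research on spatial perception has been carried out over the past several decades in AI fields, including computer vision and robotics, a general representation for learning the diverse properties inherent in space remains an important open problem.
Currently, mobile robots and autonomous vehicles perceive their surroundings by directly comparing sensor data obtained in real time against a map that records previously surveyed topographical structures and sparse visual cues such as landmarks and waypoints.
However, such static map-based spatial representations scale poorly to the real world, which is a large-scale, dynamic environment; they are inefficient to build; and they are not expressive enough to capture diverse spatial information.
Therefore, improving the spatial perception ability of mobile AI agents requires spatial representations that are more flexible, more efficient, and able to hold richer information.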

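Recently, with advances in self-supervised learning based on deep generative models, a new approach called "neural scene representation" has emerged, which aims to represent 3D scenes by converting image data into abstract representations.
Meanwhile, neural rendering technology, which is emerging in the fields of computer vision and graphics, has shown that a neural scene representation can learn to perceive and represent various characteristics of 3D scenes without explicit supervision.
However, this approach has so far been limited to high-resolution rendering of a specific single scene or to generalizing over small-scale synthetic environments containing a few objects.
To develop a spatial representation suitable for mobile AI agents that perform services in the real world, such neural representations need to be extended so that they can learn the context inherent in more complex environments.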

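This dissertation examines spatial representations capable of learning the numerous features of complex indoor scenes.
First, it proposes a semantic-augmented scene representation that fuses the geometric features of a space with semantic features extracted from objects.
Rather than describing the environment with metric representations alone, as existing spatial perception methods do, the proposed method improves on them by using semantic features to reinforce the sparse geometric visual features they rely on.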

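Next, it introduces PlaceNet, a deep generative model that jointly learns visual and positional information observed in complex indoor scenes in a self-supervised manner based on neural rendering.
Experimental results show that the model successfully learns large-scale 3D indoor scene data and generates suitable scene images when given randomly observed information about its surroundings.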

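Finally, to enhance PlaceNet so that it can additionally learn detailed space-related information, it proposes TAGS, a multimodal attention mechanism that leverages visual, topographic, and semantic spatial features.
Through experimental results, we show that PlaceNet with TAGS can learn even complex indoor scenes and that scene generation performance can be greatly improved compared with a model that does not use it.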

Language: eng

URI: https://hdl.handle.net/10371/193354

https://dcollection.snu.ac.kr/common/orgView/000000174897

Files in This Item:

000000174897.pdf 90.65 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Computer Science and Engineering (컴퓨터공학부)
  - Theses (Ph.D. / Sc.D._컴퓨터공학부)
