Attentional Sampling for Efficient Visual Computing

장형진

Attentional Sampling for Efficient Visual Computing : 효율적 영상처리를 위한 주의집중 샘플링

Authors: 장형진

Advisor: 최진영

Major: College of Engineering, Department of Electrical and Computer Engineering

Issue Date: 2013-02

Publisher: Seoul National University Graduate School

Keywords: attentional sampling scheme ; structured attentional sampling ; empirical attentional sampling ; selective attentional sampling ; tracking failure detection ; speed-up of background subtraction ; complex action recognition

Description: Thesis (Ph.D.) -- Seoul National University Graduate School : Department of Electrical and Computer Engineering, 2013. 2. 최진영.

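Abstract: Computer vision problems begin with sampling data that an image acquisition device has digitized pixel by pixel. Sometimes the raw pixel values, the most basic data, are used directly; in other cases the pixel values are combined into new data with new meaning, which are then sampled and used. Good performance calls for sampling as much data as possible, but the required computation then grows sharply. Conversely, sampling only a minimal amount of data for the sake of computation makes good performance hard to expect. To obtain the best performance at an efficient computational cost, an active sampling concept is therefore needed: finding and sampling only the minimal data sufficient to solve the problem as the image changes or as time passes. To realize active sampling, the process of finding the data that matter for solving the problem is crucial, as is the way sampling is concentrated on the data found. This thesis proposes three different attentional sampling methods: structured attentional sampling, empirical attentional sampling, and selective attentional sampling. Each proposed method offers its own way of applying prior knowledge about the characteristics of the problem to find the important data that require attention, and samples adaptively according to it. The proposed attentional sampling methods were successfully applied to computer vision problems and greatly improved not only computational efficiency but also the performance of each algorithm.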

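The first method, structured attentional sampling, performs sampling according to a pattern structured in advance to fit the characteristics of the problem. This method was applied to detecting object tracking failures by imitating the structure of the human eye. Image pixels were sampled with a log-polar pattern that approximates the distribution of ganglion cells on the human retina, thereby imitating useful properties of the human eye. In an image sampled with the log-polar pattern, the effect of rotation is reduced, whereas horizontal or vertical translation is amplified. This property reduces false alarms for tracking failure caused by pose changes due to rotation and increases true alarms for tracking failure caused by abrupt position changes. Moreover, the foveal predominant property of the log-polar structure sharpens the focused central part (the center of the tracked object) and blurs the periphery (the area outside the tracked object), which helps detect the moment of tracking failure accurately. In addition, each ganglion cell on the retina is mapped to a pixel of the log-polar transformed image, and each pixel's adaptation to the color of the tracked object is modeled with a Gaussian mixture model, in a way similar to how each cell adapts to light. The usefulness of structured attentional sampling for tracking failure detection was verified through various experiments.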

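The second method, empirical attentional sampling, uses previously acquired empirical knowledge for sampling at the current step. The empirical knowledge is modeled as a probability distribution through an empirical learning process. This concept was applied to speed up the background subtraction methods commonly used for moving object detection by applying a pixel-wise selective computation mask. The proposed sampling method is designed so that sampling focuses on regions that require attention, such as foreground regions. The attentional region is represented as a foreground probability map, which is estimated by recursive probability updates using the detection results of the previous frame. The foreground probability map was generated using the temporal, spatial, and frequency properties of the foreground. Using the generated map, randomly scattered sampling, spatially expanding importance sampling, and surprise pixel sampling are carried out in sequence to generate the attentional sampling mask. The efficiency of the proposed method was verified through various experiments. The proposed method sped up existing pixel-wise background subtraction methods by about 6.6 times without degrading detection performance. It also made it possible to detect moving objects in real time in full HD video (1920x1080) using existing background subtraction algorithms.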

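Selective attentional sampling uses prior information about the given data and the objective to select in advance only the important data that are essential to solving the problem, thereby making the solution more efficient. This thesis proposes a method that uses selective sampling to recognize the dances of famous pop songs as performed by ordinary people. Unlike the movements of ballet or rhythmic gymnastics, pop dances generally appear as a sequence of short, complex, and diverse actions that cannot each be given a name. In particular, because there are no fixed constraints on the dance, the same dance is expressed in various ways, depending on the dancer's individuality and freedom rather than on precision of movement. Because of this freedom and diversity of action, and because the actions are long in time, existing action recognition algorithms cannot be applied directly. To represent effectively the flow of actions that are too free to be clearly segmented, and to make it usable by a recognition algorithm, this thesis proposes a new action feature representation and a method of expressing it effectively as low-dimensional data. In addition, for efficient recognition, characteristic spatio-temporal change points of the action are named attentional motion spots, and a method for selecting them automatically is proposed. The spatio-temporal distribution of these feature points is modeled as a mixture of Gaussian distributions, and this model is named the Action Chart. Like a musical score, the Action Chart expresses the temporal occurrence, type, and duration of the important actions in the spatio-temporal flow of the action. Using the Action Chart, a newly built pop dance dataset was recognized efficiently and effectively. To validate the proposed method, each of its component algorithms was verified experimentally to show that each part is necessary, and existing methods for recognizing long and complex actions were implemented directly and compared on the same dataset, verifying that the proposed method is far superior in both recognition performance and computation time. Furthermore, it was shown that the Action Chart can summarize a long dance with performance close to that of a human.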
In many practical computer vision scenarios, information gleaned from previous observations can be used during the sampling process. To achieve good performance with little computation, the samples should cover the domain of the target distribution with as few samples as possible, which calls for active or adaptive sampling. Under an active sampling strategy, sampling can be concentrated on attentional portions, which improves not only sampling efficiency but also the performance of the algorithms. In this thesis, we define three attentional sampling concepts: structured attentional sampling, empirical attentional sampling, and selective attentional sampling. The proposed attentional sampling methods are successfully applied to computer vision problems and achieve dramatic improvements in both performance and computational load.
The structured attentional sampling scheme uses an inherent structure to sample a region of interest densely instead of sampling uniformly over the entire region. This scheme is applied to tracking failure detection by imitating the human visual system. We adopt a sampling structure based on the log-polar transformation, which simulates the structure of the retina. Because the log-polar structure shows invariance to rotational changes and intensifies translational changes, it helps reduce false alarms arising from rotational pose variations and increase true alarms under abrupt translational changes. In addition, the foveal predominant property of the log-polar structure helps detect the moment of tracking failure by amplifying the resolution around the focus (the tracking box center) and blurring the periphery. Each ganglion cell corresponds to a pixel of the log-polar image, and its adaptation is modeled as a Gaussian mixture model. The validity of the structured attentional sampling method is illustrated through various experiments.
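A minimal sketch of the log-polar sampling idea may help make the rotation property concrete. It is not the thesis implementation: the function names, the geometric ring spacing, and the use of a continuous `image_fn(x, y)` in place of a pixel array are all assumptions made for illustration. It shows that samples are dense near the center and that rotating the image about the center by one wedge step only shifts the sampled values along the angle axis.

```python
import math

def log_polar_grid(cx, cy, r_min, r_max, n_rings, n_wedges):
    """Return sample coordinates indexed as grid[ring][wedge] (hypothetical helper)."""
    grid = []
    for i in range(n_rings):
        # Radii grow geometrically, so rings crowd together near the centre.
        r = r_min * (r_max / r_min) ** (i / (n_rings - 1))
        ring = []
        for j in range(n_wedges):
            theta = 2.0 * math.pi * j / n_wedges
            ring.append((cx + r * math.cos(theta), cy + r * math.sin(theta)))
        grid.append(ring)
    return grid

def sample(image_fn, grid):
    """Evaluate the image at every grid point."""
    return [[image_fn(x, y) for (x, y) in ring] for ring in grid]

def rotate_fn(image_fn, cx, cy, alpha):
    """Image rotated counter-clockwise by alpha radians about (cx, cy)."""
    def rotated(x, y):
        dx, dy = x - cx, y - cy
        # Look up the source point by rotating the query point back by alpha.
        sx = cx + dx * math.cos(alpha) + dy * math.sin(alpha)
        sy = cy - dx * math.sin(alpha) + dy * math.cos(alpha)
        return image_fn(sx, sy)
    return rotated

if __name__ == "__main__":
    n_rings, n_wedges, shift = 6, 16, 3
    grid = log_polar_grid(0.0, 0.0, 1.0, 32.0, n_rings, n_wedges)
    image = lambda x, y: 3.0 * x + y * y  # assumed toy image
    original = sample(image, grid)
    turned = sample(rotate_fn(image, 0.0, 0.0, 2.0 * math.pi * shift / n_wedges), grid)
    # A rotation by `shift` wedges should equal a circular shift of each ring.
    worst = max(abs(turned[i][j] - original[i][(j - shift) % n_wedges])
                for i in range(n_rings) for j in range(n_wedges))
    print("largest mismatch after undoing the shift:", worst)
```

Under these assumptions a rotation leaves the set of values in each ring unchanged and only reorders them, which is one way to read the reduced sensitivity to rotation described above; a translation of the image has no such simple effect on the samples.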
The empirical attentional sampling scheme uses previously obtained empirical knowledge when sampling at the current time step. The empirical knowledge is modeled as a probability distribution function through an empirical learning process. This scheme is applied to mask generation in order to speed up conventional background subtraction algorithms for moving object detection. The proposed sampling strategy is designed to focus on attentional regions such as foreground regions. The attentional region is estimated recursively and probabilistically from the detection results of the previous frame. We generate a foreground probability map from the temporal, spatial, and frequency properties of the foreground. Based on this map, randomly scattered sampling, spatially expanding importance sampling, and surprise pixel sampling are performed sequentially to build the attentional sampling mask. The efficiency of the proposed empirical attentional sampling method is shown through various experiments. The proposed masking method speeds up pixel-wise background subtraction methods by approximately 6.6 times without deteriorating detection performance. Real-time detection on Full HD video is also achieved with various conventional background subtraction algorithms combined with the proposed sampling scheme.
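The mask-generation idea can be sketched roughly as follows. This is an illustration only, not the thesis algorithm: the exponential blending rule, the parameter names (`alpha`, `scatter_rate`, `radius`), and the way importance sampling is drawn are assumptions, and the surprise pixel step is left out because the abstract does not define it. The sketch keeps a foreground probability map updated recursively from the previous detection, then marks a sparse random set of pixels plus the neighbourhoods of pixels drawn in proportion to their foreground probability.

```python
import random

def update_probability_map(prob, detection, alpha=0.3):
    """Recursive update (assumed form): blend the old map with the latest detection."""
    return [[(1.0 - alpha) * p + alpha * d for p, d in zip(prow, drow)]
            for prow, drow in zip(prob, detection)]

def build_mask(prob, scatter_rate=0.05, radius=1, rng=None):
    """Return a boolean mask of pixels on which background subtraction would run."""
    if rng is None:
        rng = random.Random(0)
    h, w = len(prob), len(prob[0])
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Randomly scattered sampling: sparse uniform probes for new objects.
            scattered = rng.random() < scatter_rate
            # Importance sampling: pick pixels in proportion to foreground probability.
            important = rng.random() < prob[y][x]
            if important:
                # Spatial expansion: also cover the neighbourhood of the pixel.
                for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                        mask[ny][nx] = True
            elif scattered:
                mask[y][x] = True
    return mask

if __name__ == "__main__":
    prob = [[0.0] * 8 for _ in range(8)]
    detection = [[1.0 if 3 <= x <= 4 and 3 <= y <= 4 else 0.0 for x in range(8)]
                 for y in range(8)]
    for _ in range(5):  # feed the same detection for a few frames
        prob = update_probability_map(prob, detection)
    mask = build_mask(prob)
    active = sum(v for row in mask for v in row)
    print("pixels selected for processing:", active, "of", 8 * 8)
```

Any speed-up in such a scheme comes from running the per-pixel background model only where the mask is set, so the cost scales with the number of selected pixels rather than with the frame size.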
The selective attentional sampling scheme does not use the whole data but selects only the data important enough to achieve a given classification objective. This scheme is applied to the recognition of pop dances. Pop dances are action streams consisting of diverse actions that cannot be simply annotated. Conventional methods cannot be applied directly to such "unannotatable" action streams because of their complexity and length. To describe an unannotatable action stream effectively, the proposed method employs a novel mid-level "feature flow" with a low-dimensional embedding. For the purpose of recognition, "attentional motion spots" holding important information about the sequence are selected automatically. The feature values and temporal locations of the attentional motion spots are modeled with Gaussian mixtures as "Action Charts." An Action Chart describes the characteristics of an action stream in the spatio-temporal domain. Using the abstract information in the Action Charts, the proposed method recognizes pop dance sequences efficiently. To demonstrate the validity of the proposed method, we compare it against state-of-the-art methods on a newly built SNU Pop-Dance dataset containing long action streams composed of diverse actions.
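One possible reading of how Gaussian mixtures over attentional motion spots could drive recognition is sketched below. Everything specific here is assumed for illustration: the one-dimensional feature value, the axis-aligned Gaussians, the chart parameters, and the maximum-likelihood decision rule are not taken from the thesis. Each hypothetical chart is a list of mixture components over (time, feature), and a sequence is assigned to the chart under which its spots are most likely.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def chart_log_likelihood(spots, chart):
    """Log-likelihood of spots under one chart.

    spots: list of (time, feature) pairs.
    chart: list of (weight, t_mean, t_var, f_mean, f_var) components (assumed layout).
    """
    total = 0.0
    for t, f in spots:
        density = sum(w * gaussian_pdf(t, tm, tv) * gaussian_pdf(f, fm, fv)
                      for w, tm, tv, fm, fv in chart)
        total += math.log(density + 1e-300)  # guard against log(0)
    return total

def classify(spots, charts):
    """Return the name of the chart that best explains the spots."""
    return max(charts, key=lambda name: chart_log_likelihood(spots, charts[name]))

if __name__ == "__main__":
    # Two made-up charts with one and two components respectively.
    charts = {
        "dance_a": [(1.0, 1.0, 0.25, 0.0, 1.0)],
        "dance_b": [(0.5, 5.0, 0.25, 3.0, 1.0), (0.5, 8.0, 0.25, -2.0, 1.0)],
    }
    observed = [(5.1, 2.7), (7.8, -1.9)]
    print("best matching chart:", classify(observed, charts))
```

Because only a handful of spots per sequence enter the likelihood, the cost of such a comparison depends on the number of selected spots and mixture components, not on the length of the video.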

Language: English

URI: https://hdl.handle.net/10371/118909

Files in This Item:

000000009870.pdf 5.47 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Electrical and Computer Engineering (전기·정보공학부)
  - Theses (Ph.D. / Sc.D._전기·정보공학부)
