2D Human Pose Estimation and Tracking with spatial and temporal features

Abstract: 2D human pose estimation and tracking aim to detect the locations of a person's body parts and their trajectories. A pose is composed of a person's parts, where a part is an element of the body such as an arm, a leg, or the head. Pose estimation is used in both industry and academia. For example, a home training system can detect the user's pose and help the user correct their posture. In human action recognition research, pose information can also serve as useful supplementary input.
To apply human pose research to real-world systems, a model must be accurate and also light enough to run in real time. In this paper, we focus on improving accuracy, and we consider how to use spatial and temporal features to achieve it.
Spatial features are characteristic values, such as textures, patterns, and postures, that can be extracted from images. We make better use of spatial features by dividing them into local and global features. A global feature is likely to cover a large number of parts, while a local feature focuses on a relatively small number of parts.
First, we propose a structure that uses global and local features together to improve performance. The global network focuses on learning the global feature, while the local network learns varied regional information from images. The local network then refines the pose detected by the global network. To demonstrate the effectiveness of the proposed method, experiments were conducted on the Leeds Sports Pose (LSP) dataset, one of the single-person pose estimation datasets.
Second, we define rare poses using the global feature and address the imbalance among poses. The poses are first classified using the location information of the entire pose. Experiments show that the poses cluster around certain poses (standing poses, upper-body poses, etc.) and that a clear imbalance exists among them. To resolve the imbalance, we propose methods such as a weighted loss and the synthesis of rare pose data. Experiments were conducted on the MPII and COCO datasets, which are widely used in multi-person pose estimation.
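The weighted loss mentioned above can be illustrated with a minimal sketch. The abstract does not give the weighting scheme, so the inverse-frequency rule used here is an assumption, and the function names and cluster labels ("standing", "rare") are hypothetical. The sketch only shows how a rare pose class can be made to contribute more to a training loss than a common one.

```python
from collections import Counter


def inverse_frequency_weights(cluster_ids):
    """Map each pose-cluster id to a loss weight proportional to 1 / frequency.

    Assumption (not from the thesis): weight_c = N / (K * count_c), where N is
    the number of samples and K the number of clusters. With this scaling the
    average weight over all samples is 1.0.
    """
    counts = Counter(cluster_ids)
    n_samples = len(cluster_ids)
    n_clusters = len(counts)
    return {c: n_samples / (n_clusters * k) for c, k in counts.items()}


def weighted_mean_loss(per_sample_losses, cluster_ids, weights):
    """Average the per-sample losses after scaling each by its cluster weight."""
    total = sum(weights[c] * loss for loss, c in zip(per_sample_losses, cluster_ids))
    return total / len(per_sample_losses)


# Hypothetical toy data: 8 common poses and 2 rare poses.
cluster_ids = ["standing"] * 8 + ["rare"] * 2
weights = inverse_frequency_weights(cluster_ids)
```

In this toy case each rare sample is weighted four times as heavily as each common sample, while the mean weight stays at 1.0, so the overall loss scale is unchanged.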
Temporal features refer to how poses change over time. Using temporal information is generally recommended when analyzing objects in video. Third, therefore, we estimate and track poses with a map that expresses the change in a person's movement. The network learns the spatial and temporal maps together so that each benefits the other. Experiments were conducted on the multi-person pose tracking datasets PoseTrack 2017 and 2018.
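To make the idea of a map that expresses movement more concrete, the sketch below rasterizes the straight path of a single joint between two consecutive frames onto a grid. This is only an illustration under stated assumptions: the abstract does not define how the temporal map is constructed, and the function name, the binary encoding, and the integer pixel coordinates are all hypothetical.

```python
def movement_map(start, end, height, width):
    """Mark the grid cells a joint passes through between two frames.

    start, end: (x, y) integer pixel positions of the same joint in frame t
    and frame t+1. Returns a height x width grid (list of lists) holding 1.0
    on the straight path between the two positions and 0.0 elsewhere.
    Illustrative encoding only, not the thesis' definition.
    """
    (x0, y0), (x1, y1) = start, end
    grid = [[0.0] * width for _ in range(height)]
    # Sample one point per pixel along the longer axis (at least one step).
    steps = max(abs(x1 - x0), abs(y1 - y0), 1)
    for i in range(steps + 1):
        t = i / steps
        x = round(x0 + t * (x1 - x0))
        y = round(y0 + t * (y1 - y0))
        if 0 <= x < width and 0 <= y < height:
            grid[y][x] = 1.0
    return grid


# Hypothetical example: a joint moves three pixels to the right.
moving = movement_map((1, 1), (4, 1), height=6, width=6)
still = movement_map((2, 3), (2, 3), height=6, width=6)
```

A network trained on such a map alongside ordinary per-joint heatmaps would see where each joint is and where it is heading, which is the kind of spatial-temporal pairing the paragraph describes.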
Although the three proposed methods address different issues, they can be used together. For example, a new top-down structure can have two parallel deconvolution branches, one for the spatial map (Heatmap) and one for the temporal map (TML). The rare pose data augmentation and the local network can additionally be applied to increase performance. Adopting the three methods together can thus improve performance and makes the approach more extensible in the pose estimation field.
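Research on detecting human poses in 2D images aims to detect the locations of a person's parts. A pose is composed of a person's parts, and a part is an element of the body such as an arm, a leg, or the head. Human pose information can be used in a variety of fields. In human action recognition research, pose information also serves as a very good input feature.
To apply human pose detection research to real systems, a model needs high accuracy, real-time operation, and a footprint light enough to run on a variety of devices. This paper focuses on improving accuracy. To achieve high accuracy, we considered how to make use of feature values, and we addressed the problem using spatial and temporal features.
Spatial features represent characteristics such as a person's texture and shape. We approached the problem by dividing spatial features into global features, which contain many parts, and local features, which contain few parts. First, we focused on improving performance by using global and local features together. We designed a network that focuses on learning the global feature and a local network that can learn local information of various forms. The local network refines the pose detected by the global network once more. To demonstrate the effectiveness of the proposed method, experiments were conducted on the Leeds sports dataset (LSP), one of the single-person pose estimation datasets.
Second, we studied improving performance by detecting rare poses from global information and resolving the imbalance among poses. We first classified the poses in the pose data using the location information of the entire pose. The experiments revealed that poses are distributed around certain poses (standing poses, upper-body-only poses, etc.) and that an imbalance exists among poses. To resolve this imbalance, we proposed methods such as a weighted loss and the generation of rare pose data. To demonstrate the effectiveness of the proposed methods, experiments were conducted on MPII and COCO, which are widely used multi-person pose estimation datasets.
Temporal features refer to the change in movement over time. Using temporal information is beneficial when analyzing objects in video. Third, therefore, we tracked poses by expressing the change in a person's movement as a map. We proposed a network that learns this map together with the spatial features of the pose so that the two produce a synergy. To demonstrate the effectiveness of the proposed method, experiments were conducted on PoseTrack 2017 and 2018, which are multi-person pose tracking datasets.
In this paper, we proposed methods that improve pose estimation performance using spatial and temporal features. Although they solve different problems, they can also be combined to solve a problem together. For example, in a top-down network structure, a deconvolution network with a parallel structure that learns the Heatmap and the TML separately can be proposed. The local network and rare pose data augmentation can also be added to improve Heatmap performance. Combining the proposed methods in this way can lead to methods that further improve pose estimation performance.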

Language: eng

URI: https://hdl.handle.net/10371/175870

https://dcollection.snu.ac.kr/common/orgView/000000164674

Files in This Item:

000000164674.pdf 12.65 MB

Appears in Collections:

Graduate School of Convergence Science and Technology (융합과학기술대학원)
- Dept. of Transdisciplinary Studies(융합과학부)
  - Theses (Ph.D. / Sc.D._융합과학부)
