Policy Gradient in Bounded Continuous Action Space using Logitnormal distribution

Abstract: Reinforcement learning has been intensively researched and has solved challenging problems in areas such as robotics, the game of Go, and other games. Policy gradient methods are among the most actively studied reinforcement learning algorithms. When the action space is continuous, the policy distribution is commonly assumed to be Gaussian. An action sampled from the policy distribution is sent to the environment, and the environment returns a reward that is used for training. However, an action that falls outside the defined bounds is clipped before being passed to the environment. In this process a boundary effect occurs: the executed action differs from the sampled one, so the returned reward is distorted. This is a problem because it misleads the algorithm and disrupts training. Researchers have previously suggested using the Beta distribution, but this introduces other problems.
In this paper, we introduce the Logitnormal distribution as a substitute and use it to address the boundary effect. We compare the performance of the Logitnormal, Beta, and Gaussian distributions. Experiments are conducted in the MuJoCo simulator on continuous action spaces. The results show that policy gradient using the Logitnormal distribution outperformed the methods using the other distributions under both Proximal Policy Optimization (PPO) and Trust Region Policy Optimization (TRPO). Unlike the Beta distribution, the Logitnormal distribution offers generality and efficiency, so it does not share the Beta distribution's drawbacks and shows potential to become a major policy distribution in reinforcement learning.
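Abstract (translated from the Korean): Reinforcement learning is a very active research area and has so far solved a variety of challenging problems in robotics, Go, games, and other domains. Policy gradient reinforcement learning is the most actively studied family of algorithms among them. When the given space is continuous, policy gradient methods commonly assume that the policy distribution is a normal distribution. The action drawn from the policy distribution is passed to the environment and executed, and the resulting reward is used for training. However, an action that exceeds the bounded space is forcibly clipped before being passed to the environment. In this process a boundary effect occurs: the sample generated from the distribution is executed in altered form, and the reward is received for that altered action. The boundary effect causes the algorithm to misjudge the gradient and hinders training, so it needs to be resolved. The Beta distribution was previously proposed as a solution, but it has some drawbacks.
In this paper, we introduce the logit-normal distribution as a replacement for the normal distribution and use it to resolve the boundary effect. Through experiments in MuJoCo continuous-space simulations, we compare its properties and performance with those of the normal and Beta distributions. In the experiments, policy gradient reinforcement learning with the logit-normal distribution outperformed the normal distribution under both Proximal Policy Optimization (PPO) and Trust Region Policy Optimization (TRPO). Unlike the Beta distribution, the logit-normal distribution has generality and efficiency and therefore does not share its drawbacks, which shows its potential to become the main probability distribution in reinforcement learning.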
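
The boundary effect described in the abstract can be illustrated with a minimal NumPy sketch. This is not the thesis's implementation: the function names, the rescaling to an interval [low, high], and the parameter values are assumptions made for illustration. The sketch draws a Gaussian sample, squashes it through the logistic function so that it always lies inside the bounds, and computes the log-density of the resulting action with the change-of-variables correction that a policy gradient update would need.

```python
import numpy as np


def sample_logit_normal(mu, sigma, low, high, rng, size=None):
    """Draw bounded actions: Gaussian sample -> sigmoid -> rescale to (low, high)."""
    z = rng.normal(mu, sigma, size=size)
    u = 1.0 / (1.0 + np.exp(-z))          # logistic squashing into (0, 1)
    return low + (high - low) * u


def logit_normal_log_prob(action, mu, sigma, low, high):
    """Log-density of an action in (low, high) under the rescaled logit-normal."""
    u = (action - low) / (high - low)      # map back to (0, 1)
    z = np.log(u) - np.log1p(-u)           # logit(u)
    log_gauss = (-0.5 * ((z - mu) / sigma) ** 2
                 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi))
    # Change of variables: da/dz = (high - low) * u * (1 - u)
    log_jacobian = np.log(high - low) + np.log(u) + np.log1p(-u)
    return log_gauss - log_jacobian


rng = np.random.default_rng(0)
low, high = -1.0, 1.0                      # hypothetical action bounds

# Gaussian policy: some samples leave the bounds and would be clipped.
gauss_actions = rng.normal(0.8, 0.5, size=10_000)
clipped_fraction = np.mean((gauss_actions < low) | (gauss_actions > high))

# Logit-normal policy: every sample already lies inside the bounds.
ln_actions = sample_logit_normal(0.0, 1.0, low, high, rng, size=10_000)
print("fraction of Gaussian samples clipped:", clipped_fraction)
print("logit-normal range:", ln_actions.min(), ln_actions.max())
```

Because no sample is altered before execution, the log-probability used in the gradient estimate corresponds to the action the environment actually received, which is the mismatch that clipping introduces for a Gaussian policy.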

Language: eng

URI: https://hdl.handle.net/10371/175421

https://dcollection.snu.ac.kr/common/orgView/000000164274

Files in This Item:

000000164274.pdf 0.82 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Computer Science and Engineering (컴퓨터공학부)
  - Theses (Master's Degree_컴퓨터공학부)
