Transductive Approach for Compositional Conservatism in Offline Reinforcement Learning

Abstract: Offline reinforcement learning (RL) is a compelling framework for learning optimal policies from past experiences without additional interaction with the environment. Nevertheless, offline RL inevitably faces the problem of distributional shift, where the states and actions encountered during policy execution fall outside the training dataset distribution. A common solution incorporates conservatism into the policy or value function, which safeguards against uncertainties and unknowns. In this paper, we pursue the same objective of conservatism from a different perspective. We propose COmpositional COnservatism with Anchor-seeking (COCOA) for offline RL, an approach that pursues conservatism in a compositional manner on top of the transductive reparameterization [1]. In this reparameterization, the input variable (the state in our case) is decomposed into an anchor and its difference from the original input. COCOA is designed to seek both in-distribution anchors and in-distribution differences by using a learned reverse dynamics model, encouraging conservatism in the compositional input space for the policy or value function. Such compositional conservatism is independent of and agnostic to the prevalent behavioral conservatism in offline RL. Accordingly, we apply COCOA to four state-of-the-art offline RL algorithms and evaluate them on the D4RL benchmark, showing that COCOA generally improves the performance of each algorithm.
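Abstract (translated from Korean): Offline reinforcement learning is an effective methodology for learning an optimal policy from past data alone, without additional interaction with the environment. However, agents trained offline do not perform well on states or actions that are absent from the training dataset distribution. This is known as the distributional shift problem, and the usual remedy is to apply conservatism to the policy or value function in order to guard against uncertain, unknown situations in advance. In this work, we aim to achieve such conservatism from a perspective different from existing ones. We propose COCOA (COmpositional COnservatism with Anchor-seeking), a new approach for offline reinforcement learning. It applies the transductive reparameterization [1] to offline reinforcement learning so that compositional conservatism is achieved. In this reparameterization, the input variable, namely the state, is decomposed into an anchor part and the remainder. Decomposing the state with a learned reverse dynamics model encourages both the anchor and the remainder to be drawn from within the offline data distribution, which pursues conservatism in the compositional input space of the policy or value function. The compositional conservatism of COCOA is independent of the behavioral conservatism predominantly used in offline reinforcement learning and can be combined with it regardless of its type. Accordingly, when COCOA was applied to four recent offline reinforcement learning algorithms based on behavioral conservatism, it generally improved the performance of each algorithm on the D4RL benchmark.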
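
The decomposition described in the abstract, a state split into an anchor and its difference from the original state, can be illustrated with a minimal sketch. This is not the thesis implementation: `toy_reverse_step`, `seek_anchor`, `decompose`, the horizon, and the toy dataset are all hypothetical names and values introduced here for illustration. In particular, the learned reverse dynamics model is replaced by a hand-written stand-in that simply moves the state toward the dataset mean.

```python
import numpy as np


def toy_reverse_step(state, dataset_mean, step_size=0.5):
    """Hypothetical stand-in for a learned reverse dynamics model.

    ASSUMPTION: a real model would be trained on offline transitions;
    here we only move the state part of the way toward the dataset mean.
    """
    return state + step_size * (dataset_mean - state)


def seek_anchor(state, dataset_mean, horizon=3):
    """Roll the (toy) reverse model for `horizon` steps to find an anchor."""
    anchor = np.asarray(state, dtype=float)
    for _ in range(horizon):
        anchor = toy_reverse_step(anchor, dataset_mean)
    return anchor


def decompose(state, anchor):
    """Split a state into (anchor, difference) so that anchor + difference == state."""
    state = np.asarray(state, dtype=float)
    return anchor, state - anchor


# Toy offline dataset of 2-D states (illustrative values only).
dataset = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
dataset_mean = dataset.mean(axis=0)

# A state far from the data, as might occur under distributional shift.
state = np.array([5.0, -3.0])
anchor = seek_anchor(state, dataset_mean)
anchor, delta = decompose(state, anchor)
```

Under these assumptions, a policy or value function would take the pair `(anchor, delta)` as input in place of the raw state. The sketch only shows that the pair reconstructs the original state and that the anchor ends up closer to the data than the state it started from.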

Language: eng

URI: https://hdl.handle.net/10371/210406

https://dcollection.snu.ac.kr/common/orgView/000000183055

Files in This Item:

000000183055.pdf 6.32 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Program in Artificial Intelligence (협동과정-인공지능전공)
  - Theses (Master's Degree_협동과정-인공지능전공)
