Learning Linear-Quadratic Regulators via Thompson Sampling with Preconditioned Langevin Dynamics

Authors: 김기훈

Advisor: 양인순

Issue Date: 2023-08

Publisher: Seoul National University Graduate School

Keywords: Linear quadratic regulator ; Thompson sampling ; Langevin dynamics

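Description: Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2023. 8. 양인순.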

Abstract: Thompson sampling (TS) is a widely used approach for addressing the exploration-exploitation trade-off in online learning problems, including reinforcement learning for linear quadratic regulators (LQR). However, the theoretical analysis of TS for learning LQR is often limited to the case of Gaussian noise. Sampling can be performed directly under the further assumption that the unknown system parameters lie in a prespecified compact set, an assumption that appears restrictive. We propose a new TS algorithm for LQR that exploits Langevin dynamics to handle a larger class of problems, including those with non-Gaussian noise. A preconditioner is introduced to generate samples from non-conjugate posterior distributions. Our algorithm can sample parameters from approximate posteriors quickly. It attains a square-root-T-scale expected regret bound, slightly improving on previous results, under minimal assumptions on the prior distribution and the admissible set, and it requires neither a particular initialization technique nor information about the true system parameter. Our regret analysis leverages a nontrivial concentration inequality for the preconditioned Langevin algorithm together with self-normalization techniques. The performance of the algorithm is also demonstrated through numerical experiments.
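
To make the sampling step in the abstract concrete, the following is a minimal sketch of a preconditioned unadjusted Langevin sampler for the parameters of a linear system, written with NumPy. It is not the thesis's algorithm: the function name `preconditioned_langevin`, the preconditioner choice (regularized Gram matrix of the observed state-input vectors), the step size, the iteration count, and the toy system are all assumptions made for illustration. The demo uses unit-variance Gaussian noise only because the posterior is then known in closed form, which makes the sampler easy to check; a non-Gaussian noise model would change only `grad_U`.

```python
import numpy as np

def preconditioned_langevin(grad_U, P, theta0, step, n_steps, rng):
    """Run preconditioned unadjusted Langevin steps; return the last iterate.

    Update (illustrative form, not taken from the thesis):
        theta <- theta - step * P^{-1} grad_U(theta) + sqrt(2 * step) * P^{-1/2} xi
    """
    P_inv = np.linalg.inv(P)
    L = np.linalg.cholesky(P_inv)  # L @ L.T == P_inv, so L @ xi has covariance P_inv
    theta = theta0.copy()
    for _ in range(n_steps):
        noise = L @ rng.standard_normal(theta.shape)
        theta = theta - step * P_inv @ grad_U(theta) + np.sqrt(2.0 * step) * noise
    return theta

# Hypothetical toy system x_{t+1} = Theta^T z_t + w_t with z_t = (x_t, u_t).
rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
B_true = np.array([[0.0], [1.0]])
Theta_true = np.vstack([A_true.T, B_true.T])  # shape (3, 2)

# Collect a trajectory under random inputs and unit-variance Gaussian noise.
x = np.zeros(2)
Z, X_next = [], []
for _ in range(200):
    u = rng.standard_normal(1)
    z = np.concatenate([x, u])
    x = Theta_true.T @ z + rng.standard_normal(2)
    Z.append(z)
    X_next.append(x)
Z, X_next = np.array(Z), np.array(X_next)

# Negative log-posterior U(Theta) = 0.5 * sum_t ||x_{t+1} - Theta^T z_t||^2
#                                   + 0.5 * lam * ||Theta||_F^2  (Gaussian prior).
lam = 1.0
P = lam * np.eye(3) + Z.T @ Z  # assumed preconditioner: regularized Gram matrix

def grad_U(Theta):
    return P @ Theta - Z.T @ X_next

Theta_hat = np.linalg.solve(P, Z.T @ X_next)  # posterior mean (ridge estimate)
Theta_sample = preconditioned_langevin(
    grad_U, P, np.zeros((3, 2)), step=0.1, n_steps=200, rng=rng
)
print(np.round(Theta_sample - Theta_hat, 3))  # deviation of the draw from the mean
```

In a Thompson sampling loop, a draw such as `Theta_sample` would be split back into estimates of the state and input matrices and used to compute a control gain before more data is collected; that step is omitted here. Preconditioning by `P` makes the drift contract toward `Theta_hat` at the same rate in every direction, which is why a single step size works even when the data excites some directions far more than others.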

Language: eng

URI: https://hdl.handle.net/10371/196438

https://dcollection.snu.ac.kr/common/orgView/000000178190

Files in This Item:

000000178190.pdf 3.04 MB

Appears in Collections:

College of Engineering/Engineering Practice School
- Dept. of Electrical and Computer Engineering
  - Theses (Master's Degree, Dept. of Electrical and Computer Engineering)
