Efficient Linear Contextual Bandit Algorithms  with Improved Regret Bounds

Abstract: This thesis contains two proposed efficient algorithms: (i) Doubly Robust
Thompson Sampling (DRTS) and (ii) Hybridization by Randomization (HyRan).
DRTS employs the doubly-robust method used in missing data literature to
Thompson Sampling with contexts (LinTS). A challenging aspect of the bandit
problem is that a stochastic reward is observed only for the chosen arm and
the rewards of other arms remain missing. The dependence of the arm choice
on the past context and reward pairs compounds the complexity of regret
analysis. Different from previous works relying on missing data techniques
[Dimakopoulou et al., 2019, Kim and Paik, 2019], the proposed algorithm is
designed to allow a novel additive regret decomposition leading to an improved
regret bound with the order of \tilde{O}(\phi^{-2}\sqrt{T}) where \phi^2 is the minimum eigenvalue
of the covariance matrix of contexts and T is the time horizon. This is the
first regret bound of LinTS using \phi^{2} without the dimension of the context, d and the regret bound of the proposed algorithm is \tilde{O}(d\sqrt{T}) in many practical
scenarios, improving the bound of LinTS by a factor of \sqrt{d}. A benefit of the
proposed method is that it utilizes all the context data, chosen or not chosen,
thus allowing to circumvent the technical definition of unsaturated arms used
in theoretical analysis of LinTS. Empirical studies show the advantage of the
proposed algorithm over LinTS.
HyRan is a novel bandit algorithm and show that our proposed algorithm
establish the regret bound of \tilde{O}(\sqrt{dT}), which is optimal up to the logarithmic
ifactors. The novelty comes from the two modifications where the first is to
utilize all contexts, both selected and unselected, and the second is to randomize
the contribution to the estimator. These modifications render a novel
decomposition of the cumulative regret into two main additive terms whose
bounds can be derived by employing the structure of the compounding estimator.
While previous algorithms such as SupLinUCB [Chu et al., 2011] have
shown \tilde{O}(\sqrt{dT}) regret, exploiting independence via a phased algorithm, HyRan
is the first to achieve \tilde{O}(\sqrt{dT}) regret keeping the practical advantage without
resorting to generating independent samples. The numerical experiments show
that the practical performance of our proposed algorithm is in line with the
theoretical guarantees.
본 학위논문은 순차적 결정 문제(Sequential decision making problem)를
위한 효율적인 선형 다중 슬롯 머신 알고리즘(Linear Contextual Bandit
Algorithm)을 제안한다. 선형 다중 슬롯 머신 알고리즘은 유한 개의
선택지(Arm)가 주어진 특정 환경 안에서 학습자가 그 선택지의
내용(Context)을 관찰하고 이들 중 보상 (Reward)을 최대화하는 행동을
파악하고 선택하는 방법론이다. 보상은 선택지의 내용과 선형 관계를
가지고 있다. 현재까지 제안된 선형 다중 슬롯 머신 알고리즘은 내용과
보상의 관계를 추정할 때, 선택된 내용과 보상으로만 추정하고 있다.
이는 선택되지 않은 내용들을 관찰만 하고 추정에는 사용할 수 없는
비효율성을 유발한다. 이로 인해 다중 슬롯 머신이 활용되는 뉴스 기사
배치 알고리즘이나 광고 추천 알고리즘이나 모바일 건강관리 시스템
등에서 선택받지 못한 기사, 광고, 건강관리법과 같은 내용이 추정에
사용될 수 없는 비효율성이 발생한다.
본 학위논문에서는 선택받지 못한 내용들도 추정에 활용할 수 있는
새로운 선형 다중 슬롯 머신 알고리즘 두 가지를 제안하였다. 첫째는
결측자료 분석법 중 이중 강건법(Doubly Robust)을 적용하여 관측하지
못한 보상을 유사 보상(Pseudo-reward)으로 대체하면서 선택되지 못한
내용도 추정에 활용할 수 있도록 하였고, 이를 통해 내용의 차원의
제곱근만큼 후회 상한 (Regret bound)를 개선하였다. 둘째는 간단한
랜덤화(Randomization)를 적용하여 선택받지 못한 내용을 활용하는
방법과 선택한 내용만 사용하는 방법을 혼합하여 만든 혼합
추정량(Compound Estimator)을 정의하고, 이 알고리즘이 최적(Optimal
rate)의 후회 상한을 가졌음을 증명하였다. 본 학위논문은 새 알고리즘이
선택받지 못한 내용을 활용하면서 이론적으로 성능이 개선되었음을
증명하였고, 시뮬레이션 데이터에 적용한 결과를 통해서도 기존
알고리즘보다 성능이 개선되었음을 확인하였다.

Language: eng

URI: https://hdl.handle.net/10371/181257

https://dcollection.snu.ac.kr/common/orgView/000000170385

Files in This Item:

000000170385.pdf 2.08 MB

Appears in Collections:

College of Natural Sciences (자연과학대학)
- Dept. of Statistics (통계학과)
  - Theses (Ph.D. / Sc.D._통계학과)

Altmetrics

Item View & Download Count

Show Full Item Record

Find it @ SNU

트윗하기

SNS Share