Self-attention Based Recurrent Reinforcement Learning for Algorithmic Trading

Abstract: Advances in deep neural networks and their ability to process complex data have driven a significant increase in academic interest in algorithmic trading models built on neural network structures. The use of machine learning in algorithmic trading has two primary objectives.
The first objective is to identify meaningful characteristics that can shed light on the fluctuations observed in the financial market. The second objective is to detect underlying causal relationships within multivariate financial time series data.
The task of extracting valuable features from financial time series to make predictions or explain market movements is challenging due to the inherent volatility and high levels of noise present in such data.
Most algorithmic trading methodologies to date have primarily focused on the process of feature engineering, aimed at directly extracting meaningful features or factors from financial time series data.
The majority of algorithmic trading models currently in use determine the optimal position through supervised learning models, such as predicting direction or price, rather than learning from a profit-maximizing objective function.
This approach not only fails to fully incorporate the direct utility of a given position, but also leads to double error resulting from indirect decision making.
In light of these limitations, reinforcement learning-based algorithmic trading models have emerged as a viable alternative. These models learn the optimal behavior by maximizing the expected reward from observations within a given market environment.
This approach overcomes the limitations of supervised learning-based algorithmic trading models by incorporating the utility of a given position directly into the learning objective.

The present study proposes a novel hybrid model that integrates recurrent reinforcement learning (RRL) and the self-attention mechanism.
The efficacy of the proposed model is rigorously tested using various financial time series data sourced from the stock market.
The model structure is first defined through the identification of its key components, including the environment, reward, state-space, and action space.
The proposed model is built upon RRL, a policy-based model that leverages the time dependency among trading signals generated at different time steps.
RRL generates each trading signal recurrently from the previous signal, takes market observations as inputs to the policy function, and seeks to maximize expected rewards by optimizing a utility function of the returns.
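The recurrence described above can be illustrated with a minimal sketch. It assumes the classic single-layer RRL formulation, in which the position is a `tanh` of the current observation plus the previous position, and it assumes a Sharpe-ratio utility. The function names, the lagged-return features, the cost value, and the synthetic data are all hypothetical and are not taken from the thesis, whose deep model is more elaborate.

```python
import numpy as np

def rrl_signals(features, w, u, b):
    """Generate positions in [-1, 1]; each one depends on the previous position."""
    signals = np.zeros(len(features))
    prev = 0.0  # assumption: the agent starts with no position
    for t, x in enumerate(features):
        prev = np.tanh(x @ w + u * prev + b)  # recurrent dependence on prev
        signals[t] = prev
    return signals

def trading_returns(signals, price_returns, cost=0.0):
    """Per-step reward: previous position times market return, minus turnover cost."""
    prev = np.concatenate(([0.0], signals[:-1]))
    return prev * price_returns - cost * np.abs(signals - prev)

def sharpe_utility(rewards, eps=1e-8):
    """Assumed utility: mean reward divided by its standard deviation."""
    return rewards.mean() / (rewards.std() + eps)

# Synthetic market returns stand in for real data (illustration only).
rng = np.random.default_rng(0)
price_returns = rng.normal(0.0, 0.01, size=200)

# Hypothetical observation: the five most recent returns before each step.
window = 5
features = np.stack(
    [price_returns[t - window:t] for t in range(window, len(price_returns))]
)
target = price_returns[window:]

w = rng.normal(0.0, 1.0, size=window)  # untrained policy weights
signals = rrl_signals(features, w, u=0.5, b=0.0)
rewards = trading_returns(signals, target, cost=0.0005)
utility = sharpe_utility(rewards)
```

In a trained model, `w`, `u`, and `b` would be updated by gradient ascent on `utility`; the sketch only evaluates the utility for fixed, random parameters.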
To enhance the performance of the RRL, the proposed model adds auxiliary neural networks: a supervised learning sub-network for predictive power, a feature extraction sub-network that reconstructs the original input sequence, and a self-attention mechanism that reallocates temporal weights within the latent state variables.
The proposed model is formulated for two cases in this paper: a finite time horizon and an infinite time horizon. Furthermore, it is demonstrated that, under a martingale assumption for short-term fluctuations, the infinite time horizon case allows the model to learn from the same advantage setting as the finite time horizon case.
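As a rough illustration of how self-attention can reallocate temporal weights over a sequence of latent states, the following sketch applies standard single-head scaled dot-product attention. This is the generic mechanism, not necessarily the layer used in the thesis; the dimensions, the random projection matrices, and the absence of a causal mask are assumptions made for the example.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(h, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (T, d) sequence."""
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (T, T) pairwise similarities
    weights = softmax(scores, axis=-1)       # each row sums to 1 over time steps
    return weights @ v, weights

rng = np.random.default_rng(0)
T, d = 6, 4                      # hypothetical sequence length and latent size
h = rng.normal(size=(T, d))      # stand-in for latent state variables
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

context, weights = self_attention(h, w_q, w_k, w_v)
```

Row `t` of `weights` shows how much each time step contributes to the re-weighted state at step `t`. No causal mask is applied here, so every step can attend to every other step in the window.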

In this study, two empirical applications of the proposed model were carried out and the results were analyzed systematically. The first application used the finite time horizon Deep RRL algorithmic trading model to perform intra-day trading on stocks listed on the KOSPI200 index. The experiment was conducted using 40 days of minute-level open, high, low, close, and volume (OHLCV) data for KOSPI200 listed stocks, spanning from March to June 2019.
Stocks that experienced external market interventions, such as volatility interruptions and limited price movements, were excluded from the experiment. To enhance the diversity of the training set, data augmentation was employed to reduce the momentum effect and correlation effect among stocks.
The second application used the infinite time horizon Deep RRL algorithmic trading model to perform daily trading on stocks listed on the S&P 500 index. The experiment was conducted using daily OHLCV data for S&P 500 listed stocks, comprising the largest market cap stocks in individual sectors, ranging from January 2000 to January 2020. All data used in the experiments were divided into a training set, validation set, and test set, and a single model was trained for all stocks in the market using the training set.

In the empirical applications of the proposed Deep RRL model, a comprehensive comparative analysis was performed to evaluate the performance of the model against various other commonly utilized models in algorithmic trading.
The models considered for comparison include long short-term memory (LSTM) and Random Forest, which are typical supervised machine learning models used for predicting direction or price. Also included were A3C, a policy-based reinforcement learning model, and ARIMA, a time series model.
To further verify the efficacy of the proposed model, an ablation study was conducted in the infinite time horizon experiment.
The study assessed the contribution of each additional sub-network to the model's performance. The proposed models outperformed the other models in terms of both nominal return and return on risk.
The results of the ablation study also indicated that the additional structures of the proposed models contribute to improving the performance of the recurrent reinforcement learning model.
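The two evaluation criteria named above can be made concrete with a short sketch. The abstract does not give formulas, so the definitions here are assumptions: nominal return is taken as the compounded return over the test period, and return on risk as the annualized mean return divided by the sample standard deviation (a Sharpe-style ratio with no risk-free rate). The function names and the `periods_per_year` parameter are hypothetical.

```python
import numpy as np

def nominal_return(step_returns):
    """Compounded return over the whole period (assumed definition)."""
    return float(np.prod(1.0 + step_returns) - 1.0)

def return_on_risk(step_returns, periods_per_year=252):
    """Annualized mean return per unit of volatility (assumed definition)."""
    vol = step_returns.std(ddof=1)
    if vol == 0:
        return 0.0  # convention chosen here for a flat return series
    return float(step_returns.mean() / vol * np.sqrt(periods_per_year))

# Made-up per-period strategy returns, for illustration only.
example = np.array([0.01, -0.005, 0.007, 0.002, -0.003])
total = nominal_return(example)
ratio = return_on_risk(example)
```

For minute-level intra-day data, `periods_per_year` would need to reflect the number of trading minutes per year rather than 252 trading days.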
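Abstract (translated from Korean): With the development of deep neural networks, it has become possible to handle complex data, and algorithmic trading based on deep neural network structures has grown beyond being merely one application area of machine learning into a subject of academic study in its own right.
Machine learning-based algorithmic trading has two core aims.
The first is to extract, from financial time series, features that can explain fluctuations in market prices; the second is to identify temporal causal relationships among features in multivariate financial time series.
Because financial time series are highly noisy, it is difficult to extract meaningful predictive features directly from the series themselves.
For this reason, research on algorithmic trading has so far concentrated on feature engineering, which uses domain knowledge to extract or construct feature variables directly from financial time series.
In addition, instead of using the profit maximization of algorithmic trading directly as the learning objective, many approaches have predicted the optimal position indirectly through supervised learning models, such as direction classification models or price regression models.
However, supervised learning approaches do not sufficiently reflect the utility of a position in training, and because decisions are made indirectly from prediction results, they incur a double error with respect to the goal of profit maximization.
To overcome these problems, reinforcement learning-based algorithmic trading models, which learn actions that maximize the expected reward from observations in a given environment, have become a good alternative that addresses the shortcomings of supervised learning-based models.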

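This thesis proposes a hybrid model that combines recurrent reinforcement learning with a self-attention structure, and empirically verifies the usefulness of the proposed model for generating trading signals in the stock market.
First, the network structure of the proposed model is defined, together with the environment, reward, state set, and agent action set required for training.
The proposed model is based on a deep recurrent reinforcement learning structure, a policy gradient model that reflects the temporal dependency among trading signals generated at different time steps.
Deep recurrent reinforcement learning is a policy-based reinforcement learning model that takes the observation at each time step within an episode as input to the policy function, generates trading signals that are determined recurrently from the previous trading signal, and learns the parameters of the policy function so as to maximize a utility function of the returns obtained as rewards from the generated trading signals.
To this structure we add a hybrid neural network that propagates the errors of a prediction model (supervised learning) and a generative model (unsupervised learning), and a self-attention layer that learns the temporal importance of the hidden representation sequence.
In addition, to describe the reinforcement learning process of the proposed model in detail, the model is defined for both the case in which the time horizon is finite and the case in which it is unbounded.
We show that even when the time horizon is unbounded, if preferences over short-term fluctuations are approximated as risk-neutral, that is, if the prior risk-neutral probabilities of the policy for buying and selling are close to one-to-one, the model can learn from the same advantage as the finite time horizon model.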

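Next, two applied experiments are carried out to verify model performance, and their results are analyzed.
The first, an application of the finite-horizon recurrent reinforcement learning algorithmic trading model, applies it to intraday trading of stocks listed on the KOSPI200.
The data used in this experiment consist of minute-level open, close, high, and low prices and volume for KOSPI200 constituents over a total of 40 days from March to June 2019; the experiment was run on the stocks remaining after excluding those subject to external market interventions such as trading halts, volatility interruptions (VI), and price limits.
The data for all stocks were divided into training, validation, and test sets to build a single model that performs intraday trading across all stocks in the market, and data augmentation was used to diversify the training data in order to reduce trend effects and co-movement effects among stocks.
The second, an application of the infinite-horizon recurrent reinforcement learning algorithmic trading model, applies it to daily trading of stocks listed on the S&P 500.
The data used in this experiment cover individual S&P 500 constituents over a total of 5,000 days from January 2000 to November 2019; they were divided into training, validation, and test sets, and a model was built for each individual stock.
Among the companies that existed throughout the period, models were trained and tested individually for the companies with the largest market capitalization, as of the last day of the period, in each of the 11 sectors.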

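In both experiments, model performance is compared against LSTM and Random Forest, which are machine learning-based supervised learning models; A3C, a reinforcement learning model; and ARIMA, a time series model.
In the infinite-horizon experiment, to verify the structural effect of the proposed model, the sub-learning structures that make up the model are removed from it to test whether each component of the proposed hybrid reinforcement learning model contributes to the performance improvement.
As a result, we show that our model achieves higher performance than general machine learning models or basic reinforcement learning models on return-based evaluation metrics such as nominal return and risk-adjusted return.
Furthermore, the results of this model-effect verification show that the self-attention encoder structure introduced on the reinforcement learning input sequence and the hybrid supervised-unsupervised learning model work together to improve the performance of the proposed model.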

Language: eng

URI: https://hdl.handle.net/10371/196336

https://dcollection.snu.ac.kr/common/orgView/000000178359


Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Industrial Engineering (산업공학과)
  - Theses (Ph.D. / Sc.D._산업공학과)
