Deep learning-based stock price clustering using image encoder for time series

Abstract: Cluster analysis, which groups data by similarity, is actively studied in many academic fields. Such studies examine the characteristics shared within each cluster and the differences across clusters. In the stock market, clustering of stock prices is used for a range of objectives, including future price and direction prediction, algorithmic trading, investment recommendation systems, portfolio management, and outlier detection. The traditional approach typically reduces the high dimensionality of stock data with principal component analysis and then forms clusters with k-Means or hierarchical clustering.

In this study, we encoded stock prices as images using the Gramian Angular Field and used these images as input data. The first proposed model employed Deep Clustering, which is actively researched in computer vision, to jointly optimize dimensionality reduction and clustering. We computed a reconstruction loss from a CNN-AutoEncoder and obtained the probability that each data point belongs to each cluster using a Student's t-distribution kernel function. The normalized distribution, obtained by dividing this probability by the cluster size, was defined as an auxiliary target distribution and treated as the true label. This allowed us to train the cluster analysis as if it were supervised learning, even though it is unsupervised in nature. We then measured the difference between the two probability distributions with the Kullback-Leibler divergence and defined it as the clustering loss. The final loss used to train the model was the sum of the reconstruction loss and the clustering loss. The second model reduced the dimension of the image with a CNN-AutoEncoder and clustered the result with the k-Means algorithm.
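
The clustering loss described above can be sketched in plain Python as follows. This is a minimal illustration, not the thesis implementation: it assumes the target distribution follows the common Deep Embedded Clustering form, in which the soft assignment is squared before being divided by the soft cluster size (the abstract only states the division by cluster size), and it assumes the divergence is taken as KL(P || Q). The function names, the `alpha` parameter, and the toy embeddings are hypothetical.

```python
import math


def soft_assign(z, centroids, alpha=1.0):
    """Soft assignment q[i][j] with a Student's t kernel.

    q[i][j] is proportional to (1 + ||z_i - mu_j||^2 / alpha)^(-(alpha + 1) / 2),
    normalized so that each row sums to 1.
    """
    q = []
    for zi in z:
        row = []
        for mu in centroids:
            d2 = sum((a - b) ** 2 for a, b in zip(zi, mu))
            row.append((1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        q.append([v / total for v in row])
    return q


def target_distribution(q):
    """Auxiliary target p[i][j]: q squared, divided by soft cluster size, renormalized.

    Squaring q is an assumption borrowed from Deep Embedded Clustering.
    """
    k = len(q[0])
    sizes = [sum(row[j] for row in q) for j in range(k)]  # soft cluster sizes
    p = []
    for row in q:
        weights = [row[j] ** 2 / sizes[j] for j in range(k)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


def clustering_loss(p, q):
    """KL(P || Q) summed over all points and clusters (direction is an assumption)."""
    return sum(
        pij * math.log(pij / qij)
        for p_row, q_row in zip(p, q)
        for pij, qij in zip(p_row, q_row)
        if pij > 0.0
    )


def total_loss(reconstruction_loss, clust_loss):
    """Final loss as the plain sum stated in the abstract."""
    return reconstruction_loss + clust_loss


# Toy latent embeddings and centroids (made-up values for illustration only).
z = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
centroids = [[0.0, 0.0], [5.0, 5.0]]
q = soft_assign(z, centroids)
p = target_distribution(q)
loss = total_loss(reconstruction_loss=0.25, clust_loss=clustering_loss(p, q))
```

In an actual training loop the embeddings `z` would come from the encoder half of the autoencoder and the loss would be minimized by gradient descent; that part is omitted here.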

The data used in the experiment covered approximately 500 stocks from the S&P 500 over 960 business days, from late 2016 to mid-2020. The proposed models were compared with traditional clustering methods on four validation metrics. The proposed models extracted similarities from the training data to form highly associated clusters, which resulted in better performance on future validation data. We ran a paired-sample t-test and found a significant difference in the validation metrics, especially the one based on the correlation coefficient.
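
As a rough illustration of the paired-sample t-test mentioned above, the sketch below computes the t statistic for two sets of paired metric scores. It is an assumption-laden example: the function name and the sample scores are hypothetical, the abstract does not say how the pairs were formed, and the p-value lookup against a t-distribution with `df` degrees of freedom is left out.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, df) for a paired-sample t-test on two equal-length score lists."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0.0:
        raise ValueError("differences have zero variance")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical validation-metric scores for a proposed model and a baseline.
proposed = [1.0, 2.0, 3.0, 4.0]
baseline = [0.0, 1.0, 2.0, 2.0]
t_stat, df = paired_t_statistic(proposed, baseline)
```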

We found that imbalanced cluster sizes were a problem when applying traditional clustering algorithms to stock prices. We therefore defined the cluster size ratio as the number of data points in the largest cluster divided by the number in the smallest cluster, and used it to compare the proposed model with the alternative clustering models. Owing to the normalization effect of the auxiliary target distribution, the number of stocks per cluster was more balanced than in the alternative models.
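
The cluster size ratio defined above is simple to compute from a list of cluster labels. The sketch below is illustrative only; the function name and the sample labels are hypothetical, and clusters that received no data points do not appear in the label list and are therefore not counted.

```python
from collections import Counter


def cluster_size_ratio(labels):
    """Size of the largest cluster divided by the size of the smallest one."""
    if not labels:
        raise ValueError("labels must not be empty")
    sizes = Counter(labels).values()
    return max(sizes) / min(sizes)


# Hypothetical cluster labels for six stocks.
labels = [0, 0, 0, 1, 1, 2]
ratio = cluster_size_ratio(labels)
```

A ratio of 1 means perfectly balanced clusters; larger values indicate stronger imbalance.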

Finally, we constructed a portfolio by sampling stocks from within the clusters. We obtained the optimal portfolio weights by solving the tangency portfolio problem. The portfolio was constructed by holding the selected stocks at these optimal weights, and its return was then compared with those of the alternative models.
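
For reference, the textbook unconstrained tangency portfolio has weights proportional to the inverse covariance matrix applied to the excess returns, rescaled to sum to 1. The sketch below implements that closed form with a small Gaussian elimination solver. It is an assumption that the thesis uses this unconstrained form (the abstract does not state constraints such as no short selling), and the function names, risk-free rate, and input values are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def tangency_weights(mean_returns, cov, risk_free=0.0):
    """Weights proportional to cov^{-1} (mu - r_f), normalized to sum to 1."""
    excess = [mu - risk_free for mu in mean_returns]
    raw = solve_linear(cov, excess)
    total = sum(raw)
    if abs(total) < 1e-12:
        raise ValueError("weights cannot be normalized")
    return [w / total for w in raw]


# Made-up expected returns and covariance for two sampled stocks.
mu = [0.10, 0.05]
cov = [[0.04, 0.0], [0.0, 0.01]]
weights = tangency_weights(mu, cov)
```

Negative weights (short positions) are possible in this form; a long-only version would need a constrained optimizer instead.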
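
In many academic fields, cluster analysis, which partitions data into groups based on similarity, is actively studied, with a variety of applications that analyze the similarities and characteristics within each cluster and examine the differences from other clusters. Cluster analysis is also used in the stock market, where stock prices are clustered for future price and direction prediction, algorithmic trading, investment recommendation systems, portfolio management, and outlier detection. Traditional methods generally reduce high-dimensional stock data with techniques such as principal component analysis (PCA) and then form clusters with methods such as k-Means and hierarchical clustering.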

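In this study, the input data consist of images obtained by encoding stock prices with the Gramian Angular Field. As the first model, we use Deep Clustering, which is actively studied in computer vision, to form clusters by jointly optimizing the reduced dimension and the cluster assignment. We compute a reconstruction loss based on a convolutional autoencoder (CNN-AutoEncoder) and use Student's t-distribution as a kernel function to obtain the probability that each data point belongs to each cluster. The distribution obtained by dividing this probability by the cluster size and normalizing it is defined as the auxiliary target distribution; treating it as the true label lets cluster analysis, which is unsupervised, be trained like supervised learning. Accordingly, we measure the difference between the two probability distributions with the Kullback-Leibler divergence and define it as the clustering loss. The sum of the reconstruction loss and the clustering loss is defined as the total loss, and the model is trained on this loss. The second model reduces the dimension of the image with a deep learning-based convolutional autoencoder and forms clusters with the k-Means algorithm.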
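
The image encoding step can be illustrated with a short sketch of the Gramian Angular Field. This example assumes the summation variant (entries equal to the cosine of the sum of two angles) and min-max rescaling of the series to [-1, 1]; the abstract does not say which variant or rescaling the thesis uses. The function name and the sample prices are hypothetical.

```python
import math


def gramian_angular_field(series):
    """Encode a 1-D series as an n x n matrix with G[i][j] = cos(phi_i + phi_j)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("series must not be constant")
    # Rescale to [-1, 1] so that arccos is defined for every value.
    scaled = [(2.0 * x - hi - lo) / (hi - lo) for x in series]
    # Clip to guard against floating-point values slightly outside [-1, 1].
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in scaled]
    return [[math.cos(a + b) for b in phi] for a in phi]


# Hypothetical closing prices for four business days.
prices = [100.0, 102.0, 101.0, 105.0]
image = gramian_angular_field(prices)
```

Each price window of length n becomes an n x n image, which is the kind of input a convolutional autoencoder expects.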

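The data used in the experiment are from the S&P 500: about 500 stocks over 960 business days, from late 2016 to mid-2020. The proposed models are compared with the baseline clustering methods on four validation metrics. The proposed models extract similar features from the training data, group highly associated stocks into clusters, and form better-performing stock clusters on validation data one period into the future. To confirm this, we ran a paired-sample t-test and found a significant difference, particularly in the validation metric based on the correlation coefficient.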

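Experiments show that applying traditional clustering algorithms to stock prices yields severely imbalanced cluster sizes. We therefore defined the ratio of the number of data points in the largest cluster to the number in the smallest cluster, and compared the ratio of the proposed model with those of traditional clustering models. Owing to the normalization effect of the model's auxiliary target distribution, the number of stocks per cluster is more evenly distributed than in other models, giving a stable and reasonable ratio.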

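Finally, after forming the clusters, we construct a portfolio by sampling stocks from the clusters in an appropriate manner. We solve the tangency portfolio problem to obtain the optimal stock weights of the portfolio. Based on these, we build the portfolio by selecting stocks at the optimal weights and compare its return with those of other models.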

Language: kor

URI: https://hdl.handle.net/10371/175191

https://dcollection.snu.ac.kr/common/orgView/000000165713

Files in This Item:

000000165713.pdf 4.42 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Industrial Engineering (산업공학과)
  - Theses (Master's Degree_산업공학과)
