Modeling Stock Prices with Textual Contents in 10-Q Reports
(Korean title: A Long-Term Stock Price Model Using the Original Text of 10-Q Quarterly Financial Reports)
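- Graduate School of Convergence Science and Technology, Dept. of Transdisciplinary Studies (Digital Information Convergence major)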
- Issue Date
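- Seoul National University Graduate School
- Thesis (Master's)-- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Dept. of Transdisciplinary Studies (Digital Information Convergence major), 2018. 8. 서봉원.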
- Researchers in finance have examined a wide range of algorithms and data sources for modeling stock prices. In the previous literature, prediction models have proved effective at modeling short-term prices. However, these algorithms and data sources have rarely been applied to modeling long-term price trends.
Successful research in fundamental analysis, the school of financial research that models stock prices using news and other company-related information, has depended on text analysis of news media, social network services, and internet message boards. This study proposes that the textual content of the 10-Q form mandated by the Securities and Exchange Commission (SEC) is more useful for modeling long-term stock prices, since 10-Q reports have relatively less variety in information content, a more future-oriented set of topics, and accurate information about future company risks and performance.
To test the informational applicability of 10-Q text, a total of 18,237 10-Q reports filed between January 2004 and January 2018 by companies listed in the January 2018 Standard & Poor's 500 index were collected from the SEC website. The collected corpus was split into train and test subsets for out-of-sample evaluation. The test subset consists of 3,000 documents published between December 2015 and January 2018, whereas the train subset consists of 11,132 documents published at least 90 days before the first publication date in the test subset.
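The date-based split described above can be sketched as follows. This is a minimal illustration, not the thesis code: the function name `split_by_publication_date`, the `(doc_id, publication_date)` tuple layout, and the toy records are all assumptions introduced here.

```python
from datetime import date, timedelta


def split_by_publication_date(reports, test_start, test_end, gap_days=90):
    """Split (doc_id, publication_date) pairs into train and test subsets.

    Test:  documents published within [test_start, test_end].
    Train: documents published at least `gap_days` before the earliest
           publication date found in the test subset.
    Documents that fall inside the gap are dropped from both subsets.
    """
    test = [r for r in reports if test_start <= r[1] <= test_end]
    if not test:
        return [], []
    first_test_date = min(r[1] for r in test)
    cutoff = first_test_date - timedelta(days=gap_days)
    train = [r for r in reports if r[1] <= cutoff]
    return train, test


# Hypothetical toy records, not data from the thesis.
reports = [
    ("a", date(2015, 6, 1)),
    ("b", date(2015, 10, 15)),  # inside the 90-day gap, so dropped
    ("c", date(2015, 12, 10)),
    ("d", date(2017, 3, 1)),
]
train, test = split_by_publication_date(
    reports, date(2015, 12, 1), date(2018, 1, 31)
)
```

One plausible reason for the gap is that a target looking 90 days ahead of a training document would otherwise overlap the test period; the abstract does not state the motivation, so this reading is an assumption.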
Because of the inherent difficulty of modeling long-term prices, we make a few adjustments to standard methods in our experiment. To use only the most current and relevant information, only the recent additions to each 10-Q document are used. We then use 50 topic probabilities produced by Latent Dirichlet Allocation (LDA) on our corpus to mitigate the curse of dimensionality and to induce hidden topical structures. Moreover, the information content of the 10-Q corpus is known to be associated with possible future events, which may be modeled more effectively using abnormal gains or losses within a specified period. Thus, we also propose a binary classification of stock prices using a modified return on investment that separates stocks that have had abnormal gains within 90 days from stocks that have not.
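The abstract does not give the exact definition of the modified return on investment, so the sketch below is only one possible reading. It contrasts a traditional return (publication price versus the price at the end of the window) with a hypothetical "peak" return that asks whether the stock reached an abnormal gain at any point within the window. The names `traditional_roi`, `peak_roi`, `abnormal_gain_label`, the 10% threshold, and the toy prices are assumptions for illustration.

```python
def traditional_roi(prices):
    """Return from the publication-day price to the last price in the window."""
    return prices[-1] / prices[0] - 1.0


def peak_roi(prices):
    """Best return reachable inside the window, relative to the
    publication-day price (assumed stand-in for the modified measure)."""
    return max(prices[1:]) / prices[0] - 1.0


def abnormal_gain_label(prices, threshold=0.10):
    """1 if the stock had an abnormal gain within the window, else 0.

    The 10% threshold is a hypothetical value, not taken from the thesis.
    """
    return 1 if peak_roi(prices) >= threshold else 0


# Toy price path: prices[0] is the publication-day price and the last
# entry is the price at the end of the 90-day window.
prices = [100.0, 104.0, 115.0, 98.0, 101.0]
label = abnormal_gain_label(prices)   # peak gain of about 15%
end_return = traditional_roi(prices)  # about 1% at the end of the window
```

In this toy path the end-of-window return is close to zero even though the stock gained roughly 15% along the way, which is the kind of event a window-based binary label can capture and an endpoint return cannot.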
We then construct a 10-Q corpus stock price prediction model using a Stacked Ensemble of Generalized Linear Regression, Random Forest, Extremely Randomized Trees, Gradient Boosting Machine, and Deep Feed-Forward Neural Network. Models were evaluated on the entire test set, on quarterly subsets, and with a simulated investment portfolio. Notable findings of this study include: 1) The highest area under the Receiver Operating Characteristic curve (AUC), 0.5878, was obtained by the model combining Latent Dirichlet Allocation, the proposed performance measurement, and the Stacked Ensemble. 2) Model performance increases when using Latent Dirichlet Allocation (highest AUC of 0.5878) compared to a bag-of-words text representation (0.5727). 3) Prediction cannot be made using the traditional return on investment based on the publication price and the price 90 days after publication. 4) A portfolio of stocks selected by our model had higher two-year compound earnings, 55.78%, than the S&P 500 average of 29.98%. 5) There is no observable difference in topics between false positive stocks and true positive stocks in the investment simulation. 6) Earnings in the simulated investment portfolio increased when our proposed performance measurement was closer to the distribution of 90-day prices.
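For readers unfamiliar with the AUC figures quoted above, the following sketch computes AUC from its pairwise definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are assumptions for illustration and are unrelated to the thesis results.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 2 positives, 3 negatives, 5 of 6 pairs ranked correctly.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

An AUC of 0.5 corresponds to random ranking, so values such as 0.5878 and 0.5727 indicate a modest but nonzero ability to rank stocks with abnormal gains above the rest.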
This study makes the following contributions to the growing body of research in finance and machine learning. First, we review the application of previously overlooked textual content in 10-Q reports to modeling long-term stock prices. Second, we propose a new method for representing future stock prices and compare the performance of models built with the conventional and the proposed return on investment. Lastly, we review the application of LDA to the 10-Q corpus for building price prediction models.
In sum, this study concludes that the 10-Q corpus can be used for modeling stock prices. The 10-Q corpus is a unique, future-oriented dataset that can be used to analyze future stock prices and company value. This study reviewed only a handful of approaches for testing the informational usability of 10-Q reports, and there are a number of other measurements that may enhance performance and provide a better understanding of the effects of 10-Q text on stock prices.