Memory Management of Multicore Systems to Accelerate Memory Intensive Applications
메모리 집약적 응용프로그램 가속을 위한 멀티코어 시스템 메모리 관리
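- Graduate School of Convergence Science and Technology, Dept. of Transdisciplinary Studies (Intelligent Convergence Systems major)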
- Issue Date
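- Seoul National University Graduate School
- Thesis (Ph.D.) -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Dept. of Transdisciplinary Studies (Intelligent Convergence Systems major), 2018. 8. 안정호.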
- Modern data-parallel architectures deliver ever-higher compute performance by adopting chips with tens or even hundreds of compute units, and are therefore widely used to process data-intensive applications (e.g., big data, machine learning) in various areas. As computing systems become highly parallel and the amount of data processed by the latest applications increases, the performance of DRAM-based memory systems becomes increasingly important for overall system performance. However, improving main-memory performance has remained challenging because of its high cost premium, in contrast to the rapid growth in computational power. Moreover, today's multi-threaded/multi-programmed applications do not fully utilize the peak memory bandwidth provided by contemporary memory systems, because they do not account for the complexity of the main-memory subsystem. In this thesis, we propose novel application-level solutions to accelerate the latest big memory applications and emerging Convolutional Neural Networks (CNNs) on multicore and manycore platforms.
The importance and popularity of big memory applications continue to grow, but using small (e.g., 4 KB) pages incurs frequent TLB misses, substantially degrading system performance. Large (e.g., 1 GB) pages or direct segments can alleviate the penalty of page table walks, but they also expose the organizational and operational details of modern DRAM-based memory systems to applications. Row-buffer conflicts are regarded as the main culprit behind the very large gap between peak and achieved main-memory throughput. Hardware-based approaches in memory controllers have achieved only limited success, and existing proposals that change memory allocators cannot be applied to large pages or direct segments. In this thesis, we first propose a set of application-level techniques that improve effective main-memory bandwidth by minimizing row-buffer conflicts for big memory applications using large pages. Experiments with a contemporary x86 server show that combining large pages with the proposed address linearization, bank coloring, and write streaming techniques improves the performance of three big memory applications (a high-throughput key-value store, fast Fourier transform, and radix sort) by 37.6, 22.9, and 68.1 percent, respectively.
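To make the notion of a row-buffer conflict concrete, the following is a toy model rather than the thesis's implementation. The DRAM geometry (`NUM_BANKS`, `ROW_BYTES`, `LINE_BYTES`), the address-to-bank mapping, and all function names are assumptions chosen only for illustration; real memory controllers use more elaborate, often undocumented mappings. The sketch interleaves two sequential access streams and counts how often an access finds a different row open in its bank, first with both streams in the same bank and then with the second stream placed in a different bank, which is the basic idea behind bank coloring.

```python
# Hypothetical DRAM geometry and address mapping, chosen only for illustration.
NUM_BANKS = 4      # assumed number of banks
ROW_BYTES = 1024   # assumed bytes covered by one row buffer
LINE_BYTES = 64    # assumed access granularity (one cache line)


def bank_and_row(addr):
    """Map a physical address to (bank, row) with consecutive rows spread over banks."""
    row_index = addr // ROW_BYTES
    return row_index % NUM_BANKS, row_index // NUM_BANKS


def count_conflicts(addresses):
    """Count accesses that hit a bank whose open row differs (open-page policy)."""
    open_row = {}
    conflicts = 0
    for addr in addresses:
        bank, row = bank_and_row(addr)
        if bank in open_row and open_row[bank] != row:
            conflicts += 1
        open_row[bank] = row
    return conflicts


def stream(base, n):
    """A sequential stream of n cache-line accesses starting at base."""
    return [base + i * LINE_BYTES for i in range(n)]


def interleave(a, b):
    """Alternate accesses of two streams, as two concurrent threads might."""
    return [addr for pair in zip(a, b) for addr in pair]


SAME_BANK_BASE = 8 * NUM_BANKS * ROW_BYTES     # maps to bank 0, row 8
OTHER_BANK_BASE = SAME_BANK_BASE + ROW_BYTES   # maps to bank 1, row 8

uncolored = count_conflicts(interleave(stream(0, 16), stream(SAME_BANK_BASE, 16)))
colored = count_conflicts(interleave(stream(0, 16), stream(OTHER_BANK_BASE, 16)))
print(uncolored, colored)
```

Under these assumed parameters, every access after the first conflicts when both streams share bank 0, while none conflict once the second stream is moved to bank 1. An application can only steer data placement this way when it sees contiguous physical memory, which is what large pages provide.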
CNNs have become the default choice for processing visual information, and their design complexity has been steadily increasing to improve accuracy. To cope with the massive amount of computation needed for such complex CNNs, the latest solutions block an image over the available dimensions (e.g., horizontal, vertical, channel, and kernel) and batch multiple input images to improve data reuse in the memory hierarchy. However, only a few studies have focused on the memory bottleneck caused by limited bandwidth. A bandwidth bottleneck can easily occur in CNN acceleration because CNN layers have different sizes with varying computation needs and because batching is typically performed over each layer of the CNN for ideal data reuse. Moreover, this problem has become more serious as the latest CNN models actively adopt non-convolutional (non-CONV) layers, including batch normalization (BN), to improve prediction performance. Non-CONV layers, including BN, typically have lower computational intensity than convolutional or fully-connected layers, and hence they are often constrained by main-memory bandwidth.
In this thesis, we first introduce a strategy of partitioning compute units in which the cores within each partition process a batch of input data synchronously to maximize data reuse, while different partitions run asynchronously. We show that this can yield an 8.0% performance gain on a commercial 64-core processor when running ResNet-50.
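The partitioning idea can be sketched with threads standing in for cores. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function `run_partitioned`, its parameters, and the log format are hypothetical, and the per-layer compute is left as a placeholder. Each partition owns a private barrier, so its cores stay in lockstep from layer to layer, while no barrier spans partitions.

```python
import threading


def run_partitioned(num_cores, partition_size, num_layers):
    """Run workers grouped into partitions; return a log of (partition, layer, core)."""
    assert num_cores % partition_size == 0
    num_partitions = num_cores // partition_size
    # One barrier per partition: synchronization never crosses partitions.
    barriers = [threading.Barrier(partition_size) for _ in range(num_partitions)]
    log = []
    log_lock = threading.Lock()

    def worker(core_id):
        part = core_id // partition_size
        for layer in range(num_layers):
            # Placeholder for this core's share of `layer` on the partition's batch.
            with log_lock:
                log.append((part, layer, core_id))
            # Wait only for the cores of the same partition before the next layer.
            barriers[part].wait()

    threads = [threading.Thread(target=worker, args=(c,)) for c in range(num_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    for entry in run_partitioned(num_cores=8, partition_size=4, num_layers=3):
        print(entry)
```

Within one partition, all log entries for a layer precede any entry for the next layer, whereas entries from different partitions may interleave in any order. That freedom lets one partition work on a compute-bound layer while another works on a bandwidth-bound one.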
Then, we propose to restructure BN layers by first splitting each BN layer into two sub-layers (fission) and then combining the first sub-layer with its preceding convolutional layer and the second sub-layer with the following activation and convolutional layers (fusion). The proposed solution can significantly reduce main-memory accesses while training the latest CNN models. Experiments on a chip multiprocessor with our modified Caffe implementation show that the proposed BN restructuring improves the performance of DenseNet with 121 convolutional layers by 28.4%.
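The fission and fusion steps can be illustrated with a small pure-Python sketch. It is not the modified Caffe code: the 1-D single-channel convolution, the function names, and the choice to derive variance from a running sum and sum of squares are simplifying assumptions (the sum-of-squares form is convenient for fusion but less numerically stable than a two-pass variance). The first function gathers the BN statistics while each convolution output is still at hand; the second applies normalization and ReLU in a single pass, with fusion into the following convolution omitted for brevity.

```python
import math


def conv1d(x, w):
    """Valid 1-D correlation of one channel; a stand-in for a CONV layer."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]


def conv_with_stats(batch, w):
    """CONV fused with the first BN sub-layer (statistics gathering).

    Sum and sum of squares are accumulated as outputs are produced, so the
    feature map is not re-read from memory just to compute mean and variance.
    """
    outputs, total, total_sq, count = [], 0.0, 0.0, 0
    for x in batch:
        y = conv1d(x, w)
        for v in y:
            total += v
            total_sq += v * v
        count += len(y)
        outputs.append(y)
    mean = total / count
    var = total_sq / count - mean * mean  # biased variance over the mini-batch
    return outputs, mean, var


def normalize_relu(outputs, mean, var, gamma=1.0, beta=0.0, eps=1e-5):
    """Second BN sub-layer fused with ReLU: one read and one write per element."""
    inv_std = 1.0 / math.sqrt(var + eps)
    return [[max(0.0, gamma * (v - mean) * inv_std + beta) for v in y]
            for y in outputs]


if __name__ == "__main__":
    batch = [[1.0, 2.0, 3.0, 4.0], [0.0, -1.0, 2.0, 5.0]]
    outputs, mean, var = conv_with_stats(batch, [1.0, -1.0])
    print(normalize_relu(outputs, mean, var))
```

An unfused pipeline would traverse the feature map separately for the convolution output, the mean, the variance, the normalization, and the activation; the restructured version touches it once when writing and once when normalizing, which is where the reduction in main-memory traffic comes from.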