Publications

Detailed Information

Memory Management of Multicore Systems to Accelerate Memory Intensive Applications : 메모리 집약적 응용프로그램 가속을 위한 멀티코어 시스템 메모리 관리

DC Field: Value

dc.contributor.advisor: 안정호
dc.contributor.author: 정대진
dc.date.accessioned: 2018-11-12T00:58:08Z
dc.date.available: 2018-11-12T00:58:08Z
dc.date.issued: 2018-08
dc.identifier.other: 000000152162
dc.identifier.uri: https://hdl.handle.net/10371/143182
dc.description: Thesis (Ph.D.) -- 서울대학교 대학원 : 융합과학기술대학원 융합과학부(지능형융합시스템전공), 2018. 8. 안정호.
dc.description.abstract:

Modern data-parallel architectures provide ever-increasing computational performance by adopting chips equipped with tens or even hundreds of compute units, and hence are widely used for processing data-intensive applications (e.g., big data, machine learning) in various areas. As computing systems become highly parallel and the amount of data processed by the latest applications grows, the performance of DRAM-based memory systems becomes increasingly important to overall system performance. However, improving the performance of main-memory systems has remained relatively challenging due to a high cost premium, in contrast to the rapid growth in computational power. Moreover, today's multi-threaded/multi-programmed applications do not fully utilize the peak memory bandwidth provided by contemporary memory systems, because they do not account for the complexity of the main-memory subsystem. In this thesis, we propose novel application-level solutions to accelerate the latest big-memory applications and emerging Convolutional Neural Networks (CNNs) on multicore and manycore platforms.

The importance and popularity of big-memory applications continue to grow, but using small (e.g., 4 KB) pages incurs frequent TLB misses, substantially degrading system performance. Large (e.g., 1 GB) pages or direct segments can alleviate this page-table-walk penalty, but at the same time such a strategy exposes the organizational and operational details of modern DRAM-based memory systems to applications. Row-buffer conflicts are regarded as the main culprits behind the very large gaps between peak and achieved main-memory throughput, but hardware-based approaches in memory controllers have achieved only limited success, whereas existing proposals that change memory allocators cannot be applied to large pages or direct segments. In this thesis, we first propose a set of application-level techniques to improve effective main-memory bandwidth by minimizing row-buffer conflicts for big-memory applications using large pages. Experiments with a contemporary x86 server show that combining large pages with the proposed address linearization, bank coloring, and write streaming techniques improves the performance of three big-memory applications (a high-throughput key-value store, fast Fourier transform, and radix sort) by 37.6, 22.9, and 68.1 percent, respectively.
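The bank-coloring idea can be sketched as follows. Because a 1 GB large page is physically contiguous, the application controls the address bits that select the DRAM bank, so each thread can be steered onto its own set of banks and away from other threads' row buffers. The bit positions, bank count, and chunk size below are assumptions for illustration, not the mapping of any particular platform:

```python
# Sketch of application-level bank coloring inside a large page.
# Which physical-address bits select the DRAM bank is an
# assumption here; real mappings are platform-specific.

BANK_SHIFT = 13      # assumed: bank index starts at address bit 13
NUM_BANKS  = 16      # assumed: 16 banks visible to the mapping

def bank_of(addr: int) -> int:
    """Bank index implied by an in-page (physical) address."""
    return (addr >> BANK_SHIFT) % NUM_BANKS

def colored_offset(thread_id: int, chunk: int, chunk_size: int) -> int:
    """Place thread_id's chunk so the thread only ever touches banks
    matching its own color, avoiding inter-thread row-buffer conflicts."""
    color = thread_id % NUM_BANKS
    # stride over the page in NUM_BANKS-sized groups of bank slots
    group = chunk * NUM_BANKS + color
    return group * chunk_size

# Example: with chunks sized to one bank slot, thread 3's chunks
# always land in bank 3.
CHUNK = 1 << BANK_SHIFT
offsets = [colored_offset(3, c, CHUNK) for c in range(4)]
assert all(bank_of(o) == 3 for o in offsets)
```

With this layout, two threads of different colors can stream through the page concurrently without closing each other's open DRAM rows.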

CNNs have become the default choice for processing visual information, and the design complexity of CNNs has been steadily increasing to improve accuracy. To cope with the massive amount of computation needed for such complex CNNs, the latest solutions utilize blocking of an image over the available dimensions (e.g., horizontal, vertical, channel, and kernel) and batching of multiple input images to improve data reuse in the memory hierarchy. However, only a few studies have focused on the memory bottleneck caused by limited bandwidth. A bandwidth bottleneck can easily arise in CNN acceleration because CNN layers have different sizes with varying computation needs and because batching is typically performed over each layer of a CNN for ideal data reuse. Moreover, this problem has become more serious as the latest CNN models actively adopt non-convolutional (non-CONV) layers, including batch normalization (BN), to improve prediction performance. Non-CONV layers, including BN, typically have much lower computational intensity than convolutional or fully-connected layers, and hence they are often constrained by main-memory bandwidth.
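The claim that non-CONV layers such as BN are bandwidth-bound can be made concrete with a back-of-envelope arithmetic-intensity estimate (FLOPs per byte of DRAM traffic). The layer shapes, per-element FLOP counts, and the assumption of FP32 with full on-chip reuse for the convolution are illustrative, not measurements from the thesis:

```python
# Rough arithmetic-intensity comparison: BN vs. a 3x3 convolution.
# Assumes FP32 tensors and that the convolution's activations and
# weights are each read from DRAM only once (ideal cache reuse).

BYTES = 4  # FP32 element size

def bn_intensity() -> float:
    # per element: ~4 FLOPs (subtract mean, scale, shift, stat update)
    # DRAM traffic: read x, write y -> 2 elements
    return 4 / (2 * BYTES)

def conv3x3_intensity(h=56, w=56, cin=128, cout=128) -> float:
    # multiply-accumulate counted as 2 FLOPs
    flops = 2 * h * w * cin * cout * 9
    # minimal traffic: read input, write output, read 3x3 weights
    traffic = (h * w * cin + h * w * cout + 9 * cin * cout) * BYTES
    return flops / traffic

print(f"BN  : {bn_intensity():.2f} FLOP/byte")    # well under 1
print(f"conv: {conv3x3_intensity():.1f} FLOP/byte")  # hundreds
```

Even under generous assumptions, BN performs a fraction of a FLOP per byte moved, while the convolution performs hundreds, which is why BN's runtime is dictated by memory bandwidth rather than compute throughput.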

In this thesis, we first introduce a strategy of partitioning compute units in which the cores within each partition process a batch of input data synchronously to maximize data reuse, while different partitions run asynchronously. We show that this can yield an 8.0% performance gain on a commercial 64-core processor running ResNet-50.
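The partitioning strategy can be sketched as a simple core-to-partition and batch-assignment scheme. The 64-core count matches the processor mentioned above, but the partition count and the round-robin batch assignment are illustrative assumptions:

```python
# Sketch of compute-unit partitioning for statistical traffic shaping:
# cores in a partition process one batch together (synchronously,
# maximizing reuse), while partitions progress independently, so
# their memory-heavy layers are statistically unlikely to overlap.
# Partition count and batch assignment are illustrative.

NUM_CORES      = 64
PARTITIONS     = 4
CORES_PER_PART = NUM_CORES // PARTITIONS

def partition_of(core: int) -> int:
    """Static mapping of a core to its partition."""
    return core // CORES_PER_PART

def batches_for(partition: int, total_batches: int) -> list:
    """Round-robin share of input batches handled by one partition;
    each partition works through its share at its own pace."""
    return [b for b in range(total_batches) if b % PARTITIONS == partition]

# Cores 0-15 form partition 0 and share batches 0, 4, 8, ...
assert partition_of(15) == 0 and partition_of(16) == 1
assert batches_for(0, 12) == [0, 4, 8]
```

Because the partitions drift out of phase, their bandwidth-hungry layers (e.g., BN) rarely hit main memory at the same instant, smoothing the aggregate traffic that a fully synchronous 64-core run would produce in bursts.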

Then, we propose to restructure BN layers by first splitting each into two sub-layers (fission) and then combining the first sub-layer with its preceding convolutional layer and the second sub-layer with the following activation and convolutional layers (fusion). The proposed solution significantly reduces main-memory accesses while training the latest CNN models, and experiments on a chip multiprocessor with our modified Caffe implementation show that the proposed BN restructuring improves the performance of DenseNet with 121 convolutional layers by 28.4%.
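A minimal NumPy sketch of the fission step, assuming NCHW tensors and illustrative shapes; the fusion with the neighboring convolutional layers is indicated only by comments, and the actual modified-Caffe implementation is not reproduced here:

```python
import numpy as np

# BN fission sketch: the one-pass BN layer
#   y = gamma * (x - mean) / std + beta
# is split into (1) a statistics sub-layer that only accumulates
# per-channel sum and sum-of-squares, intended to be fused into the
# preceding conv's output loop while its results are still on-chip,
# and (2) an apply sub-layer that normalizes, scales, and shifts,
# fused with the following ReLU. Shapes and eps are illustrative.

EPS = 1e-5

def bn_stats(x):
    # sub-layer 1: cheap reductions over (N, H, W) per channel
    n = x.shape[0] * x.shape[2] * x.shape[3]
    s  = x.sum(axis=(0, 2, 3))
    ss = (x * x).sum(axis=(0, 2, 3))
    mean = s / n
    var  = ss / n - mean ** 2
    return mean, var

def bn_apply_relu(x, mean, var, gamma, beta):
    # sub-layer 2: elementwise normalize + affine, fused with ReLU
    xhat = (x - mean[None, :, None, None]) / np.sqrt(
        var[None, :, None, None] + EPS)
    y = gamma[None, :, None, None] * xhat + beta[None, :, None, None]
    return np.maximum(y, 0.0)

x = np.random.randn(2, 3, 4, 4).astype(np.float32)
gamma, beta = np.ones(3, np.float32), np.zeros(3, np.float32)
mean, var = bn_stats(x)
y = bn_apply_relu(x, mean, var, gamma, beta)
assert y.shape == x.shape and (y >= 0).all()
```

Run as two fused halves, neither sub-layer needs a standalone pass that re-reads the full activation tensor from main memory, which is the source of the traffic reduction.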
dc.description.tableofcontents:

Contents



Abstract i

Contents v

List of Figures viii

List of Tables x



Introduction 1

1.1 Accelerating Big Memory Applications 3

1.2 Accelerating the Latest CNN Models 5

1.3 Research Contributions 10

1.4 Outline 11

Large Pages on Steroids: Small Ideas to Accelerate Big Memory Applications 12

2.1 DRAM Organizational and Operational Details Exposed to Big Memory Workloads 12

2.2 Exploiting DRAM Interleaving Exposed to Multithreaded Big Memory Applications 16

2.3 Experimental Setup 20

2.4 Evaluation 21

CNN Background and Trends 26

3.1 Pertinent details of vanilla CNN models 26

3.2 Trends in CNN accelerator designs 29

3.3 Trends in recent CNN models 32

3.4 DenseNet: a state-of-the-art CNN model 34

Partitioning Compute Units in CNN Acceleration for Statistical Memory Traffic Shaping 37

4.1 Introduction 37

4.2 Data Reuse Characteristics of CNN 39

4.3 Statistical Memory Traffic Shaping by Partitioning Compute Units 45

4.4 Experimental Setup 48

4.5 Evaluation 49

Restructuring Batch Normalization Layer for Accelerating CNN Training 53

5.1 Restructuring Batch Normalization 53

5.1.1 Analyzing DenseNet 53

5.1.2 Fission-n-Fusion 59

5.2 Experimental Setup 66

5.3 Evaluation 69

5.4 Related Work 75

5.4.1 Maximizing data reuse 76

5.4.2 Pruning and approximate computing 76

5.4.3 Training acceleration 77

5.4.4 Fusing and blending layers 77

Conclusion 79

6.1 Future Work 82

Bibliography 84

국문초록 (Abstract in Korean) 95
dc.language.iso: en
dc.publisher: 서울대학교 대학원
dc.subject.ddc: 620.82
dc.title: Memory Management of Multicore Systems to Accelerate Memory Intensive Applications
dc.title.alternative: 메모리 집약적 응용프로그램 가속을 위한 멀티코어 시스템 메모리 관리
dc.type: Thesis
dc.description.degree: Doctor
dc.contributor.affiliation: 융합과학기술대학원 융합과학부(지능형융합시스템전공)
dc.date.awarded: 2018-08
Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.