Publications

Detailed Information

Memory Management of Multicore Systems to Accelerate Memory Intensive Applications : 메모리 집약적 응용프로그램 가속을 위한 멀티코어 시스템 메모리 관리

DC Field: Value

dc.contributor.advisor: 안정호
dc.contributor.author: 정대진
dc.date.accessioned: 2018-11-12T00:58:08Z
dc.date.available: 2018-11-12T00:58:08Z
dc.date.issued: 2018-08
dc.identifier.other: 000000152162
dc.identifier.uri: https://hdl.handle.net/10371/143182
dc.description: Thesis (Ph.D.) -- 서울대학교 대학원 : 융합과학기술대학원 융합과학부(지능형융합시스템전공), 2018. 8. 안정호.
dc.description.abstract:

Modern data-parallel architectures provide ever-increasing computational performance by adopting chips equipped with tens or even hundreds of compute units, and hence are widely used for processing data-intensive applications (e.g., big data, machine learning) in various areas. As computing systems become highly parallel and the amount of data processed by the latest applications grows, the performance of DRAM-based memory systems becomes increasingly important to overall system performance. However, improving the performance of main-memory systems has remained relatively challenging due to a high cost premium, in contrast to the rapid growth in computational power. Moreover, today's multi-threaded/multi-programmed applications do not fully utilize the peak memory bandwidth provided by contemporary memory systems, because they do not account for the complexity of the main-memory subsystem. In this thesis, we propose novel application-level solutions to accelerate the latest big-memory applications and emerging Convolutional Neural Networks (CNNs) on multicore and manycore platforms.

The importance and popularity of big-memory applications continue to grow, but using small (e.g., 4 KB) pages incurs frequent TLB misses, substantially degrading system performance. Large (e.g., 1 GB) pages or direct segments can alleviate this page-table-walk penalty, but at the same time such a strategy exposes the organizational and operational details of modern DRAM-based memory systems to applications. Row-buffer conflicts are regarded as the main culprits behind the very large gaps between peak and achieved main-memory throughput, but hardware-based approaches in memory controllers have achieved only limited success, whereas existing proposals that change memory allocators cannot be applied to large pages or direct segments. In this thesis, we first propose a set of application-level techniques to improve effective main-memory bandwidth by minimizing row-buffer conflicts for big-memory applications using large pages. Experiments with a contemporary x86 server show that combining large pages with the proposed address linearization, bank coloring, and write streaming techniques improves the performance of three big-memory applications (a high-throughput key-value store, fast Fourier transform, and radix sort) by 37.6, 22.9, and 68.1 percent, respectively.
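The bank-coloring idea can be sketched as follows. Because a 1 GB large page is physically contiguous, the application controls the address bits that select the DRAM bank, so each thread can be steered onto its own set of banks and away from other threads' row buffers. The bit positions, bank count, and chunk size below are assumptions for illustration, not the mapping of any particular platform:

```python
# Sketch of application-level bank coloring inside a large page.
# Which physical-address bits select the DRAM bank is an
# assumption here; real mappings are platform-specific.

BANK_SHIFT = 13      # assumed: bank index starts at address bit 13
NUM_BANKS  = 16      # assumed: 16 banks visible to the mapping

def bank_of(addr: int) -> int:
    """Bank index implied by an in-page (physical) address."""
    return (addr >> BANK_SHIFT) % NUM_BANKS

def colored_offset(thread_id: int, chunk: int, chunk_size: int) -> int:
    """Place thread_id's chunk so the thread only ever touches banks
    matching its own color, avoiding inter-thread row-buffer conflicts."""
    color = thread_id % NUM_BANKS
    # stride over the page in NUM_BANKS-sized groups of bank slots
    group = chunk * NUM_BANKS + color
    return group * chunk_size

# Example: with chunks sized to one bank slot, thread 3's chunks
# always land in bank 3.
CHUNK = 1 << BANK_SHIFT
offsets = [colored_offset(3, c, CHUNK) for c in range(4)]
assert all(bank_of(o) == 3 for o in offsets)
```

With this layout, two threads of different colors can stream through the page concurrently without closing each other's open DRAM rows.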

CNNs have become the default choice for processing visual information, and the design complexity of CNNs has been steadily increasing to improve accuracy. To cope with the massive amount of computation needed for such complex CNNs, the latest solutions utilize blocking of an image over the available dimensions (e.g., horizontal, vertical, channel, and kernel) and batching of multiple input images to improve data reuse in the memory hierarchy. However, only a few studies have focused on the memory bottleneck caused by limited bandwidth. A bandwidth bottleneck can easily arise in CNN acceleration because CNN layers have different sizes with varying computation needs and because batching is typically performed over each layer of a CNN for ideal data reuse. Moreover, this problem has become more serious as the latest CNN models actively adopt non-convolutional (non-CONV) layers, including batch normalization (BN), to improve prediction performance. Non-CONV layers, including BN, typically have much lower computational intensity than convolutional or fully-connected layers, and hence they are often constrained by main-memory bandwidth.
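The claim that non-CONV layers such as BN are bandwidth-bound can be made concrete with a back-of-envelope arithmetic-intensity estimate (FLOPs per byte of DRAM traffic). The layer shapes, per-element FLOP counts, and the assumption of FP32 with full on-chip reuse for the convolution are illustrative, not measurements from the thesis:

```python
# Rough arithmetic-intensity comparison: BN vs. a 3x3 convolution.
# Assumes FP32 tensors and that the convolution's activations and
# weights are each read from DRAM only once (ideal cache reuse).

BYTES = 4  # FP32 element size

def bn_intensity() -> float:
    # per element: ~4 FLOPs (subtract mean, scale, shift, stat update)
    # DRAM traffic: read x, write y -> 2 elements
    return 4 / (2 * BYTES)

def conv3x3_intensity(h=56, w=56, cin=128, cout=128) -> float:
    # multiply-accumulate counted as 2 FLOPs
    flops = 2 * h * w * cin * cout * 9
    # minimal traffic: read input, write output, read 3x3 weights
    traffic = (h * w * cin + h * w * cout + 9 * cin * cout) * BYTES
    return flops / traffic

print(f"BN  : {bn_intensity():.2f} FLOP/byte")    # well under 1
print(f"conv: {conv3x3_intensity():.1f} FLOP/byte")  # hundreds
```

Even under generous assumptions, BN performs a fraction of a FLOP per byte moved, while the convolution performs hundreds, which is why BN's runtime is dictated by memory bandwidth rather than compute throughput.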

In this thesis, we first introduce a strategy of partitioning compute units in which the cores within each partition process a batch of input data synchronously to maximize data reuse, while different partitions run asynchronously. We show that this can yield an 8.0% performance gain on a commercial 64-core processor running ResNet-50.
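The partitioning strategy can be sketched as a simple core-to-partition and batch-assignment scheme. The 64-core count matches the processor mentioned above, but the partition count and the round-robin batch assignment are illustrative assumptions:

```python
# Sketch of compute-unit partitioning for statistical traffic shaping:
# cores in a partition process one batch together (synchronously,
# maximizing reuse), while partitions progress independently, so
# their memory-heavy layers are statistically unlikely to overlap.
# Partition count and batch assignment are illustrative.

NUM_CORES      = 64
PARTITIONS     = 4
CORES_PER_PART = NUM_CORES // PARTITIONS

def partition_of(core: int) -> int:
    """Static mapping of a core to its partition."""
    return core // CORES_PER_PART

def batches_for(partition: int, total_batches: int) -> list:
    """Round-robin share of input batches handled by one partition;
    each partition works through its share at its own pace."""
    return [b for b in range(total_batches) if b % PARTITIONS == partition]

# Cores 0-15 form partition 0 and share batches 0, 4, 8, ...
assert partition_of(15) == 0 and partition_of(16) == 1
assert batches_for(0, 12) == [0, 4, 8]
```

Because the partitions drift out of phase, their bandwidth-hungry layers (e.g., BN) rarely hit main memory at the same instant, smoothing the aggregate traffic that a fully synchronous 64-core run would produce in bursts.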

Then, we propose to restructure BN layers by first splitting each into two sub-layers (fission) and then combining the first sub-layer with its preceding convolutional layer and the second sub-layer with the following activation and convolutional layers (fusion). The proposed solution significantly reduces main-memory accesses while training the latest CNN models, and experiments on a chip multiprocessor with our modified Caffe implementation show that the proposed BN restructuring improves the performance of DenseNet with 121 convolutional layers by 28.4%.
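A minimal NumPy sketch of the fission step, assuming NCHW tensors and illustrative shapes; the fusion with the neighboring convolutional layers is indicated only by comments, and the actual modified-Caffe implementation is not reproduced here:

```python
import numpy as np

# BN fission sketch: the one-pass BN layer
#   y = gamma * (x - mean) / std + beta
# is split into (1) a statistics sub-layer that only accumulates
# per-channel sum and sum-of-squares, intended to be fused into the
# preceding conv's output loop while its results are still on-chip,
# and (2) an apply sub-layer that normalizes, scales, and shifts,
# fused with the following ReLU. Shapes and eps are illustrative.

EPS = 1e-5

def bn_stats(x):
    # sub-layer 1: cheap reductions over (N, H, W) per channel
    n = x.shape[0] * x.shape[2] * x.shape[3]
    s  = x.sum(axis=(0, 2, 3))
    ss = (x * x).sum(axis=(0, 2, 3))
    mean = s / n
    var  = ss / n - mean ** 2
    return mean, var

def bn_apply_relu(x, mean, var, gamma, beta):
    # sub-layer 2: elementwise normalize + affine, fused with ReLU
    xhat = (x - mean[None, :, None, None]) / np.sqrt(
        var[None, :, None, None] + EPS)
    y = gamma[None, :, None, None] * xhat + beta[None, :, None, None]
    return np.maximum(y, 0.0)

x = np.random.randn(2, 3, 4, 4).astype(np.float32)
gamma, beta = np.ones(3, np.float32), np.zeros(3, np.float32)
mean, var = bn_stats(x)
y = bn_apply_relu(x, mean, var, gamma, beta)
assert y.shape == x.shape and (y >= 0).all()
```

Run as two fused halves, neither sub-layer needs a standalone pass that re-reads the full activation tensor from main memory, which is the source of the traffic reduction.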
dc.description.tableofcontents:

Contents



Abstract i

Contents v

List of Figures viii

List of Tables x



Introduction 1

1.1 Accelerating Big Memory Applications 3

1.2 Accelerating the Latest CNN Models 5

1.3 Research Contributions 10

1.4 Outline 11

Large Pages on Steroids: Small Ideas to Accelerate Big Memory Applications 12

2.1 DRAM Organizational and Operational Details Exposed to Big Memory Workloads 12

2.2 Exploiting DRAM Interleaving Exposed to Multithreaded Big Memory Applications 16

2.3 Experimental Setup 20

2.4 Evaluation 21

CNN Background and Trends 26

3.1 Pertinent details of vanilla CNN models 26

3.2 Trends in CNN accelerator designs 29

3.3 Trends in recent CNN models 32

3.4 DenseNet: a state-of-the-art CNN model 34

Partitioning Compute Units in CNN Acceleration for Statistical Memory Traffic Shaping 37

4.1 Introduction 37

4.2 Data Reuse Characteristics of CNN 39

4.3 Statistical Memory Traffic Shaping by Partitioning Compute Units 45

4.4 Experimental Setup 48

4.5 Evaluation 49

Restructuring Batch Normalization Layer for Accelerating CNN Training 53

5.1 Restructuring Batch Normalization 53

5.1.1 Analyzing DenseNet 53

5.1.2 Fission-n-Fusion 59

5.2 Experimental Setup 66

5.3 Evaluation 69

5.4 Related Work 75

5.4.1 Maximizing data reuse 76

5.4.2 Pruning and approximate computing 76

5.4.3 Training acceleration 77

5.4.4 Fusing and blending layers 77

Conclusion 79

6.1 Future Work 82

Bibliography 84

국문초록 (Abstract in Korean) 95
dc.language.iso: en
dc.publisher: 서울대학교 대학원
dc.subject.ddc: 620.82
dc.title: Memory Management of Multicore Systems to Accelerate Memory Intensive Applications
dc.title.alternative: 메모리 집약적 응용프로그램 가속을 위한 멀티코어 시스템 메모리 관리
dc.type: Thesis
dc.description.degree: Doctor
dc.contributor.affiliation: 융합과학기술대학원 융합과학부(지능형융합시스템전공)
dc.date.awarded: 2018-08
Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.