Detailed Information
TrainBox: An Extreme-Scale Neural Network Training Server Architecture by Systematically Balancing Operations
Cited 4 times in Web of Science
Cited 5 times in Scopus
- Authors
- Issue Date
- 2020-10
- Publisher
- IEEE COMPUTER SOC
- Citation
- 2020 53RD ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE (MICRO 2020), pp.825-838
- Abstract
- Neural networks are a major driving force of another golden age of computing; computer architects have proposed specialized accelerators (e.g., TPU), high-speed interconnects (e.g., NVLink), and algorithms (e.g., ring-based reduction) to efficiently support neural network applications. As a result, these platforms achieve orders of magnitude higher efficiency for neural network computation and inter-accelerator communication than traditional computing platforms. In this paper, we identify that the emerging platforms have shifted the performance bottleneck of neural network training from model computation and inter-accelerator communication to data preparation. Although overlapping data preparation with the other operations has so far hidden the preparation overhead, the higher input processing demands of emerging platforms are starting to reverse the situation; at scale, data preparation requires an infeasible amount of host-side CPU, memory, and PCIe resources. Our detailed analysis reveals that this heavy resource consumption comes from transforming data into neural-network-specific formats and from buffering for communication among devices. Therefore, we propose a scalable neural network server architecture that balances data preparation against the other operations. To achieve extreme scalability, our design relies on a scalable device array, rather than the limited host resources, with three key ideas. First, we offload CPU-intensive operations to customized data preparation accelerators so that training performance scales regardless of host-side CPU performance. Second, we apply direct inter-device communication to eliminate unnecessary data copies and reduce the pressure on host memory. Lastly, we cluster the underlying devices according to the unique communication patterns of neural network processing and the interconnect characteristics, so that the aggregate interconnect bandwidth is used efficiently. Our evaluation shows that the proposed architecture achieves 44.4x higher training throughput on average than a naively extended server architecture with 256 neural network accelerators.