Efficient I/O Management Schemes for All-Flash HPC Storage Systems

Abstract: I/O traffic in high performance computing (HPC) storage systems is dominated by the checkpoints and restarts of HPC applications.
For such bursty I/O, new all-flash HPC storage systems with an integrated burst buffer (BB) and parallel file system (PFS) have been proposed.
However, most of the distributed file systems (DFS) used to configure the storage systems provide a single connection between a compute node and a server node, which hinders users from utilizing the high I/O bandwidth provided by an all-flash server node.
To provide multiple connections, DFSs must be modified to increase the number of sockets, which is an extremely difficult and time-consuming task owing to their complicated structures.
Users can increase the number of daemons in the DFSs to forcibly increase the number of connections without a DFS modification.
Because each daemon has a mount point for its connection, there are multiple mount points in the compute nodes, resulting in significant effort required for users to distribute file I/O requests to multiple mount points.
In addition, to avoid access to a PFS composed of low-speed storage devices, such as hard disks, dedicated BB allocation is preferred despite its severe underutilization.
However, dedicated BB allocation may be inappropriate because all-flash HPC storage systems speed up access to the PFS.
To handle these problems, we propose efficient user-transparent I/O management schemes for all-flash HPC storage systems.
The first scheme, I/O transfer management, provides multiple connections between a compute node and a server node without additional effort from DFS developers and users.
To do so, we modified a mount procedure and I/O processing procedures in a virtual file system (VFS).
In the second scheme, data management between BB and PFS, a BB over-subscription allocation method is adopted to improve the BB utilization.
Unfortunately, the allocation method aggravates the I/O interference and demotion overhead from the BB to the PFS, resulting in a degraded checkpoint and restart performance.
To minimize this degradation, we developed an I/O scheduler and a new data management method based on the checkpoint and restart characteristics.
To prove the effectiveness of our proposed schemes, we evaluated our I/O transfer and data management schemes between the BB and PFS.
The I/O transfer management scheme improves the write and read I/O throughputs for checkpoint and restart by up to 6 and 3 times, respectively, compared with a DFS using the original kernel.
With the data management scheme, BB utilization improves by at least 2.2-fold, and a more stable and higher checkpoint performance is guaranteed.
In addition, we achieved up to a 96.4% hit ratio of the restart requests on the BB and up to 3.1 times higher restart performance than existing methods.
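
To illustrate the manual effort the abstract attributes to the multiple-daemon workaround, the following is a minimal sketch of how a user might have to spread checkpoint files across several mount points by hand. It is not the thesis's VFS-level mechanism, which makes this step transparent; the function name `pick_mount`, the mount paths, and the file names are all hypothetical.

```python
import os
import zlib


def pick_mount(rel_path: str, mount_points: list[str]) -> str:
    """Map a relative file path to one of several mount points.

    A CRC32 of the path is used so the same file always resolves to the
    same mount point (hypothetical policy, chosen only for illustration).
    """
    if not mount_points:
        raise ValueError("at least one mount point is required")
    index = zlib.crc32(rel_path.encode("utf-8")) % len(mount_points)
    return os.path.join(mount_points[index], rel_path)


# Hypothetical mount points, one per DFS client daemon.
mounts = ["/mnt/dfs0", "/mnt/dfs1", "/mnt/dfs2", "/mnt/dfs3"]

# Each rank's checkpoint file is routed to a mount point by the user's code.
for rank in range(8):
    print(pick_mount(f"ckpt/rank{rank}.dat", mounts))
```

Every application would need logic of this kind, and it would have to agree on the mapping at restart time, which is the burden a user-transparent scheme is meant to remove.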
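Abstract (translated from the Korean): Most of the I/O bandwidth of HPC storage systems is consumed by the checkpoints and restarts of HPC applications.
To handle such bursty I/O from HPC applications smoothly, a new flash-based HPC storage system has been proposed that combines a burst buffer and a PFS using high-end and low-end flash storage devices.
However, most distributed file systems used to build such storage systems provide a single network connection between nodes, and therefore cannot exploit the high flash I/O bandwidth that a server node can provide.
To provide multiple network connections, either the distributed file system must be modified or the number of its client and server daemons must be increased.
However, because distributed file systems have very complicated structures, modifying them demands a great deal of time and effort from their developers.
Increasing the number of daemons creates a new mount point for each network connection, so users must themselves distribute file I/O requests across multiple mount points, which takes tremendous effort.
When the number of network connections is increased by adding server daemons, each server daemon has a different view of the file system directory, so users must be aware of the individual server daemons and take care that no data conflicts occur.
In addition, to avoid accessing a PFS built from low-speed storage devices such as hard disks, users have so far preferred the dedicated BB allocation method, even at the cost of burst buffer efficiency.
However, in the new flash-based HPC storage systems, access to the parallel file system is fast, so using this burst buffer allocation method is not appropriate.
To solve these problems, this thesis introduces efficient, user-transparent data management schemes for the new flash-based HPC storage systems.
The first scheme, I/O transfer management, provides multiple connections between a compute node and a server node without additional effort from distributed file system developers or users.
To this end, we modified the mount procedure and the I/O processing procedures of the virtual file system.
The second scheme, data management, uses the BB over-subscription method to improve burst buffer utilization.
However, this allocation method causes I/O contention among users and demotion overhead, and thus yields lower checkpoint and restart performance.
To prevent this, data in the burst buffer and the parallel file system is managed based on the characteristics of checkpoints and restarts.
To demonstrate the effectiveness of the proposed methods, we built a real flash-based storage system, applied the proposed methods, and evaluated their performance.
In the experiments, the I/O transfer management scheme delivered up to 6 times higher write and up to 2 times higher read I/O performance than the existing approach.
The data management scheme improved burst buffer utilization by 2.2 times compared with existing methods.
It also showed high and stable checkpoint performance and provided up to 3.1 times higher restart performance.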
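
The trade-off between dedicated and over-subscribed BB allocation can be made concrete with a toy model. The sketch below is an assumption-laden illustration, not the allocator described in the thesis: the admission rule, the `oversub` factor, and the job sizes are all hypothetical. Each job is a pair of (requested BB space, space actually used), and any data beyond the physical capacity is counted as needing demotion to the PFS.

```python
def admit(jobs, capacity, oversub=1.0):
    """Admit jobs in order while total requested space fits the limit.

    oversub=1.0 models dedicated allocation; oversub>1.0 lets the sum of
    requests exceed the physical BB capacity (hypothetical policy).
    """
    admitted, reserved = [], 0.0
    for requested, used in jobs:
        if reserved + requested <= capacity * oversub:
            admitted.append((requested, used))
            reserved += requested
    return admitted


def utilization(admitted, capacity):
    """Return (fraction of BB actually used, amount that must be demoted)."""
    used = sum(u for _, u in admitted)
    return min(used, capacity) / capacity, max(0.0, used - capacity)


# Hypothetical jobs: each requests 40 units but uses far less.
jobs = [(40, 10), (40, 15), (40, 20), (40, 12)]
capacity = 100

dedicated = utilization(admit(jobs, capacity, oversub=1.0), capacity)
oversubscribed = utilization(admit(jobs, capacity, oversub=2.0), capacity)
print("dedicated:", dedicated)
print("over-subscribed:", oversubscribed)
```

In this toy setting, dedicated allocation admits only two jobs and leaves most of the BB idle, while over-subscription admits all four. If the admitted jobs together used more than the capacity, the second value returned by `utilization` would become positive, which corresponds to the demotion overhead the abstract says must be managed.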

Language: eng

URI: https://hdl.handle.net/10371/169316

http://dcollection.snu.ac.kr/common/orgView/000000162547

Files in This Item:

000000162547.pdf 9.18 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Electrical and Computer Engineering (전기·정보공학부)
  - Theses (Ph.D. / Sc.D._전기·정보공학부)
