Efficient Parallel Simulation Techniques for Multi-processor Embedded Systems

윤덕용

Efficient Parallel Simulation Techniques for Multi-processor Embedded Systems : 멀티-프로세서 임베디드 시스템을 위한 효율적인 병렬 시뮬레이션 기법

Authors: 윤덕용

Advisor: 하순회

Major: Electrical Engineering and Computer Science

Issue Date: 2012-02

Publisher: Graduate School, Seoul National University

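Abstract: Continued advances in semiconductor process technology have made it possible to place tens to hundreds of processors on a single chip to build an MPSoC (Multi-Processor System-on-Chip). Virtual prototyping systems are widely used for MPSoC design verification because they are inexpensive and allow simulation without hardware. In a virtual prototyping system for an MPSoC, multiple component simulators typically run on a single host. However, as the number of components integrated on a chip grows, simulation speed decreases.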
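Parallel simulation is a technique that raises simulation performance by running the growing number of component simulators in parallel. However, when component simulators communicate frequently, the overhead of communication and synchronization can sometimes make the simulation slower than before. This thesis therefore proposes a simulation framework that improves the parallel simulation performance of MPSoCs using efficient techniques. The features of the proposed simulation framework are as follows.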
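First, the proposed simulation framework is scalable and flexible. In the proposed framework, a simulator wrapper manages the global time of each component simulator, while the component simulator advances using its local time, so that simulation can proceed quickly without frequent synchronization. In addition, periodic null messages let each component simulator update its progress at regular intervals, which increases parallelism. Experimental results showed that the proposed parallel simulation framework scales well.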
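Second, the proposed parallel simulation framework finds a configuration that yields the best simulation performance, taking the characteristics of the given application into account. The configuration comprises the simulation platform, the mapping of component simulators, and the interval of the periodic null messages used for time synchronization. To find the best configuration values, we formulated a performance estimation equation that accounts for the configuration elements and used a genetic algorithm to search for good results. As a result, we confirmed that the proposed technique distributes the simulation workload well through an appropriate mapping and finds a null message period that increases parallelism without incurring large communication overhead. The experiments confirmed that the assumed performance estimation equation predicts performance with about 90% accuracy and that the proposed genetic algorithm finds high-performance configurations that reflect the characteristics of the application.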
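Third, to further improve simulation performance, we proposed a relaxed synchronization technique that uses a simulation cache in each component simulator for simulation purposes. In the proposed framework, synchronous communication that waits for a response occurs on every access to shared memory. Excessive shared memory access slows the progress of the simulation and degrades its performance. To address this, we proposed a loose synchronization technique based on a simulation cache and message grouping, which reduces the number of synchronous communications and thereby raises simulation performance. Just as an architectural cache reduces frequent accesses to the lower memory system, the simulation cache effectively reduces the frequent synchronous communication between a component simulator and the backplane. When a read or write request to shared memory occurs, the framework does not forward each request individually; instead it transfers data in larger units of one cache line and keeps it in the simulation cache located in the component simulator. When a later memory access falls within that cache line, the cached value is returned and asynchronous rather than synchronous communication is issued, which improves performance. Experiments with realistic multimedia examples confirmed that the proposed technique improves simulation performance by 330% on average.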
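Finally, to raise simulation performance even further, we proposed a technique that parallelizes time and space simultaneously. The techniques proposed above are based on space parallelism, in which the degree of parallelism is at most the number of component simulators. Consequently, only a limited degree of parallel execution is possible even when more host processors are available. In the proposed simulation framework, host code execution removes the data dependencies between tasks so that all tasks can be executed in parallel. With the proposed technique, we obtained parallelism beyond the number of processors in the target architecture.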
With the continuous evolution of semiconductor process technology, it is now possible to integrate tens to hundreds of processors in a single chip to make an MPSoC (Multi-Processor System-on-Chip). For design verification of an MPSoC, a virtual prototyping system has been widely used as a cheap and fast method that does not require a hardware prototype. It usually consists of component simulators working together on a single simulation host. As the number of processing components integrated in a chip increases with relentless technology scaling, the simulation performance degrades significantly, to the extent that hardware emulation is sought again.
Parallel simulation aims to accelerate simulation by running component simulators concurrently. However, the extra overhead of communication and synchronization between simulators may overshadow the benefits of parallel simulation. To solve this problem, this thesis proposes a novel simulation framework for efficient parallel simulation of MPSoCs. The key features of the proposed simulation framework are as follows.
First, the proposed parallel simulation framework is scalable and flexible. In the proposed framework, a simulator wrapper performs time synchronization between its associated simulator and the simulation backplane on the simulator's behalf. By integrating the simulator wrapper modules and the simulation backplane into a single process, synchronization overhead is greatly reduced. In addition, component simulators periodically send null messages to the backplane to enable parallel simulation without causality problems.
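The role of null messages can be illustrated with a minimal sketch of conservative time synchronization. It is not the thesis's implementation: the class `NullMessageBackplane`, its method names, and the simulator ids are all hypothetical. The sketch assumes a null message is a promise that the sender will not emit any event earlier than the reported local time, so the backplane may deliver every pending event whose timestamp does not exceed the smallest reported time.

```python
import heapq


class NullMessageBackplane:
    """Toy backplane (hypothetical): delivers an event only when no
    simulator can still send an earlier one."""

    def __init__(self, simulator_ids):
        # Latest local time each simulator has reported so far.
        self.reported_time = {sid: 0 for sid in simulator_ids}
        self.pending = []   # min-heap of (timestamp, seq, payload)
        self._seq = 0       # tie-breaker so payloads are never compared

    def null_message(self, sid, local_time):
        # No payload, only a promise: "nothing earlier than local_time
        # will come from me".
        self.reported_time[sid] = max(self.reported_time[sid], local_time)

    def send_event(self, sid, timestamp, payload):
        heapq.heappush(self.pending, (timestamp, self._seq, payload))
        self._seq += 1
        # A real event also tells the backplane how far the sender has run.
        self.null_message(sid, timestamp)

    def safe_time(self):
        # No simulator can produce an event earlier than this time.
        return min(self.reported_time.values())

    def deliver_safe_events(self):
        safe = self.safe_time()
        delivered = []
        while self.pending and self.pending[0][0] <= safe:
            timestamp, _, payload = heapq.heappop(self.pending)
            delivered.append((timestamp, payload))
        return delivered
```

In this sketch an event from `cpu0` at time 50 stays queued until every other simulator has reported a local time of at least 50. Sending null messages more often releases events sooner but costs more messages, which is the trade-off the null message period controls.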
Second, the proposed simulation framework allows us to tune the simulation configuration for a given application, resulting in the best simulation performance. The configuration consists of a simulation platform, a mapping of component simulators to participating host processors, and the period of null message transfer for time synchronization. To explore the configuration space, we propose a novel performance analysis technique for simulation performance estimation that considers both the characteristics of a target application and the configuration parameters of the simulation host, and we search the space with a genetic algorithm (GA). As a result, the proposed technique enables the efficient exploitation of parallelism by 1) a well-balanced distribution of simulation workloads to host processors and 2) minimized overhead for null message transfer, which in turn leads to maximal simulation performance. The experimental results show that the proposed analysis technique predicts the simulation performance with more than 90% accuracy on average for various target applications and simulation environments, and that we are able to find the optimal configurations for a wide variety of application characteristics and simulation platforms through performance analysis.
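A configuration search of this kind might look like the following sketch. Everything in it is an assumption made for illustration: the workloads, the candidate periods, and above all the cost model `estimate_time`, which is a made-up stand-in and not the thesis's performance estimation formula. The sketch only shows the shape of a genetic algorithm over a (mapping, null message period) pair.

```python
import random

WORKLOADS = [8.0, 5.0, 3.0, 3.0, 2.0, 1.0]  # hypothetical work per simulator
NUM_HOSTS = 3                               # hypothetical host processor count
PERIODS = [10, 50, 100, 500, 1000]          # candidate null-message periods


def estimate_time(config):
    """Hypothetical cost model, NOT the thesis's estimation formula."""
    mapping, period = config
    load = [0.0] * NUM_HOSTS
    for sim, host in enumerate(mapping):
        load[host] += WORKLOADS[sim]
    compute = max(load)             # the busiest host bounds the run time
    null_overhead = 200.0 / period  # short periods send more null messages
    stall = 0.01 * period           # long periods keep simulators waiting
    return compute + null_overhead + stall


def random_config(rng):
    mapping = [rng.randrange(NUM_HOSTS) for _ in WORKLOADS]
    return (mapping, rng.choice(PERIODS))


def crossover(a, b, rng):
    # One-point crossover on the mapping; the period comes from either parent.
    cut = rng.randrange(1, len(WORKLOADS))
    return (a[0][:cut] + b[0][cut:], rng.choice([a[1], b[1]]))


def mutate(config, rng, rate=0.2):
    mapping = [rng.randrange(NUM_HOSTS) if rng.random() < rate else host
               for host in config[0]]
    period = rng.choice(PERIODS) if rng.random() < rate else config[1]
    return (mapping, period)


def search(generations=40, pop_size=20, seed=0):
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(pop_size)]
    history = []  # best estimated time seen at the start of each generation
    for _ in range(generations):
        population.sort(key=estimate_time)
        history.append(estimate_time(population[0]))
        parents = population[: pop_size // 2]  # the better half survives
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    best = min(population, key=estimate_time)
    return best, estimate_time(best), history
```

Calling `search()` returns the best configuration found, its estimated time, and the per-generation history. Because the better half of each generation is kept unchanged, the best estimate never gets worse from one generation to the next.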
Third, to boost the simulation speed further, we propose a novel technique, called relaxed synchronization, which uses a simulation cache at each component simulator for simulation purposes. In the proposed framework, a synchronous communication that waits for a response takes place on every shared memory request from a component simulator. Excessive shared memory access blocks the progress of the simulation and may degrade simulation performance. Like an architectural cache that reduces the memory access frequency, a simulation cache effectively reduces the number of synchronous communications between the corresponding component simulator and the simulation backplane. When a read or write request to a shared memory is made, a cache line, not a single element, is transferred to exploit spatial and temporal locality in the simulation. The proposed technique is based on the assumption that the application program uses a relaxed memory model. Experiments with real-life multimedia applications show that the proposed approach improves the simulation performance by up to 330%.
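The effect of a simulation cache on the number of synchronous round trips can be sketched as follows. The classes `SharedMemoryBackplane` and `SimulationCache`, the line size, and the `synchronize` method are hypothetical names chosen for this illustration, and the write grouping shown is one possible policy under the relaxed memory model assumption, not necessarily the one used in the thesis.

```python
LINE_SIZE = 8  # words per simulation cache line (hypothetical value)


class SharedMemoryBackplane:
    """Stand-in for the simulation backplane that owns the shared memory."""

    def __init__(self, size):
        self.memory = [0] * size
        self.sync_requests = 0   # round trips the simulator has to wait for
        self.async_messages = 0  # messages sent without waiting for a reply

    def read_line(self, base):
        self.sync_requests += 1
        return self.memory[base:base + LINE_SIZE]  # slice is a private copy

    def write_group(self, writes):
        self.async_messages += 1
        for addr, value in writes:
            self.memory[addr] = value


class SimulationCache:
    """Lives inside a component simulator and absorbs shared memory accesses."""

    def __init__(self, backplane):
        self.backplane = backplane
        self.lines = {}           # line base address -> list of words
        self.pending_writes = []  # writes grouped into one later message

    def _line(self, addr):
        base = addr - addr % LINE_SIZE
        if base not in self.lines:
            # Miss: one synchronous round trip fetches the whole line.
            self.lines[base] = self.backplane.read_line(base)
        return base, self.lines[base]

    def read(self, addr):
        base, line = self._line(addr)
        return line[addr - base]

    def write(self, addr, value):
        base, line = self._line(addr)
        line[addr - base] = value
        self.pending_writes.append((addr, value))

    def synchronize(self):
        # At a synchronization point, push grouped writes and drop stale lines.
        if self.pending_writes:
            self.backplane.write_group(self.pending_writes)
            self.pending_writes = []
        self.lines.clear()
```

With an 8-word line, 16 consecutive writes followed by 16 reads cost two synchronous line fetches and one grouped asynchronous message, instead of 32 synchronous round trips. The price is that other simulators do not see the writes until `synchronize` runs, which is only acceptable if the application tolerates a relaxed memory model.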
Last but not least, we propose a simulation technique that exploits both space and time parallelism to boost performance further. The maximum degree of parallelism of the space-parallel approach is confined to the number of target processors that are executed in parallel. In contrast, the proposed technique exploits parallel execution of tasks in different timelines by resolving data dependencies between tasks using redundant host simulation. The proposed technique provides a higher degree of parallelism, beyond the number of processors in the target architecture.
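One way to picture time parallelism is the sketch below. It is a loose interpretation under stated assumptions, not the thesis's method: `stage`, `host_run`, `detailed_simulate`, and the cycle counts are all hypothetical. A cheap host-native pass first computes the input of every task in a dependent chain, and the slow detailed simulations then run independently because none of them has to wait for its predecessor.

```python
from concurrent.futures import ThreadPoolExecutor


def stage(x):
    # Functional behaviour of one task (hypothetical).
    return x * 2 + 1


def host_run(x):
    # Fast native execution: produces the task output, no timing information.
    return stage(x)


def detailed_simulate(x):
    # Stand-in for a slow, timed simulation of the same task.
    # Returns (output, made-up cycle count).
    return stage(x), 100 + x % 7


def time_parallel(first_input, num_tasks):
    # Pass 1 (sequential but cheap): redundant host execution resolves the
    # data dependencies, yielding the input of every task in the chain.
    inputs = [first_input]
    for _ in range(num_tasks - 1):
        inputs.append(host_run(inputs[-1]))
    # Pass 2 (parallel): each detailed simulation starts from a known input.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(detailed_simulate, inputs))
```

The thread pool here only shows the structure. In CPython, CPU-bound simulators would need separate processes or external simulator processes to gain real speedup.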

Language: eng

URI: https://hdl.handle.net/10371/156600

http://dcollection.snu.ac.kr:80/jsp/common/DcLoOrgPer.jsp?sItemId=000000001065

Files in This Item: There are no files associated with this item.

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Computer Science and Engineering (컴퓨터공학부)
  - Theses (Ph.D. / Sc.D._컴퓨터공학부)
