50th International Conference on Parallel Processing, 2021
The performance and scalability of Parallel Discrete Event Simulation (PDES) can be significantly impacted by temporarily inactive threads that occupy CPU resources but do no useful processing. A recent design called Demand-Driven PDES (DD-PDES) identifies such threads and de-schedules them from CPU cores to eliminate the unnecessary overhead. In this paper, we propose significant further improvements to DD-PDES. First, we introduce a new GVT (Global Virtual Time)-guided algorithm named GG-PDES to perform de-scheduling operations in a lock-free fashion and without relying on a centralized controller thread as was used previously. Second, we introduce the Dynamic CPU Affinity algorithm built on top of GG-PDES that adaptively pins simulation threads to CPU cores to achieve a balanced execution. We demonstrate that these optimizations can yield performance improvements in the range of 13% to 50% over the original DD-PDES system.
2018 IEEE/ACM 22nd International Symposium on Distributed Simulation and Real Time Applications (DS-RT), 2018
Recent studies investigated the performance of Parallel Discrete Event Simulation (PDES) on Intel Xeon Phi manycore processors, but generally reported underwhelming performance results, especially at high scales when all cores and thread contexts are fully loaded. While the lack of scalability in an earlier study on a Knights Corner (KC) processor is an artifact of physical limitations of the KC system, performance challenges on a Knights Landing (KNL) system partially stem from a slower global virtual time (GVT) computation algorithm used in that study. In this paper, we re-examine PDES performance on KNL under more efficient GVT algorithms to alleviate the GVT bottleneck. Specifically, we compare a synchronous GVT algorithm based on barrier synchronization and two asynchronous GVT implementations: a modified Mattern's algorithm for shared memory systems and a recently proposed wait-free algorithm. Using the ROSS simulator, we demonstrate that minimizing the GVT bottleneck res...
In this paper, we present PPT-GPU, a scalable performance prediction toolkit for GPUs. PPT-GPU achieves scalability through a hybrid high-level modeling approach where some computations are extrapolated and multiple parts of the model are parallelized. The tool's primary prediction models use pre-collected memory and instruction traces of the workloads to accurately capture the dynamic behavior of the kernels. PPT-GPU reports an extensive array of GPU performance metrics accurately while being easily extensible. We use a broad set of benchmarks to verify prediction accuracy. We compare the results against hardware metrics collected using vendor profiling tools and cycle-accurate simulators. The results show that the performance predictions are highly correlated to the actual hardware (MAPE: < 16% and Correlation: > 0.98). Moreover, PPT-GPU is orders of magnitude faster than cycle-accurate simulators. The comprehensiveness of the collected metrics can guide architects and deve...
Traditional parallel discrete event simulation (PDES) systems treat each simulation thread in the same manner, regardless of whether a thread has events to process in its input queue or not. At the same time, many real-life simulation models exhibit significant execution locality, where only part of the model (and thus a subset of threads) is actively sending or receiving messages in a given time period. The inactive threads still continuously check their queues and participate in simulation-wide time synchronization mechanisms, such as computing Global Virtual Time (GVT). This wastes resources, ties up CPU cores with threads that offer no contribution to event processing, and limits the performance and scalability of the simulation. In this paper, we propose a new paradigm for managing PDES threads that we call Demand-Driven PDES (DD-PDES). The key idea behind DD-PDES is to identify threads that have no events to process and de-schedule them from the CPU until they receive a mess...
Proceedings of the 2021 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, 2021
Performance and scalability of Parallel Discrete Event Simulation (PDES) are often limited by fine-grain communication, especially in execution environments with high communication cost. Low latencies of on-chip communication in emerging manycore processors promise to substantially alleviate conventional PDES bottlenecks. However, scaling to manycore clusters requires balancing faster on-chip communication with slower traditional network communication between cluster nodes. In this work, we investigate the performance of PDES on a cluster of Intel's Knights Landing (KNL) processors, identify performance bottlenecks, and propose techniques to address them. Specifically, we propose three performance optimizations: (1) a new design of the communication buffer centered around the use of atomic compare-and-swap operations to reduce synchronization overhead between a dedicated communication thread and computation threads; (2) careful selection of the number of computation threads per commu...
Papers by Ali Eker