Ali Eker

Papers by Ali Eker

GVT-Guided Demand-Driven Scheduling in Parallel Discrete Event Simulation

50th International Conference on Parallel Processing, 2021

The performance and scalability of Parallel Discrete Event Simulation (PDES) can be significantly impacted by temporarily inactive threads that occupy CPU resources but do no useful processing. A recent design called Demand-Driven PDES (DD-PDES) identifies such threads and de-schedules them from CPU cores to eliminate the unnecessary overhead. In this paper, we propose significant further improvements to DD-PDES. First, we introduce a new GVT (Global Virtual Time)-guided algorithm named GG-PDES to perform de-scheduling operations in a lock-free fashion and without relying on a centralized controller thread as was used previously. Second, we introduce the Dynamic CPU Affinity algorithm built on top of GG-PDES that adaptively pins simulation threads to CPU cores to achieve a balanced execution. We demonstrate that these optimizations can yield performance improvements in the range of 13% to 50% over the original DD-PDES system.


Performance Implications of Global Virtual Time Algorithms on a Knights Landing Processor

2018 IEEE/ACM 22nd International Symposium on Distributed Simulation and Real Time Applications (DS-RT), 2018

Recent studies investigated the performance of Parallel Discrete Event Simulation (PDES) on Intel Xeon Phi manycore processors, but generally reported underwhelming performance results, especially at high scales when all cores and thread contexts are fully loaded. While the lack of scalability in an earlier study on a Knights Corner (KC) processor is an artifact of physical limitations of the KC system, performance challenges on a Knights Landing (KNL) system partially stem from a slower global virtual time (GVT) computation algorithm used in that study. In this paper, we re-examine PDES performance on KNL under more efficient GVT algorithms to alleviate the GVT bottleneck. Specifically, we compare a synchronous GVT algorithm based on barrier synchronization, and two asynchronous GVT implementations: a modified Mattern's algorithm for shared memory systems and a recently-proposed wait-free algorithm. Using the ROSS simulator, we demonstrate that minimizing the GVT bottleneck res...


Hybrid, scalable, trace-driven performance modeling of GPGPUs

In this paper, we present PPT-GPU, a scalable performance prediction toolkit for GPUs. PPT-GPU achieves scalability through a hybrid high-level modeling approach where some computations are extrapolated and multiple parts of the model are parallelized. The tool's primary prediction models use pre-collected memory and instruction traces of the workloads to accurately capture the dynamic behavior of the kernels. PPT-GPU reports an extensive array of GPU performance metrics accurately while being easily extensible. We use a broad set of benchmarks to verify prediction accuracy. We compare the results against hardware metrics collected using vendor profiling tools and cycle-accurate simulators. The results show that the performance predictions are highly correlated to the actual hardware (MAPE: < 16% and Correlation: > 0.98). Moreover, PPT-GPU is orders of magnitude faster than cycle-accurate simulators. This comprehensiveness of the collected metrics can guide architects and deve...

Demand-Driven PDES: Exploiting Locality in Simulation Models

Traditional parallel discrete event simulation (PDES) systems treat each simulation thread in the same manner, regardless of whether a thread has events to process in its input queue or not. At the same time, many real-life simulation models exhibit significant execution locality, where only part of the model (and thus a subset of threads) is actively sending or receiving messages in a given time period. The inactive threads still continuously check their queues and participate in simulation-wide time synchronization mechanisms, such as computing Global Virtual Time (GVT). This wastes resources, ties up CPU cores with threads that offer no contribution to event processing, and limits the performance and scalability of the simulation. In this paper, we propose a new paradigm for managing PDES threads that we call Demand-Driven PDES (DD-PDES). The key idea behind DD-PDES is to identify threads that have no events to process and de-schedule them from the CPU until they receive a mess...


Controlled Asynchronous GVT

Proceedings of the 48th International Conference on Parallel Processing


Load-Aware Dynamic Time Synchronization in Parallel Discrete Event Simulation

Proceedings of the 2021 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation


High-Performance PDES on Manycore Clusters

Proceedings of the 2021 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, 2021

Performance and scalability of Parallel Discrete Event Simulation (PDES) are often limited by fine-grain communication, especially in execution environments with high communication cost. Low latencies of on-chip communication in emerging manycore processors promise to substantially alleviate conventional PDES bottlenecks. However, scaling to manycore clusters requires balancing faster on-chip communication with slower traditional network communication between cluster nodes. In this work, we investigate performance of PDES on a cluster of Intel's Knights Landing (KNL) processors, identify performance bottlenecks, and propose techniques to address them. Specifically, we propose three performance optimizations: (1) a new design of the communication buffer centered around the use of atomic compare-and-swap operations to reduce synchronization overhead between a dedicated communication thread and computation threads; (2) careful selection of the number of computation threads per commu...
