-
Machine Learning for Microprocessor Performance Bug Localization
Authors:
Erick Carvajal Barboza,
Mahesh Ketkar,
Michael Kishinevsky,
Paul Gratz,
Jiang Hu
Abstract:
The validation process for microprocessors is a very complex task that consumes substantial engineering time during the design process. Bugs that degrade overall system performance, without affecting its functional correctness, are particularly difficult to debug given the lack of a golden reference for bug-free performance. This work introduces two automated performance bug localization methodologies based on machine learning that aim to aid the debugging process. Our results show that, for the evaluated microprocessor core performance bugs whose average IPC impact is greater than 1%, our best-performing technique is able to localize the exact microarchitectural unit of the bug ~77% of the time, while achieving a top-3 unit accuracy (out of 11 possible locations) of over 90% for bugs with the same average IPC impact. The proposed system in our simulation setup requires only a few seconds to perform a bug location inference, which leads to a reduced debugging time.
Submitted 27 March, 2023;
originally announced March 2023.
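The "top-3 unit accuracy" quoted in the abstract above is a top-k metric: a bug counts as localized if its true unit appears among the k highest-ranked candidates. The following is a minimal sketch of that metric only, not the authors' code; the function name `top_k_accuracy` and the unit labels are hypothetical and chosen for illustration.

```python
def top_k_accuracy(ranked_predictions, true_units, k):
    """Fraction of bugs whose true unit is among the k highest-ranked candidates.

    ranked_predictions: one list per bug, ordered from most to least likely unit.
    true_units: the actual buggy unit for each bug (same order).
    """
    if len(ranked_predictions) != len(true_units) or not true_units:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_units)
        if truth in ranked[:k]
    )
    return hits / len(true_units)


# Hypothetical rankings for four bugs (unit names are illustrative only).
ranked = [
    ["ALU", "LSU", "ROB"],
    ["LSU", "ALU", "ROB"],
    ["ROB", "BPU", "ALU"],
    ["BPU", "ROB", "LSU"],
]
truth = ["ALU", "ALU", "LSU", "LSU"]

exact = top_k_accuracy(ranked, truth, k=1)
top3 = top_k_accuracy(ranked, truth, k=3)
```

With k=1 the metric reduces to exact-unit accuracy, which is the other figure the abstract reports.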
-
Machine Learning-based Low Overhead Congestion Control Algorithm for Industrial NoCs
Authors:
Shruti Yadav Narayana,
Sumit K. Mandal,
Raid Ayoub,
Michael Kishinevsky,
Umit Y. Ogras
Abstract:
Network-on-Chip (NoC) congestion builds up during heavy traffic load and cripples the system performance by stalling the cores. Moreover, congestion leads to wasted link bandwidth due to blocked buffers and bouncing packets. Existing approaches throttle the cores after congestion is detected, reducing efficiency and wasting line bandwidth unnecessarily. In contrast, we propose a lightweight machine learning-based technique that helps predict congestion in the network. Specifically, our proposed technique collects the features related to traffic at each destination. Then, it labels the features using a novel time reversal approach. The labeled data is used to design a low-overhead and explainable decision tree model used for runtime congestion control. Experimental evaluations with synthetic and real traffic on an industrial 6×6 NoC show that the proposed approach increases fairness and memory read bandwidth by up to 114% with respect to an existing congestion control technique while incurring less than 0.01% of overhead.
Submitted 24 February, 2023;
originally announced February 2023.
-
Theoretical Analysis and Evaluation of NoCs with Weighted Round-Robin Arbitration
Authors:
Sumit K. Mandal,
Jie Tong,
Raid Ayoub,
Michael Kishinevsky,
Ahmed Abousamra,
Umit Y. Ogras
Abstract:
Fast and accurate performance analysis techniques are essential in early design space exploration and pre-silicon evaluations, including software eco-system development. In particular, on-chip communication continues to play an increasingly important role as many-core processors scale up. This paper presents the first performance analysis technique that targets networks-on-chip (NoCs) that employ weighted round-robin (WRR) arbitration. Besides fairness, WRR arbitration provides flexibility in allocating bandwidth proportionally to the importance of the traffic classes, unlike basic round-robin and priority-based arbitration. The proposed approach first estimates the effective service time of the packets in the queue due to WRR arbitration. Then, it uses the effective service time to compute the average waiting time of the packets. Next, we incorporate a decomposition technique to extend the analytical model to handle NoCs of any size. The proposed approach achieves less than 5% error while executing real applications and 10% error under challenging synthetic traffic with different burstiness levels.
Submitted 11 August, 2023; v1 submitted 21 August, 2021;
originally announced August 2021.
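To make the arbitration policy named in the abstract above concrete, here is a minimal sketch of generic weighted round-robin service: in each round, every queue may send up to its weight in packets before the arbiter moves on. This is a textbook-style illustration under that assumption, not the paper's analytical model or any specific NoC arbiter; the function name `wrr_schedule` and the packet labels are hypothetical.

```python
from collections import deque


def wrr_schedule(queues, weights):
    """Return the service order produced by a simple weighted round-robin arbiter.

    queues: list of packet lists, one per traffic class.
    weights: positive integers; queue i may send up to weights[i] packets per round.
    """
    if len(queues) != len(weights):
        raise ValueError("need exactly one weight per queue")
    if any(w < 1 for w in weights):
        raise ValueError("weights must be positive integers")

    pending = [deque(q) for q in queues]
    order = []
    while any(pending):
        # One arbitration round: visit each queue in fixed order.
        for q, w in zip(pending, weights):
            for _ in range(w):
                if not q:
                    break  # an empty queue forfeits the rest of its share
                order.append(q.popleft())
    return order


# Class A has weight 2, class B has weight 1.
order = wrr_schedule([["a1", "a2", "a3", "a4"], ["b1", "b2"]], [2, 1])
```

While both queues are backlogged, class A receives two grants for every one given to class B, which is the proportional bandwidth allocation the abstract contrasts with basic round-robin and strict priority arbitration.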
-
Automatic Microprocessor Performance Bug Detection
Authors:
Erick Carvajal Barboza,
Sara Jacob,
Mahesh Ketkar,
Michael Kishinevsky,
Paul Gratz,
Jiang Hu
Abstract:
Processor design validation and debug is a difficult and complex task, which consumes the lion's share of the design process. Design bugs that affect processor performance rather than its functionality are especially difficult to catch, particularly in new microarchitectures. This is because, unlike functional bugs, the correct processor performance of new microarchitectures on complex, long-running benchmarks is typically not deterministically known. Thus, when performance benchmarking new microarchitectures, performance teams may assume that the design is correct when the performance of the new microarchitecture exceeds that of the previous generation, despite significant performance regressions existing in the design. In this work, we present a two-stage, machine learning-based methodology that is able to detect the existence of performance bugs in microprocessors. Our results show that our best technique detects 91.5% of microprocessor core performance bugs whose average IPC impact across the studied applications is greater than 1% versus a bug-free design with zero false positives. When evaluated on memory system bugs, our technique achieves 100% detection with zero false positives. Moreover, the detection is automatic, requiring very little performance engineer time.
Submitted 19 November, 2020; v1 submitted 17 November, 2020;
originally announced November 2020.
-
Online Adaptive Learning for Runtime Resource Management of Heterogeneous SoCs
Authors:
Sumit K. Mandal,
Umit Y. Ogras,
Janardhan Rao Doppa,
Raid Z. Ayoub,
Michael Kishinevsky,
Partha P. Pande
Abstract:
Dynamic resource management has become one of the major areas of research in modern computer and communication system design due to lower power consumption and higher performance demands. The number of integrated cores, level of heterogeneity and amount of control knobs increase steadily. As a result, the system complexity is increasing faster than our ability to optimize and dynamically manage the resources. Moreover, offline approaches are sub-optimal due to workload variations and large volume of new applications unknown at design time. This paper first reviews recent online learning techniques for predicting system performance, power, and temperature. Then, we describe the use of predictive models for online control using two modern approaches: imitation learning (IL) and an explicit nonlinear model predictive control (NMPC). Evaluations on a commercial mobile platform with 16 benchmarks show that the IL approach successfully adapts the control policy to unknown applications. The explicit NMPC provides 25% energy savings compared to a state-of-the-art algorithm for multi-variable power management of modern GPU sub-systems.
Submitted 21 August, 2020;
originally announced August 2020.
-
Performance Analysis of Priority-Aware NoCs with Deflection Routing under Traffic Congestion
Authors:
Sumit K. Mandal,
Anish Krishnakumar,
Raid Ayoub,
Michael Kishinevsky,
Umit Y. Ogras
Abstract:
Priority-aware networks-on-chip (NoCs) are used in industry to achieve predictable latency under different workload conditions. These NoCs incorporate deflection routing to minimize queuing resources within routers and achieve low latency during low traffic load. However, deflected packets can exacerbate congestion during high traffic load since they consume the NoC bandwidth. State-of-the-art analytical models for priority-aware NoCs ignore deflected traffic despite its significant latency impact during congestion. This paper proposes a novel analytical approach to estimate end-to-end latency of priority-aware NoCs with deflection routing under bursty and heavy traffic scenarios. Experimental evaluations show that the proposed technique outperforms alternative approaches and estimates the average latency for real applications with less than 8% error compared to cycle-accurate simulations.
Submitted 8 November, 2020; v1 submitted 10 August, 2020;
originally announced August 2020.
-
Analytical Performance Modeling of NoCs under Priority Arbitration and Bursty Traffic
Authors:
Sumit K. Mandal,
Raid Ayoub,
Michael Kishinevsky,
Mohammad M. Islam,
Umit Y. Ogras
Abstract:
Networks-on-Chip (NoCs) used in commercial many-core processors typically incorporate priority arbitration. Moreover, they experience bursty traffic due to application workloads. However, most state-of-the-art NoC analytical performance analysis techniques assume fair arbitration and simple traffic models. To address these limitations, we propose an analytical modeling technique for priority-aware NoCs under bursty traffic. Experimental evaluations with synthetic and bursty traffic show that the proposed approach has less than 10% modeling error with respect to a cycle-accurate NoC simulator.
Submitted 27 July, 2020;
originally announced July 2020.
-
Analytical Performance Models for NoCs with Multiple Priority Traffic Classes
Authors:
Sumit K. Mandal,
Raid Ayoub,
Michael Kishinevsky,
Umit Y. Ogras
Abstract:
Networks-on-chip (NoCs) have become the standard for interconnect solutions in industrial designs ranging from client CPUs to many-core chip-multiprocessors. Since NoCs play a vital role in system performance and power consumption, pre-silicon evaluation environments include cycle-accurate NoC simulators. Long simulations increase the execution time of evaluation frameworks, which are already notoriously slow, and prohibit design-space exploration. Existing analytical NoC models, which assume fair arbitration, cannot replace these simulations since industrial NoCs typically employ priority schedulers and multiple priority classes. To address this limitation, we propose a systematic approach to construct priority-aware analytical performance models using micro-architecture specifications and input traffic. Our approach consists of developing two novel transformations of queuing system and designing an algorithm which iteratively uses these two transformations to estimate end-to-end latency. Our approach decomposes the given NoC into individual queues with modified service time to enable accurate and scalable latency computations. Specifically, we introduce novel transformations along with an algorithm that iteratively applies these transformations to decompose the queuing system. Experimental evaluations using real architectures and applications show high accuracy of 97% and up to 2.5x speedup in full-system simulation.
Submitted 3 January, 2020; v1 submitted 6 August, 2019;
originally announced August 2019.