2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2019
Nowadays, big data and machine learning are transforming the way we perceive and manage our data. Even though the healthcare domain has recognized big data analytics as a prominent candidate, it has not yet fully grasped their promising benefits, which allow medical information to be converted into useful knowledge. In this paper, we introduce AEGLE's big data infrastructure, provided as a Platform as a Service. Using the suite of genomic analytics from the Chronic Lymphocytic Leukaemia (CLL) use case, we show that on-demand acceleration is profitable w.r.t. a pure software cloud-based solution. However, we also show that on-demand acceleration does not come as a "free lunch", and we provide an in-depth analysis and lessons learnt on the co-design implications that must be carefully considered to enable cost-effective acceleration at the cloud level.
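The profitability question behind on-demand acceleration can be illustrated with a simple break-even calculation. The sketch below is purely illustrative: the speedup, offload overhead, and hourly prices are hypothetical placeholders, not figures reported in the paper.

```cpp
#include <iostream>

// Illustrative break-even model for on-demand acceleration (all numbers are
// hypothetical placeholders, not values reported in the paper).
struct Job {
    double sw_hours;        // runtime of the pure-software cloud execution
    double speedup;         // kernel speedup offered by the accelerator
    double offload_hours;   // fixed data-movement / setup overhead per job
};

// Returns true when renting the accelerated instance is cheaper for this job.
bool acceleration_pays_off(const Job& j,
                           double sw_rate_per_hour,   // plain VM price
                           double acc_rate_per_hour)  // accelerated VM price
{
    double acc_hours = j.sw_hours / j.speedup + j.offload_hours;
    return acc_hours * acc_rate_per_hour < j.sw_hours * sw_rate_per_hour;
}

int main() {
    Job cll_batch{10.0, 20.0, 0.5};   // hypothetical CLL analytics batch
    std::cout << (acceleration_pays_off(cll_batch, 0.5, 2.0)
                      ? "accelerate\n" : "stay in software\n");
}
```

The interplay of offload overhead and price ratio is exactly where the "no free lunch" co-design implications appear.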
Zenodo (CERN European Organization for Nuclear Research), May 16, 2023
Function-as-a-Service (FaaS) represents the next frontier in the evolution of cloud computing: an emerging paradigm that removes the burden of configuration and management from users. This is achieved by replacing the well-established monolithic approach with graphs of standalone, small, stateless, event-driven components called functions. At the same time, from the cloud providers' perspective, problems such as availability, load balancing and scalability need to be resolved without awareness of the functionality, behavior or resource requirements of their tenants' code. In this context, however, function containers coexist with others inside a host of finite resources, where a passive resource allocation technique does not guarantee a well-defined quality of service (QoS) with respect to latency. In this paper, we present Sequence Clock, an extensible latency-targeting tool that actively monitors serverless invocations in a cluster and offers execution of sequential chains of functions, also known as pipelines or sequences, while achieving the targeted latency. Two regulation methods were evaluated, with one of them achieving up to an 82% decrease in the severity of latency violations and, in some cases, eliminating them completely.
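The abstract does not detail the two regulation methods, so the sketch below shows only a generic proportional controller that scales a function's CPU quota towards a latency target; the class, gain, and quota bounds are assumptions for illustration, not Sequence Clock's actual algorithm.

```cpp
#include <algorithm>
#include <iostream>

// Generic latency regulator (not Sequence Clock's actual mechanism): scales a
// function's CPU quota proportionally to the observed latency error.
class LatencyRegulator {
public:
    LatencyRegulator(double target_ms, double gain)
        : target_ms_(target_ms), gain_(gain) {}

    // Returns an updated CPU quota (in fractional cores) for the next window.
    double update(double observed_ms, double current_quota) const {
        double error = (observed_ms - target_ms_) / target_ms_;  // relative error
        double next  = current_quota * (1.0 + gain_ * error);
        return std::clamp(next, 0.1, 4.0);   // keep the quota within sane bounds
    }

private:
    double target_ms_;
    double gain_;
};

int main() {
    LatencyRegulator reg(200.0 /*target ms*/, 0.5 /*gain*/);
    double quota = 1.0;
    for (double observed : {350.0, 260.0, 210.0, 190.0}) {
        quota = reg.update(observed, quota);
        std::cout << "observed=" << observed << "ms -> quota=" << quota << " cores\n";
    }
}
```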
2017 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)
This paper presents the cloud infrastructure of the AEGLE project, which aims to integrate cloud technologies with heterogeneous reconfigurable computing in large-scale healthcare systems for Big Bio-Data analytics. AEGLE's engineering concept brings together state-of-the-art big-data engines with emerging acceleration technologies, laying the basis for personalized and integrated healthcare services while also promoting related research activities. We introduce the design of AEGLE's accelerated infrastructure along with the corresponding software and hardware acceleration stacks supporting various big data analytics workloads, showing that, through effective resource containerization, AEGLE's cloud infrastructure is able to support high heterogeneity with respect to storage types, execution engines, utilized tools and execution platforms. Special care is given to the integration of high-performance accelerators within the overall software stack of AEGLE's infrastructure, which enables efficient execution of analytics, with speedups of up to 140× over pure software executions according to our preliminary evaluations.
Electrocardiogram (ECG) analysis has been established as a key factor for analysing and assessing the health status of a person. The ECG analysis flow is complex, relies on machine learning algorithms such as the Support Vector Machine (SVM) classifier, and requires hardware acceleration to be executed in real time. In this paper we focus on utilizing High Level Synthesis (HLS) capabilities to produce efficient SVM hardware accelerators targeting ECG analysis. Our case study is arrhythmia detection using the MIT-BIH ECG signal medical database. We show that, as a first step, the original code under acceleration can be restructured to create instances that are efficiently transformed into a HW accelerator. As a second step, an exploration is performed on the transformed code to determine which HLS directives produce the best outcome in terms of various performance and resource utilization metrics. Our combined analysis shows that we can achieve up to a 94% execution latency gain compared to the original SVM code, and the designer is provided with the infrastructure necessary to decide the best trade-off between gains in latency and increases in utilized FPGA HW resources.
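To make the restructuring-plus-directives idea concrete, below is a minimal HLS-friendly sketch of an SVM decision function (linear kernel, dual form) annotated with Vivado-HLS-style pragmas. The fixed sizes, loop labels, and the specific pragma choices are illustrative assumptions, not the directive set explored in the paper.

```cpp
// Minimal HLS-style sketch of a restructured SVM decision function.
#define N_SV       64   // number of support vectors (hypothetical)
#define N_FEATURES 16   // ECG feature-vector length (hypothetical)

int svm_decide(const float sv[N_SV][N_FEATURES],
               const float alpha_y[N_SV],   // alpha_i * y_i, precomputed
               const float x[N_FEATURES],
               float bias)
{
#pragma HLS ARRAY_PARTITION variable=x complete dim=1
    float acc = bias;
SV_LOOP:
    for (int i = 0; i < N_SV; ++i) {
#pragma HLS PIPELINE II=1
        float dot = 0.0f;
DOT_LOOP:
        for (int j = 0; j < N_FEATURES; ++j) {
#pragma HLS UNROLL
            dot += sv[i][j] * x[j];
        }
        acc += alpha_y[i] * dot;
    }
    return acc >= 0.0f ? 1 : -1;   // e.g. normal vs. arrhythmic beat
}
```

The directive exploration described in the abstract would then vary choices such as pipelining depth, unroll factors, and array partitioning to trade latency against FPGA resource usage.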
Zenodo (CERN European Organization for Nuclear Research), May 16, 2023
Serverless is an emerging paradigm that greatly simplifies the usage of cloud resources, providing unprecedented auto-scaling, simplicity, and cost-efficiency features. Thus, more and more individuals and organizations adopt it to increase their productivity and focus exclusively on the functionality of their applications. Additionally, the cloud is expanding towards the deep edge, forming a continuum in which the event-driven nature of the serverless paradigm seems to be a perfect match. The extreme heterogeneity introduced, in terms of the diverse hardware resources and frameworks available, requires systematic approaches for evaluating serverless deployments. In this paper, we propose a methodology for evaluating serverless frameworks deployed on hybrid edge-cloud clusters. Our methodology focuses on key performance knobs of the serverless paradigm and applies a systematic way of evaluating these aspects in hybrid edge-cloud environments. We apply our methodology to three open-source serverless frameworks, namely OpenFaaS, Openwhisk, and Lean Openwhisk, and we provide key insights regarding their performance implications on resource-constrained edge devices.
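One of the basic measurements behind such a methodology is per-invocation end-to-end latency. The harness below is a generic sketch: the `invoke` callable is a hypothetical placeholder for a framework-specific trigger (e.g. an HTTP request to a gateway) and is not part of any real framework API.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <iostream>
#include <vector>

// Measures end-to-end latency of repeated serverless invocations.
std::vector<double> measure_latencies(const std::function<void()>& invoke, int runs) {
    std::vector<double> ms;
    ms.reserve(runs);
    for (int i = 0; i < runs; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        invoke();                                   // placeholder for a real trigger
        auto t1 = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return ms;
}

int main() {
    auto fake_invoke = [] { /* stand-in for an actual function invocation */ };
    auto lat = measure_latencies(fake_invoke, 100);
    std::sort(lat.begin(), lat.end());
    std::cout << "p50=" << lat[lat.size() / 2] << "ms  "
              << "p99=" << lat[lat.size() * 99 / 100] << "ms\n";
}
```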
ACM Transactions on Embedded Computing Systems, Oct 13, 2016
Many-accelerator Systems-on-Chip (SoC) have recently emerged as a promising platform paradigm that combines parallelization with heterogeneity in order to cover the increasing demands for high performance and energy efficiency. To exploit the full potential of many-accelerator systems, automated design verification and analysis frameworks are required, targeting both computational and interconnection optimization. Accurate simulation of interconnection schemes should use real stimuli, produced by fully functional nodes, which requires prototyping the processing elements and memories of the many-accelerator system. In this article, we argue that the Hierarchical Network-on-Chip (HNoC) scheme forms a very promising solution for many-accelerator systems in terms of scalability and data-congestion minimization. We present a parameterizable SystemC prototyping framework for HNoCs, targeted to domain-specific many-accelerator systems. The framework supports the prototyping of processing elements, memory modules, and the underlying interconnection infrastructure, while it provides an API for their easy integration into the HNoC. Finally, it enables holistic system simulation using real node data. Using a many-accelerator system for an MRI pipeline as a case study, we analyze the proposed framework to demonstrate the impact of the system parameters. Through extensive experimental analysis, we show the superiority of HNoC schemes in comparison to typical interconnection architectures. Finally, we show that, by adopting the proposed many-accelerator design flow, significant performance improvements are achieved, from 1.2× up to 26×, compared to an x86 software implementation of the MRI pipeline. CCS Concepts: • Hardware → Buses and high-speed links; Modeling and parameter extraction; Emerging architectures; • Computer systems organization → Special purpose systems;
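To illustrate the flavour of SystemC node prototyping with real stimuli, here is a minimal sketch (it requires the SystemC library to build): a source feeds packets to a node over sc_fifo channels and the node models processing latency before forwarding them. The Packet struct, module names, and timing are illustrative assumptions, not the framework's actual API.

```cpp
#include <systemc.h>
#include <ostream>

// Toy packet exchanged between nodes (illustrative, not the framework's API).
struct Packet { int src, dst, payload; };
inline std::ostream& operator<<(std::ostream& os, const Packet& p) {
    return os << "{" << p.src << "->" << p.dst << ":" << p.payload << "}";
}

// Minimal node: consumes packets from its input FIFO and forwards them after
// modeling some processing latency; a real accelerator node would do work here.
SC_MODULE(Node) {
    sc_fifo_in<Packet>  in;
    sc_fifo_out<Packet> out;
    void run() {
        while (true) {
            Packet p = in.read();   // blocking read = real stimuli
            p.payload *= 2;         // stand-in for accelerator computation
            wait(10, SC_NS);        // model processing latency
            out.write(p);
        }
    }
    SC_CTOR(Node) { SC_THREAD(run); }
};

// Simple stimulus generator.
SC_MODULE(Source) {
    sc_fifo_out<Packet> out;
    void run() { for (int i = 0; i < 4; ++i) out.write(Packet{0, 1, i}); }
    SC_CTOR(Source) { SC_THREAD(run); }
};

int sc_main(int, char*[]) {
    sc_fifo<Packet> to_node(4), from_node(4);
    Source src("src");
    Node   node("node");
    src.out(to_node);
    node.in(to_node);
    node.out(from_node);
    sc_start(100, SC_NS);
    return 0;
}
```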
IEEE Transactions on Parallel and Distributed Systems, Feb 1, 2022
The GPU is the dominant platform for accelerating general-purpose workloads due to its computing capacity and cost-efficiency, and GPU applications cover an ever-growing range of domains. To achieve high throughput, GPUs rely on massive multi-threading and fast context switching to overlap computations with memory operations. We observe that, among the diverse GPU workloads, there exists a significant class of kernels that fail to maintain a sufficient number of active warps to hide the latency of memory operations and thus suffer from frequent stalling. We argue that the dominant Thread-Level Parallelism model is not enough to efficiently accommodate the variability of modern GPU applications. To address this inherent inefficiency, we propose a novel micro-architecture with lightweight Out-Of-Order execution capability, enabling Instruction-Level Parallelism to complement the conventional Thread-Level Parallelism model. To minimize the hardware overhead, we carefully design our extension to heavily reuse the existing micro-architectural structures and study various design trade-offs to contain the overall area and power overhead while providing improved performance. We show that the proposed architecture outperforms traditional platforms by 23 percent on average for low-occupancy kernels, with an area and power overhead of 1.29 and 10.05 percent, respectively. Finally, we establish the potential of our proposal as a micro-architecture alternative by providing a 16 percent speedup over a wide collection of 60 general-purpose kernels.
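The core idea, issuing younger independent instructions while an older memory instruction is in flight, can be modeled conceptually in software. The sketch below is a toy scoreboard over a small issue window; it is a conceptual illustration of lightweight in-window out-of-order issue, not the paper's micro-architecture, and all structure sizes are assumptions.

```cpp
#include <array>
#include <deque>
#include <iostream>
#include <optional>

// Toy instruction: destination and two source registers, plus a load flag.
struct Instr { int dst, src0, src1; bool is_load; };

class IssueWindow {
public:
    void push(const Instr& i) { window_.push_back(i); }

    // Issue the oldest instruction whose sources have no pending writes; loads
    // mark their destination busy until the memory result returns.
    std::optional<Instr> issue() {
        for (auto it = window_.begin(); it != window_.end(); ++it) {
            if (!busy_[it->src0] && !busy_[it->src1]) {
                Instr picked = *it;
                if (picked.is_load) busy_[picked.dst] = true;
                window_.erase(it);
                return picked;
            }
        }
        return std::nullopt;   // no ILP available in the window this cycle
    }

    void writeback(int reg) { busy_[reg] = false; }   // memory result arrived

private:
    std::deque<Instr> window_;
    std::array<bool, 64> busy_{};   // per-register pending-write flags
};

int main() {
    IssueWindow w;
    w.push({1, 2, 3, true});    // load  r1 <- mem[r2 + r3]
    w.push({4, 1, 5, false});   // add   r4 <- r1 + r5   (dependent on the load)
    w.push({6, 7, 8, false});   // add   r6 <- r7 + r8   (independent)
    w.issue();                  // the load issues, r1 is now pending
    auto next = w.issue();      // skips the dependent add
    std::cout << "issued dst r" << next->dst << " while the load is in flight\n";
}
```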
Modern computing systems are dealing with a diverse set of complex and dynamic workloads in the presence of varying job arrival rates. This diversity raises the need for sophisticated run-time mechanisms that efficiently manage the system's resources. In addition, as we move towards kilo-core processor architectures, centralized resource management approaches will most probably form a severe performance bottleneck; thus, the study of Distributed Run-Time Resource Management (DRTRM) schemes is now gaining a lot of attention. In this paper, we propose a job-arrival-aware DRTRM framework for applications with malleable characteristics, implemented on top of the Intel Single-Chip Cloud Computer (SCC) many-core platform. We show that resource allocation is highly affected not only by the internal decision mechanisms but also by the incoming application arrival rate of the system. Based on this observation, we propose an effective admission control strategy utilizing Voltage and Frequency Scaling (VFS) of parts of the DRTRM, which retains distributed decision making, thereby improving system performance in combination with significant gains in consumed energy. Index Terms: distributed resource management, run-time, many-core, Intel SCC
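A simplified sketch of the arrival-aware idea follows: when jobs arrive slowly, the cores running the distributed controllers are slowed down to save energy, and a burst restores full speed before new jobs are admitted. The thresholds, levels, and the set_vf_level hook are hypothetical illustrations, not the actual DRTRM implementation or the Intel SCC power API.

```cpp
#include <iostream>

// Hypothetical hook into the platform's DVFS interface (the real SCC API is
// not modeled here).
void set_vf_level(int level) {
    std::cout << "controller cores -> VF level " << level << "\n";
}

// Arrival-aware admission control sketch with illustrative thresholds.
class AdmissionController {
public:
    bool admit(double arrivals_per_sec) {
        int level = arrivals_per_sec > 5.0 ? 2      // burst: full speed
                  : arrivals_per_sec > 1.0 ? 1      // moderate load
                                           : 0;     // near-idle: lowest VF pair
        if (level != current_level_) {
            set_vf_level(level);
            current_level_ = level;
        }
        return true;   // a real DRTRM could also defer admission here
    }
private:
    int current_level_ = 2;
};

int main() {
    AdmissionController ac;
    for (double rate : {0.4, 2.5, 8.0}) ac.admit(rate);
}
```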
Many-Accelerator (MA) systems have been introduced as a promising architectural paradigm that can boost performance and improve the power efficiency of general-purpose computing platforms. In this paper, we focus on the problem of resource under-utilization, i.e. Dark Silicon, in FPGA-based MA platforms. We show that, beyond the typically expected peak power budget, on-chip memory resources form a severe under-utilization factor in MA platforms, leading to up to 75% of dark silicon. Recognizing that static memory allocation, the de-facto mechanism supported by modern design techniques and synthesis tools, forms the main source of memory-induced Dark Silicon, we introduce a novel framework that extends conventional High Level Synthesis (HLS) with dynamic memory management (DMM) features, enabling accelerators to dynamically adapt their allocated memory to the runtime memory requirements and thus maximizing the overall accelerator count through effective sharing of the FPGA's memory resources. We show that our technique delivers significant gains in FPGA accelerator density, i.e. up to 3.8×, and application throughput gains of up to 3.1× and 21.4× for shared and private memory accelerators, respectively.
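The contrast between statically sized HLS buffers and a dynamically managed on-chip heap can be sketched as follows. The hls_malloc/hls_free calls and the bump-pointer pool are hypothetical stand-ins for the framework's DMM primitives, not a real HLS vendor API.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical on-chip heap shared by accelerators (illustrative size).
static uint8_t shared_bram[64 * 1024];
static size_t  bump_ptr = 0;   // trivial bump allocator, just for the sketch

void* hls_malloc(size_t bytes) {
    if (bump_ptr + bytes > sizeof(shared_bram)) return nullptr;
    void* p = &shared_bram[bump_ptr];
    bump_ptr += bytes;
    return p;
}
void hls_free(void*) { /* a real DMM would recycle blocks; omitted here */ }

// Static variant: the worst-case buffer is reserved even for short inputs,
// which is the memory-induced dark-silicon source described above.
void accel_static(const int* in, int* out, int n) {
    static int buf[4096];   // always occupies 4096 words of on-chip memory
    for (int i = 0; i < n; ++i) buf[i] = in[i] * 3;
    for (int i = 0; i < n; ++i) out[i] = buf[i];
}

// DMM variant: the buffer matches the runtime requirement, so the unused
// memory can host additional accelerators.
void accel_dynamic(const int* in, int* out, int n) {
    int* buf = static_cast<int*>(hls_malloc(n * sizeof(int)));
    if (!buf) return;        // pool exhausted
    for (int i = 0; i < n; ++i) buf[i] = in[i] * 3;
    for (int i = 0; i < n; ++i) out[i] = buf[i];
    hls_free(buf);
}

int main() {
    int in[8] = {1, 2, 3, 4, 5, 6, 7, 8}, out[8];
    accel_dynamic(in, out, 8);
    return out[7] == 24 ? 0 : 1;
}
```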
ACM Transactions on Cyber-Physical Systems, Jun 13, 2018
The Internet-of-Things (IoT) envisions an infrastructure of ubiquitous networked smart devices offering advanced monitoring and control services. The current state of the art in IoT architectures utilizes gateways to enable application-specific connectivity to IoT devices. In typical configurations, IoT gateways are shared among several IoT edge devices. Given the limited available bandwidth and processing capabilities of an IoT gateway, the service quality (SQ) of connected IoT edge devices must be adjusted over time, not only to fulfill the needs of individual IoT device users but also to accommodate the SQ needs of the other IoT edge devices sharing the same gateway. However, having multiple gateways introduces an interdependent problem, binding, i.e., which IoT device shall connect to which gateway. In this article, we jointly address the binding and allocation problems of IoT edge devices in a multi-gateway system under the constraints of available bandwidth, processing power, and battery lifetime. We propose a distributed trade-based mechanism in which, after an initial setup, gateways negotiate and trade IoT edge devices to increase the overall SQ. We evaluate the efficiency of the proposed approach with a case study and through extensive experimentation over different IoT system configurations regarding the number and type of the employed IoT edge devices. Experiments show that our solution improves the overall SQ by up to 56% compared to an unsupervised system. Our solution also achieves up to a 24.6% improvement in overall SQ compared to the state-of-the-art SQ management scheme, while both meet the battery lifetime constraints of the IoT devices.
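One negotiation step of a trade-based scheme can be sketched as: move a device from one gateway to another when the move raises total SQ and the receiving gateway can still serve it. The SQ model and the fields below are illustrative assumptions; the article's formulation also accounts for processing power and battery lifetime.

```cpp
#include <vector>

// Illustrative models only; the article's SQ formulation is richer than this.
struct Device  { double bandwidth_need; double sq_here; double sq_there; };
struct Gateway { double bandwidth_free; std::vector<Device> devices; };

// One trade step: move a device from g1 to g2 if total SQ improves and g2 has
// the bandwidth to serve it.  Real gateways would negotiate this distributedly.
bool trade_once(Gateway& g1, Gateway& g2) {
    for (size_t i = 0; i < g1.devices.size(); ++i) {
        const Device& d = g1.devices[i];
        if (d.sq_there > d.sq_here && g2.bandwidth_free >= d.bandwidth_need) {
            g2.bandwidth_free -= d.bandwidth_need;
            g1.bandwidth_free += d.bandwidth_need;
            g2.devices.push_back({d.bandwidth_need, d.sq_there, d.sq_here});
            g1.devices.erase(g1.devices.begin() + i);
            return true;   // one profitable trade performed
        }
    }
    return false;          // no beneficial trade found
}

int main() {
    Gateway a{1.0, {{0.5, 0.4, 0.9}}};   // the device would enjoy better SQ on b
    Gateway b{2.0, {}};
    while (trade_once(a, b)) {}          // iterate until no trade improves SQ
}
```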
The main goal of dynamic memory allocators is to minimize memory fragmentation. Fragmentation results from the interaction of workload behavior and allocator policy. There are, however, no works systematically capturing this interaction in an informative data structure. We consider this gap responsible for the absence of a standardized, quantitative fragmentation metric, the lack of workload characterization techniques with respect to their dynamic memory behavior, and the absence of an open, widely used benchmark suite targeting dynamic memory allocation. Such shortcomings are profoundly asymmetric to the operation's ubiquity. This paper presents a trace-based simulation methodology for constructing representations of workload-allocator interaction. We use two-dimensional rectangular bin packing (2DBP) as our foundation. Classical 2DBP algorithms minimize their products' makespan, but virtual memory systems employing demand paging deem such a criterion inappropriate. We view an allocator's placement decisions as a solution to a 2DBP instance, optimizing some unknown criterion particular to that allocator's policy. Our end product is a compact data structure that fits, e.g., the simulation of 80 million requests in a 350 MiB file. By design, it is concerned with events residing entirely in virtual memory; no information on memory accesses, indexing costs or any other factor is kept. We bootstrap our contribution's significance by exploring its relationship to maximum resident set size (RSS). Our baseline is the assumption that less fragmentation amounts to smaller peak RSS. We thus define a fragmentation metric in the 2DBP substrate and compute it for 28 workloads linked to 4 modern allocators. We also measure peak RSS for the 112 resulting pairs. Our metric exhibits a strong monotonic relationship (Spearman coefficient > 0.65) in half of those cases: allocators achieving better 2DBP placements yield 9%-30% smaller peak RSS, with the trends remaining consistent across two different machines. Considering our representation's minimalism, the presented empirical evidence is a robust indicator of its potency. If workload-allocator interplay in the virtual address space suffices to evaluate a novel fragmentation definition, numerous other useful applications of our tool can be studied. Both augmenting 2DBP and exploring alternative computations on it provide ample fertile ground for future research.
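The 2DBP view treats each allocation as a rectangle spanning an address range on one axis and a lifetime on the other. The sketch below computes only a naive fragmentation-style ratio (live bytes vs. heap extent over time) on such rectangles; it illustrates the representation, not the paper's actual metric.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// One allocation viewed as a rectangle: [addr, addr+size) x [t_alloc, t_free).
struct Rect {
    uint64_t addr, size;
    uint64_t t_alloc, t_free;
};

// Naive illustration (not the paper's metric): compare the bytes live at each
// instant with the furthest address the allocator has touched so far.
double fragmentation_ratio(const std::vector<Rect>& trace, uint64_t horizon) {
    double worst = 0.0;
    for (uint64_t t = 0; t < horizon; ++t) {
        uint64_t live = 0, extent = 0;
        for (const Rect& r : trace) {
            if (r.t_alloc <= t) extent = std::max(extent, r.addr + r.size);
            if (r.t_alloc <= t && t < r.t_free) live += r.size;
        }
        if (live > 0) worst = std::max(worst, double(extent - live) / extent);
    }
    return worst;   // 0 = perfectly packed, closer to 1 = heavily fragmented
}

int main() {
    // Two allocations with non-overlapping lifetimes placed side by side: the
    // second one could have reused the first one's slot.
    std::vector<Rect> trace = {{0, 100, 0, 5}, {100, 100, 6, 10}};
    std::cout << "fragmentation ~ " << fragmentation_ratio(trace, 10) << "\n";
}
```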
ACM Transactions on Embedded Computing Systems, Sep 27, 2017
Many-core systems are envisioned to address the ever-increasing demand for more powerful computing systems. To provide the necessary computing power, the number of Processing Elements integrated on-chip increases, and NoC-based infrastructures are adopted to address interconnection scalability. The advent of these new architectures raises the need for more sophisticated, distributed resource management paradigms, while the extreme integration scaling makes the new systems more prone to errors manifested both in hardware and in software. In this work, we highlight the need for Run-Time Resource Management to be enhanced with fault tolerance features and propose SoftRM, a resource management framework that can dynamically adapt to permanent failures in a self-organized, workload-aware manner. Self-organization allows the resource management agents to recover from a failure in a coordinated way by electing a new agent to replace the failed one, while workload awareness optimizes this choice according to the status of each core. We evaluate the proposed framework on the Intel Single-chip Cloud Computer (SCC), a NoC-based many-core system, and customize it to achieve minimum interference on the resource allocation process. We showcase that its workload-aware features manage to utilize free resources in more than 90% of the conducted experiments. Comparison with relevant state-of-the-art fault-tolerant frameworks shows a decrease of up to 67% in the overhead imposed on application execution. CCS Concepts: • General and reference → Cross-computing tools and techniques; • Computer systems organization → Multicore architectures; • Networks → Network on chip; • Computing methodologies → Self-organization;
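The workload-aware replacement idea can be sketched very simply: when an agent core fails, pick the least-loaded alive spare core as the new agent so that running applications are disturbed as little as possible. The data model and the election rule below are illustrative assumptions; SoftRM's actual coordination and election protocol on the SCC is not modeled.

```cpp
#include <iostream>
#include <vector>

// Illustrative core state; SoftRM's real agent protocol is not modeled here.
struct Core {
    int    id;
    bool   alive;
    bool   is_agent;
    double load;   // 0.0 = idle, 1.0 = fully occupied by applications
};

// Workload-aware election: replace a failed agent with the alive, non-agent
// core carrying the least workload.
int elect_replacement(const std::vector<Core>& cores) {
    int best = -1;
    for (const Core& c : cores) {
        if (!c.alive || c.is_agent) continue;
        if (best == -1 || c.load < cores[best].load) best = c.id;
    }
    return best;   // -1 means no spare core is available
}

int main() {
    std::vector<Core> cores = {
        {0, false, true,  0.0},   // failed agent
        {1, true,  false, 0.7},
        {2, true,  false, 0.1},   // lightly loaded core: preferred replacement
        {3, true,  true,  0.3},
    };
    std::cout << "new agent = core " << elect_replacement(cores) << "\n";
}
```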
Multiplication is an arithmetic operation that has a significant impact on the performance of various real-life applications, such as digital signal processing, image processing and computer vision. In this study, aiming to exploit the efficiency of alternative number representation formats, the authors propose an energy-efficient scheme for multiplying 2's-complement binary numbers with two least significant bits (LSBs). The double-LSB (DLSB) arithmetic delivers several benefits, such as a symmetric representation range, number negation performed only by bitwise inversion, and the facilitation of the rounding process in the results of floating-point architectures. The hardware overhead of the proposed circuit, when implemented at 45 nm, is negligible in comparison with the conventional Modified Booth multiplier for ordinary 2's-complement numbers (3.1% area and 3.3% energy average overhead across different multiplier bit-widths). Moreover, the proposed DLSB multiplier outperforms the previous state-of-the-art implementation by providing 10.2% energy and 7.8% area average gains. Finally, they demonstrate how DLSB multipliers can be effectively used as building blocks for the implementation of larger multiplications, delivering area and energy savings.
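A small numeric sketch of the representation: a DLSB value is an n-bit two's-complement part plus one extra LSB of weight 1, which yields the symmetric range and lets negation be done by inverting all bits, since ~X + (1 - d) = (-X - 1) + 1 - d = -(X + d). The code below only decodes and multiplies as plain integers to illustrate the arithmetic; it is not the proposed Booth-based hardware design.

```cpp
#include <cstdint>
#include <iostream>

// Double-LSB (DLSB) value: an 8-bit two's-complement part plus one extra LSB
// of weight 1, giving the symmetric range [-128, +128].
struct Dlsb8 {
    int8_t  msb_part;    // ordinary two's-complement bits
    uint8_t extra_lsb;   // 0 or 1, same weight as the regular LSB

    int value() const { return int(msb_part) + int(extra_lsb); }

    // Negation by pure bitwise inversion of all nine bits.
    Dlsb8 negated() const { return {int8_t(~msb_part), uint8_t(1 - extra_lsb)}; }
};

// Illustration only: multiply by decoding to integers; the paper instead
// extends a Modified Booth multiplier to handle the extra LSBs directly.
int dlsb_multiply(const Dlsb8& a, const Dlsb8& b) { return a.value() * b.value(); }

int main() {
    Dlsb8 a{127, 1};   // +128, not representable as a plain 8-bit value
    Dlsb8 b{-3, 0};    // -3
    std::cout << a.value() << " * " << b.value() << " = "
              << dlsb_multiply(a, b) << "\n";
    std::cout << "-(" << a.value() << ") = " << a.negated().value() << "\n";
}
```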
The advent of many-accelerator Systems-on-Chip (SoC), driven by the ever-increasing demands for high performance and energy efficiency, has led to the need for new interconnection schemes among the system components that minimize communication overhead. Towards this need, Hierarchical Networks-on-Chip (HNoCs) can provide an efficient communication paradigm for such systems: each node is an autonomous sub-network including the hardware accelerators needed by the respective application thread, thus retaining data locality and minimizing congestion. However, HNoC design may lead to an exponential increase in design space size, due to the numerous parameter combinations of the sub-networks and the overall HNoC. In addition, a prototyping framework supporting HNoC simulation with real stimuli is crucial for accurate system evaluation. Therefore, the goal of this paper is to present (a) a SystemC framework for cycle-accurate simulation of Hierarchical NoCs, accompanied by a NoC API for node mapping on the HNoC, and (b) an exploration flow that aims to reduce the increased design space size. Using the Rician Denoising algorithm for MRI scans as a case study, the proposed DSE flow achieves up to 2× and 1.48× time and power improvements, respectively, compared to a typical DSE flow.
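The general shape of a reduced exploration flow is a sweep over parameter combinations in which cheap analytical estimates prune configurations before the expensive cycle-accurate simulation. The parameters, cost proxies, and pruning rule below are illustrative assumptions, not the paper's actual DSE flow.

```cpp
#include <iostream>
#include <limits>

// Illustrative HNoC design point; the real design space is much larger.
struct Config { int sub_networks; int fifo_depth; int link_width_bits; };

// Cheap analytical proxies used only to prune before cycle-accurate simulation.
double estimated_power(const Config& c) {
    return 0.2 * c.sub_networks + 0.01 * c.fifo_depth + 0.005 * c.link_width_bits;
}
double simulate_latency(const Config& c) {          // stand-in for a SystemC run
    return 100.0 / c.sub_networks + 50.0 / c.link_width_bits + 2.0 / c.fifo_depth;
}

int main() {
    const double power_budget = 1.5;                // hypothetical constraint
    Config best{};
    double best_latency = std::numeric_limits<double>::max();
    int simulated = 0, pruned = 0;

    for (int nets = 2; nets <= 8; nets *= 2)
        for (int depth = 4; depth <= 32; depth *= 2)
            for (int width = 32; width <= 128; width *= 2) {
                Config c{nets, depth, width};
                if (estimated_power(c) > power_budget) { ++pruned; continue; }
                ++simulated;
                double lat = simulate_latency(c);   // the expensive step in reality
                if (lat < best_latency) { best_latency = lat; best = c; }
            }

    std::cout << "simulated " << simulated << ", pruned " << pruned
              << ", best: " << best.sub_networks << " sub-nets, depth "
              << best.fifo_depth << ", width " << best.link_width_bits << "\n";
}
```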
Energy efficiency is considered today a first-class design principle of modern many-core computing systems in the effort to overcome the limited power envelope. However, many-core processors are characterized by high micro-architectural complexity, which propagates up to the application level and affects both performance and energy consumption. In this paper, we present CF-TUNE, an online and scalable auto-tuning framework for energy-aware application mapping on emerging many-core architectures. CF-TUNE enables the extraction of an energy-efficient tuning configuration point with minimal application characterization of the whole tuning configuration space. Instead of analyzing every application against every tuning configuration, it adopts a collaborative filtering technique that quickly and accurately configures the application's tuning parameters by identifying similarities with previously optimized applications. We evaluate CF-TUNE's efficiency against a set of demanding and diverse applications mapped on the Intel Many Integrated Core processor and show that, with minimal characterization, e.g. only two or four evaluations, CF-TUNE recommends a tuning configuration that performs at least at the 94% level of the optimal one.
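A simplified sketch of the collaborative-filtering idea: score the few configurations measured for a new application, find the most similar previously characterized application via cosine similarity over those shared configurations, and recommend that application's best-known configuration. The data, similarity choice, and scoring are illustrative assumptions, not CF-TUNE's exact algorithm.

```cpp
#include <cmath>
#include <iostream>
#include <map>
#include <string>

// scores[app][config] = measured energy efficiency (higher is better).
using ConfigScores = std::map<int, double>;
using Scores       = std::map<std::string, ConfigScores>;

// Cosine similarity over the configurations both applications have evaluated.
double similarity(const ConfigScores& a, const ConfigScores& b) {
    double dot = 0, na = 0, nb = 0;
    for (const auto& [cfg, va] : a) {
        auto it = b.find(cfg);
        if (it == b.end()) continue;
        dot += va * it->second;
        na  += va * va;
        nb  += it->second * it->second;
    }
    return (na == 0 || nb == 0) ? 0.0 : dot / std::sqrt(na * nb);
}

// Recommend the best-known configuration of the most similar application.
int recommend(const Scores& history, const ConfigScores& partial) {
    std::string best_app;
    double best_sim = -1;
    for (const auto& [app, configs] : history) {
        double s = similarity(configs, partial);
        if (s > best_sim) { best_sim = s; best_app = app; }
    }
    int best_cfg = -1;
    double best_score = -1;
    for (const auto& [cfg, score] : history.at(best_app))
        if (score > best_score) { best_score = score; best_cfg = cfg; }
    return best_cfg;
}

int main() {
    Scores history = {
        {"stencil", {{0, 0.3}, {1, 0.9}, {2, 0.5}}},
        {"sort",    {{0, 0.8}, {1, 0.2}, {2, 0.4}}},
    };
    ConfigScores new_app = {{0, 0.31}, {2, 0.52}};   // only two evaluations done
    std::cout << "recommended config: " << recommend(history, new_app) << "\n";
}
```

Because the new application's two measurements closely track the "stencil" profile, the sketch recommends stencil's best configuration without evaluating the remaining points.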