-
DECICE: Device-Edge-Cloud Intelligent Collaboration Framework
Authors:
Julian Kunkel,
Christian Boehme,
Jonathan Decker,
Fabrizio Magugliani,
Dirk Pleiter,
Bastian Koller,
Karthee Sivalingam,
Sabri Pllana,
Alexander Nikolov,
Mujdat Soyturk,
Christian Racca,
Andrea Bartolini,
Adrian Tate,
Berkay Yaman
Abstract:
DECICE is a Horizon Europe project that is developing an AI-enabled, open, and portable management framework for the automatic and adaptive optimization and deployment of applications across the computing continuum, from IoT sensors on the Edge to large-scale Cloud / HPC computing infrastructures. In this paper, we describe the DECICE framework and architecture. Furthermore, we highlight use cases for framework evaluation: intelligent traffic intersection, magnetic resonance imaging, and emergency response.
Submitted 4 May, 2023;
originally announced May 2023.
-
Monte Cimone: Paving the Road for the First Generation of RISC-V High-Performance Computers
Authors:
Andrea Bartolini,
Federico Ficarelli,
Emanuele Parisi,
Francesco Beneventi,
Francesco Barchi,
Daniele Gregori,
Fabrizio Magugliani,
Marco Cicala,
Cosimo Gianfreda,
Daniele Cesarini,
Andrea Acquaviva,
Luca Benini
Abstract:
The new open and royalty-free RISC-V ISA is attracting interest across the whole computing continuum, from microcontrollers to supercomputers. High-performance RISC-V processors and accelerators have been announced, but RISC-V-based HPC systems will need a holistic co-design effort spanning memory, the storage hierarchy, interconnects, and the full software stack. In this paper, we describe Monte Cimone, a fully operational multi-blade computer prototype and hardware-software test-bed based on the U740, a double-precision capable, multi-core, 64-bit RISC-V SoC. Monte Cimone does not aim to achieve strong floating-point performance; it was built with the purpose of "priming the pipe" and exploring the challenges of integrating a multi-node RISC-V cluster capable of providing an HPC production stack, including interconnect, storage, and power monitoring infrastructure, on RISC-V hardware. We present the results of our hardware/software integration effort, which demonstrate a remarkable level of software and hardware readiness and maturity, showing that the first generation of RISC-V HPC machines may not be so far in the future.
Submitted 7 May, 2022;
originally announced May 2022.
-
ANDREAS: Artificial intelligence traiNing scheDuler foR accElerAted resource clusterS
Authors:
Federica Filippini,
Danilo Ardagna,
Marco Lattuada,
Edoardo Amaldi,
Michele Ciavotta,
Maciek Riedl,
Katarzyna Materka,
Paweł Skrzypek,
Fabrizio Magugliani,
Marco Cicala
Abstract:
Artificial Intelligence (AI) and Deep Learning (DL) algorithms are currently applied to a wide range of products and solutions. DL training jobs are highly resource-demanding and benefit greatly from AI accelerators (e.g., GPUs). However, the effective management of GPU-powered clusters comes with great challenges. Among these, efficient scheduling and resource allocation solutions are crucial to maximize performance and minimize data center operational costs. In this paper, we propose ANDREAS, an advanced scheduling solution that tackles these problems jointly, aiming at optimizing DL training runtime workloads and their energy consumption in accelerated clusters. Simulation-based experiments demonstrate that we can achieve a cost reduction between 30% and 62% on average with respect to first-principle methods, while validation on a real cluster shows a worst-case deviation below 13% between actual and predicted costs, proving the effectiveness of the ANDREAS solution in practical scenarios.
Submitted 11 May, 2021;
originally announced May 2021.