Deep Convolutional Neural Networks (DCNNs) achieve state-of-the-art results compared to classic machine learning in many applications that require recognition, identification and classification. An ever-increasing embedded deployment of DCNN inference engines, supporting the intelligence-close-to-the-sensor paradigm, has been observed, overcoming limitations of cloud-based computing such as bandwidth requirements, security, privacy, scalability, and responsiveness. However, increasing the robustness and accuracy of DCNNs comes at the price of increased computational cost. As a result, implementing CNNs on embedded devices with real-time constraints is a challenge if the lowest power consumption is to be achieved. A solution to this challenge is to take advantage of the massive intra-device fine-grain parallelism offered by these systems and to benefit from the extensive concurrency exhibited by DCNN processing pipelines. The trick is to divide intensive tasks into smaller, weakly interacting batches subject to parallel processing. Accordingly, this paper has two main goals: 1) describe the implementation of a state-of-the-art technique to map the most intensive DCNN tasks (dominated by multiply-and-accumulate operations) onto the Orlando SoC, an ultra-low-power heterogeneous multi-core platform developed by STMicroelectronics; 2) integrate the proposed implementation into a toolchain that allows deep learning developers to deploy DCNNs in low-power applications.
Abstract. Software development productivity for embedded systems is greatly limited by the high fragmentation of platforms and their associated development tools. Platform virtualization environments, like Java and Microsoft .NET, help to alleviate the problem, but because of their advanced run-time features and libraries, they are limited to host functionalities running on the system microcontroller and on top of the operating system. Due to the ever-increasing demand for processing power, it is highly desirable to extend the benefits of platform virtualization to the rest of the system programmable resources, media processors in particular, that can boost performance to a great extent. To this aim we are developing a virtualization framework targeting high-performance media applications running on deeply embedded media processors, which we combine with traditional host virtualization environments in order to offer a system-wide virtualization solution. In this paper we present...
Abstract. Embedded systems contain a wide variety of processors. Economic and technological factors favor systems made of a combination of diverse but programmable processors. Software has a longer lifetime than the hardware for which it is initially designed. Application portability is thus of utmost importance for the embedded systems industry. The Common Language Infrastructure (CLI) is a rich virtualization environment for the execution of applications written in multiple languages. CLI efficiently captures the semantics of unmanaged languages, such as C. We investigate the use of CLI as a deployment format for embedded systems to reconcile apparently contradictory constraints: the need for portability, the need for high performance and the existence of a large base of legacy C code. In this paper, we motivate our CLI-based compilation environment for C, and its different use scenarios. We then focus on the specific challenges of effectively mapping the C language to CLI, and o...
The design of embedded systems is driven by strong constraints in terms of performance, silicon area and power consumption, as well as pressure on cost and time-to-market. This has three consequences: 1) many-core systems are becoming mainstream, but there is still no satisfactory approach for distributing software applications on these platforms; 2) these systems integrate heterogeneous processors for efficiency reasons, thus programming them requires complex compilation environments; 3) hardware resources are precious and low-level languages are still a must to exploit them fully. These factors negatively impact the programmability of many-core platforms and limit our ability to address the challenges of the next decade. This paper devises a new programming approach leveraging processor virtualization and component-based software engineering paradigms to address these issues all together. We present a programming model based on C for developing fine-grain component-based appli...
ACM Transactions on Architecture and Code Optimization (TACO), 2020
Recent trends in deep convolutional neural networks (DCNNs) impose hardware accelerators as a viable solution for computer vision and speech recognition. The Orlando SoC architecture from STMicroelectronics targets exactly this class of problems by integrating hardware-accelerated convolutional blocks together with DSPs and on-chip memory resources to enable energy-efficient designs of DCNNs. The main advantage of the Orlando platform is to have runtime-configurable convolutional accelerators that can adapt to different DCNN workloads. This opens new challenges for mapping the computation to the accelerators and for managing the on-chip resources efficiently. In this work, we propose a runtime design space exploration and mapping methodology for runtime resource management in terms of on-chip memory, convolutional accelerators, and external bandwidth. Experimental results are reported in terms of power/performance scalability, Pareto analysis, mapping adaptivity, and accelerator uti...
Thème COM, Systèmes communicants. Équipe-Projet ALF. Research report no. 6933, May 2009, 17 pages. Abstract: Embedded system design is driven by strong efficiency constraints in terms of performance, silicon area and power consumption, as well as pressure on the cost and time-to-market. As of today, this forms three tough problems: 1) many-core systems are becoming mainstream, however there is still no decent approach for distributing software applications on those platforms; 2) these systems still integrate heterogeneous processors for efficiency reasons, thus programming them requires complex compilation environments; 3) hardware resources are precious and low-level languages are still a must to exploit them subtly. These factors have a negative impact on the programmability of many-core platforms and limit our ability to address the challenges of the next decade. This paper devises a new programming approach leveraging processor virtualization and component-based software engineering technologies to address these is...
In a world of ubiquitous, heterogeneous parallelism, achieving portable performance is a challenge. It requires finely tuned coordination, from the programming language to the hardware, through the compiler and multiple layers of the run-time system. This document presents our work in split compilation and parallelization. Split compilation relies on automatically generated semantic annotations to enrich the intermediate format, decoupling costly offline analyses from lighter, online or just-in-time program transformations. Our work focuses on automatic vectorization, a key optimization playing an increasing role in modern, power-efficient architectures. Our research platform uses GCC's support for the Common Language Infrastructure (CLI ECMA-335 [8] and ISO 23271:2006 [10]); this choice is motivated by the unique combination of optimizations and portability of GCC, through a semantically rich and performance-friendly intermediate format. Implementation is still in progress.
With the recent advances in machine learning, Deep Convolutional Neural Networks (DCNNs) represent state-of-the-art solutions, especially in image and speech recognition and classification. The most important enabling factor of deep learning is the massive computing power offered by programmable GPUs for training DCNNs on large amounts of data. Even DCNN deployment scenarios, where trained models are used for inference, have started to require powerful computing systems. Especially in the embedded systems domain, the computational requirements along with ultra-low-power and memory constraints exacerbate the situation even further. The STM Orlando ultra-low-power processor architecture with convolutional neural network acceleration targets exactly this class of problems. The Orlando SoC integrates HW-accelerated blocks together with DSPs and on-chip memory resources to enable energy-efficient convolutions for future generations of DCNNs. Although the Orlando platform...
Complex embedded systems have always been heterogeneous, and it is unlikely that this situation will change any time soon. Still, the huge non-recurring engineering cost of silicon products tends to make more parts of embedded systems programmable. Our research proposes to address this complexity through processor virtualization. We decided to rely on the CLI format, and we developed a GCC back-end for it. Even though we were able to generate reasonable code, we noticed that we were lacking some important optimizations that exploit the evaluation stack of the virtual machine. Since GCC internals do not provide any support for stack-based instruction sets, we introduced our own. We review the limitations of our previous prototype, and we present the data structures of our internal representation, as well as its API. We also describe a number of optimizations that this representation enabled. To exemplify its convenience, we report the code size improvements we obtained with little eff...
CLI Back-End in GCC. Roberto Costa, STMicroelectronics; Andrea C. Ornstein, STMicroelectronics, andrea.ornstein@st.com; Erven Rohou, STMicroelectronics, erven.rohou@st.com. Abstract: CLI is a framework that defines a platform-independent format for executables. The ...