Abstract
In this paper, we present StreamDrive, a dynamic dataflow framework for programming clustered embedded multicore architectures. StreamDrive simplifies the development of dynamic dataflow applications starting from sequential reference C code and lets applications handle heterogeneous and application-specific processing elements seamlessly. We address the efficient implementation of a dynamic dataflow runtime system in constrained embedded environments, an issue that previous research has not sufficiently addressed. We conducted a detailed performance evaluation of the StreamDrive implementation on our Application Specific MultiProcessor (ASMP) cluster using the Oriented FAST and Rotated BRIEF (ORB) algorithm, which is typical of the image processing domain. We used the proposed incremental development flow to transform the original ORB reference C code into an optimized dynamic dataflow implementation. Our implementation has less than 10% parallelization overhead and near-linear speedup as the number of processors increases from 1 to 8. It achieves 15 VGA frames per second with a small cluster configuration of 4 processing elements and 64KB of shared memory, and 30 VGA frames per second with 8 processors and 128KB of shared memory.
Notes
This set includes relatively generic instructions, such as MAC4CLIP, which performs SIMD multiplication on the bytes of two input operands, saturates the two 16-bit results, and accumulates them with the result operand, as well as instructions dedicated to specific image processing functions, such as XORSBCW, used in Support Vector Machines (SVM), which calculates the Hamming distance between two vectors.
In this paper, we also use the term actor for KPN processes for the sake of consistency.
In the actual implementation, we implemented two variants of ORB: (1) with a tightly-coupled HW rescaler block, where the pyramid construction is part of the dataflow graph, and (2) with the pyramid construction as a pre-processing step.
Any field of the Keyp_t structure can be used to communicate the number of corners.
The synchronization overhead includes the actions required to verify token availability and the associated scheduler actions.
Unless there is uncontrolled accumulation of tokens in a channel.
This real-time requirement also takes the match part of the application into account.
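The note on the XORSBCW instruction mentions computing the Hamming distance between two vectors. As a point of reference, the following is a minimal portable C sketch of that computation on plain byte arrays. It is not the semantics of the ASMP instruction: the function names hamming_distance and popcount8, the byte-wise operand layout, and the vector length parameter are assumptions made for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count the set bits in one byte without
 * relying on compiler builtins. */
static unsigned popcount8(uint8_t x)
{
    unsigned n = 0;
    while (x) {
        x &= (uint8_t)(x - 1); /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Hypothetical helper: Hamming distance between two bit vectors stored
 * as byte arrays of equal length. XOR marks the differing bit positions,
 * and the bit count of the XOR result is the distance. */
unsigned hamming_distance(const uint8_t *a, const uint8_t *b, size_t len)
{
    unsigned dist = 0;
    for (size_t i = 0; i < len; i++)
        dist += popcount8((uint8_t)(a[i] ^ b[i]));
    return dist;
}
```

For the 4-byte vectors {0xFF, 0x0F, 0x00, 0xAA} and {0x00, 0x0F, 0x01, 0x55}, the per-byte distances are 8, 0, 1 and 8, so hamming_distance returns 17. A dedicated instruction would presumably fold the XOR and the bit count into a single operation over wider operands, which is where the benefit over this scalar loop would come from.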
Acknowledgements
This research was partially funded by the H2020 Project Opecomp (CA 732631) and by the ERC-ADG Project Multitherman (CA 291125). The authors would also like to thank the ST Microelectronics’ Embedded Computing Systems management for supporting this research.
Cite this article
Stoutchinin, A., Benini, L. StreamDrive: a Dynamic Dataflow Framework for Clustered Embedded Architectures. J Sign Process Syst 91, 275–301 (2019). https://doi.org/10.1007/s11265-018-1351-1