Abstract
Embedded system applications, with their inherently limited parallelism, rarely exploit all available processing resources in large manycore architectures based on distributed shared memory (DSM). From a cache coherence perspective, this provides an opportunity to move away from global coherence spanning all tiles, which does not scale well. We therefore favor a region-based cache coherence (RBCC) approach that enables coherence among a selectable cluster of tiles in accordance with application requirements. We present the design and hardware implementation of a flexibly configurable coherency region manager (CRM) that enables RBCC. We introduce two novel features that enhance RBCC, namely runtime coherency region re-configuration and RBCC-malloc(), which dynamically tailor coherence to the application working sets that are actually shared. Further, we propose, implement and evaluate additional CRM functions, such as a non-intrusive barrier synchronization mechanism and a false sharing resolution strategy, for our DSM-based manycore architecture. We have synthesized the CRM on an FPGA prototype for a 64-core system and observe a 38% reduction in BRAM utilization compared to a global coherence directory for regions with up to 32 cores. Experiments using a video streaming application reveal a speed-up of up to 42% compared to an alternative message-passing-based implementation. We also evaluate the benefits of runtime coherency region re-configuration using two scenarios and present a formal analysis of when a re-configuration is beneficial.
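To give a feel for why restricting coherence to a region can shrink directory storage, the following minimal sketch compares the width of one directory entry when every core in the system must be tracked with the width when only the cores of a region are tracked. It assumes a generic full-map directory with one presence bit per tracked core. The function names and the tag and state widths are hypothetical placeholders, not values taken from the CRM design, and the sketch is not a derivation of the BRAM figure reported above.

```python
def directory_entry_bits(tracked_cores: int,
                         state_bits: int = 2,
                         tag_bits: int = 20) -> int:
    """Width of one entry in a generic full-map directory.

    Assumption: one presence bit per core that may hold a copy, plus a
    few state bits and an address tag. The default state_bits and
    tag_bits are illustrative placeholders only.
    """
    if tracked_cores <= 0:
        raise ValueError("tracked_cores must be positive")
    return tag_bits + state_bits + tracked_cores


def relative_saving(total_cores: int, region_cores: int,
                    state_bits: int = 2, tag_bits: int = 20) -> float:
    """Fraction of per-entry storage saved by tracking only a region."""
    if region_cores > total_cores:
        raise ValueError("a region cannot exceed the whole system")
    global_bits = directory_entry_bits(total_cores, state_bits, tag_bits)
    region_bits = directory_entry_bits(region_cores, state_bits, tag_bits)
    return 1.0 - region_bits / global_bits


if __name__ == "__main__":
    # 64-core system, regions of up to 32 cores (sizes from the abstract).
    saving = relative_saving(total_cores=64, region_cores=32)
    print(f"per-entry saving under these assumptions: {saving:.1%}")
```

The saving depends strongly on the assumed tag and state widths, since only the presence-bit portion of an entry shrinks with the region size. Real BRAM utilization also depends on the number of entries and on how the memory blocks are packed.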
Notes
In our system, coherence messages and their acknowledgement messages are not re-ordered.
Multiple coherence barriers can be supported by increasing the number of barrier and shadow registers per tile.
For some applications, this can additionally contain state transfers.
Acknowledgements
This work was partly funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)—Project Number 146371743-TRR 89: Invasive Computing.
The authors would like to thank Sai Varun Brahmadevara, Li-Yu Peng and Miguel Montoya Rendon for their contributions as master and internship students at the Chair of Integrated Systems, TUM. We would also like to thank Sebastian Maier at the Computer Science 4 department, FAU, Erlangen-Nuremberg for his OS support.
Cite this article
Srivatsa, A., Mansour, M., Rheindt, S. et al. DynaCo: Dynamic Coherence Management for Tiled Manycore Architectures. Int J Parallel Prog 49, 570–599 (2021). https://doi.org/10.1007/s10766-020-00688-6
DOI: https://doi.org/10.1007/s10766-020-00688-6