DOI: 10.1145/3626183.3659961

Fault-Tolerant Parallel Integer Multiplication

Published: 17 June 2024
  Abstract

    Exascale machines have a small mean time between failures, necessitating fault tolerance. Out-of-the-box fault-tolerant solutions, such as checkpoint-restart and replication, apply to any algorithm but incur significant overhead costs. Long integer multiplication is a fundamental kernel in numerical linear algebra and cryptography. The naive, schoolbook multiplication algorithm runs in Θ(n^2) time, while Toom-k algorithms run in Θ(n^(log_k(2k-1))) time for k ≥ 2. We obtain the first efficient fault-tolerant parallel Toom-Cook algorithm. While asymptotically faster FFT-based algorithms exist, Toom-Cook algorithms are often favored in practice at small scales and on supercomputers. Our algorithm enables fault tolerance with negligible overhead costs. Compared to existing, general-purpose, fault-tolerant solutions, it reduces the arithmetic and communication (bandwidth) overhead costs by a factor of Θ(P/(2k-1)), where P is the number of processors. To this end, we adapt the fault-tolerant BFS-DFS method of Birnbaum et al. (2020) for fast matrix multiplication and combine it with a coding strategy tailored to Toom-Cook. This eliminates the need for recomputation, resulting in a much faster algorithm.
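    To make the evaluate-multiply-interpolate structure of Toom-k concrete, below is a minimal, single-node Python sketch; it is not the paper's parallel, fault-tolerant algorithm. Each operand is split into k limbs and viewed as a degree-(k-1) polynomial; both polynomials are evaluated at 2k-1 points; the 2k-1 pointwise products (the independent sub-multiplications that a parallel Toom-Cook distributes across processors, and that the paper's coding strategy protects with redundant, coded evaluations) are computed; and the coefficients of the product polynomial are recovered by exact interpolation. The helper names, the evaluation points 0, 1, ..., 2k-2, and the generic Vandermonde solve are illustrative assumptions, not details taken from the paper.

```python
from fractions import Fraction


def solve_exact(A, b):
    """Solve the linear system A c = b exactly over the rationals.

    A Vandermonde matrix on distinct points is nonsingular, so a nonzero
    pivot always exists in the elimination below.
    """
    n = len(b)
    M = [[Fraction(A[i][j]) for j in range(n)] + [Fraction(b[i])] for i in range(n)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    sol = [Fraction(0)] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * sol[j] for j in range(i + 1, n))
        sol[i] = (M[i][n] - s) / M[i][i]
    return sol


def toom_k_multiply(x, y, k=3):
    """Multiply non-negative integers x and y with a Toom-k style scheme."""
    # Split each operand into k limbs of limb_bits bits, i.e. view it as a
    # degree-(k-1) polynomial evaluated at B = 2**limb_bits.
    limb_bits = max((max(x.bit_length(), y.bit_length()) + k - 1) // k, 1)
    mask = (1 << limb_bits) - 1
    xs = [(x >> (i * limb_bits)) & mask for i in range(k)]
    ys = [(y >> (i * limb_bits)) & mask for i in range(k)]

    # The product polynomial has degree at most 2k-2, so 2k-1 evaluation
    # points determine it uniquely.
    pts = list(range(2 * k - 1))
    ex = [sum(xs[i] * t**i for i in range(k)) for t in pts]
    ey = [sum(ys[i] * t**i for i in range(k)) for t in pts]

    # The 2k-1 pointwise products: the independent sub-multiplications that a
    # parallel Toom-Cook would distribute across processors (and recurse on);
    # Python's built-in multiplication stands in for them here.
    w = [a * b for a, b in zip(ex, ey)]

    # Interpolation: solve the (2k-1) x (2k-1) Vandermonde system for the
    # product coefficients, then re-evaluate at B to get the integer product.
    V = [[t**j for j in range(2 * k - 1)] for t in pts]
    coeffs = solve_exact(V, w)
    return sum(int(c) << (i * limb_bits) for i, c in enumerate(coeffs))


if __name__ == "__main__":
    a = 123456789012345678901234567890123456789
    b = 987654321098765432109876543210987654321
    for k in (2, 3, 4):  # k = 2 is Karatsuba-style splitting
        assert toom_k_multiply(a, b, k=k) == a * b
```

    For k = 3 this replaces 9 third-size multiplications with 5, which is where the Θ(n^(log_3 5)) ≈ Θ(n^1.465) bound comes from; production implementations avoid the generic Vandermonde solve in favor of hand-optimized evaluation points (e.g., 0, ±1, ±2, ∞) and interpolation sequences such as Bodrato's.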

    References

    [1]
    Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. 2013. Effective straggler mitigation: Attack of the clones. In 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13). 185--198.
    [2]
    Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris. 2010. Reining in the outliers in Map-Reduce clusters using Mantri. In 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI 10).
    [3]
    Algirdas Avizienis, J-C Laprie, Brian Randell, and Carl Landwehr. 2004. Basic concepts and taxonomy of dependable and secure computing. IEEE transactions on dependable and secure computing 1, 1 (2004), 11--33.
    [4]
    Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, and Oded Schwartz. 2012. Communication-optimal parallel algorithm for Strassen's matrix multiplication. In Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures. ACM, 193--204.
    [5]
    Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. 2011. Minimizing communication in numerical linear algebra. SIAM J. Matrix Anal. Appl. 32, 3 (2011), 866--901.
    [6]
    Jose Maria Bermudo Mera, Angshuman Karmakar, and Ingrid Verbauwhede. 2020. Time-memory trade-off in Toom-Cook multiplication: an application to module-lattice based cryptography. IACR Transactions on Cryptographic Hardware and Embedded Systems 2020, 2 (2020), 222--244.
    [7]
    Gianfranco Bilardi and Lorenzo De Stefani. 2019. The I/O complexity of Toom-Cook integer multiplication. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2034--2052.
    [8]
    Noam Birnbaum, Roy Nissim, and Oded Schwartz. 2020. Fault Tolerance with High Performance for Fast Matrix Multiplication. In 2020 Proceedings of the SIAM Workshop on Combinatorial Scientific Computing. SIAM, 106--117.
    [9]
    Noam Birnbaum and Oded Schwartz. 2018. Fault Tolerant Resource Efficient Matrix Multiplication. In Proceedings of the Eighth SIAM Workshop on Combinatorial Scientific Computing. SIAM, 23--34.
    [10]
    Noam Birnbaum and Oded Schwartz. 2018. Fault tolerant resource efficient matrix multiplication. In 2018 Proceedings of the Seventh SIAM Workshop on Combinatorial Scientific Computing. SIAM, 23--34.
    [11]
    Marco Bodrato and Alberto Zanoni. 2006. What about Toom-Cook matrices optimality. Centro "Vito Volterra" Università di Roma Tor Vergata (2006).
    [12]
    Marco Bodrato and Alberto Zanoni. 2007. Integer and polynomial multiplication: Towards optimal Toom-Cook matrices. In Proceedings of the 2007 international symposium on Symbolic and algebraic computation. 17--24.
    [13]
    Franck Cappello, Al Geist, William Gropp, Sanjay Kale, Bill Kramer, and Marc Snir. 2014. Toward exascale resilience: 2014 update. Supercomputing frontiers and innovations 1, 1 (2014), 5--28.
    [14]
    Henri Casanova. 2007. Benefits and drawbacks of redundant batch requests. Journal of Grid Computing 5, 2 (2007), 235--250.
    [15]
    Zizhong Chen and Jack Dongarra. 2006. Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources. In Proceedings 20th IEEE International Parallel & Distributed Processing Symposium. IEEE, 10 pp.
    [16]
    Zizhong Chen and Jack Dongarra. 2008. Algorithm-based fault tolerance for fail-stop failures. IEEE Transactions on Parallel and Distributed Systems 19, 12 (2008), 1628--1641.
    [17]
    Stephen A Cook and Stål O. Aanderaa. 1969. On the minimum computation time of functions. Trans. Amer. Math. Soc. 142 (1969), 291--314.
    [18]
    Son Hoang Dau, Wentu Song, Alex Sprintson, and Chau Yuen. 2015. Constructions of MDS codes via random Vandermonde and Cauchy matrices over small fields. In 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 949--955.
    [19]
    Lorenzo De Stefani. 2020. Communication-optimal parallel standard and Karatsuba integer multiplication in the distributed memory model. arXiv preprint arXiv:2009.14590 (2020).
    [20]
    Lorenzo De Stefani. 2022. Brief Announcement: On the I/O Complexity of Sequential and Parallel Hybrid Integer Multiplication Algorithms. In Proceedings of the Twenty-Fifth Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '22). Association for Computing Machinery, New York, NY, USA.
    [21]
    Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107--113.
    [22]
    James Demmel, David Eliahu, Armando Fox, Shoaib Kamil, Benjamin Lipshitz, Oded Schwartz, and Omer Spillinger. 2013. Communication-optimal parallel recursive rectangular matrix multiplication. In 2013 IEEE 27th International Parallel and Distributed Processing Symposium. IEEE, 261--272.
    [23]
    Jinnan Ding, Shuguo Li, and Zhen Gu. 2018. High-speed ECC processor over NIST prime fields applied with Toom--Cook multiplication. IEEE Transactions on Circuits and Systems I: Regular Papers 66, 3 (2018), 1003--1016.
    [24]
    Sanghamitra Dutta, Viveck Cadambe, and Pulkit Grover. 2016. Short-dot: Computing large linear transforms distributedly using coded short dot products. Advances In Neural Information Processing Systems 29 (2016).
    [25]
    Sanghamitra Dutta, Mohammad Fahim, Farzin Haddadpour, Haewon Jeong, Viveck Cadambe, and Pulkit Grover. 2020. On the Optimal Recovery Threshold of Coded Matrix Multiplication. IEEE Transactions on Information Theory 66, 1 (2020), 278--301.
    [26]
    Elmootazbellah N Elnozahy and James S Plank. 2004. Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery. IEEE Transactions on Dependable and Secure Computing 1, 2 (2004), 97--108.
    [27]
    Rūsiņš Freivalds. 1979. Fast probabilistic algorithms. In International Symposium on Mathematical Foundations of Computer Science. Springer, 57--69.
    [28]
    Martin Fürer. 2009. Faster integer multiplication. SIAM J. Comput. 39, 3 (2009), 979--1005.
    [29]
    Kristen Gardner, Samuel Zbarsky, Sherwin Doroudi, Mor Harchol-Balter, and Esa Hyytia. 2015. Reducing latency via redundant requests: Exact analysis. ACM SIGMETRICS Performance Evaluation Review 43, 1 (2015), 347--360.
    [30]
    Torbjörn Granlund. 2004. GNU MP: The GNU multiple precision arithmetic library. http://gmplib.org/ (2004).
    [31]
    Torbjörn Granlund and Peter L Montgomery. 1994. Division by invariant integers using multiplication. In Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation. 61--72.
    [32]
    Zhen Gu and Shuguo Li. 2018. A division-free Toom--Cook multiplication-based Montgomery modular multiplication. IEEE Transactions on Circuits and Systems II: Express Briefs 66, 8 (2018), 1401--1405.
    [33]
    John A Gunnels, Daniel S Katz, Enrique S Quintana-Orti, and Robert A van de Geijn. 2001. Fault-tolerant high-performance matrix multiplication: Theory and practice. In Dependable Systems and Networks, 2001. DSN 2001. International Conference on. IEEE, 47--56.
    [34]
    David Harvey and Joris Van Der Hoeven. 2021. Integer multiplication in time O(n log n). Annals of Mathematics 193, 2 (2021), 563--617.
    [35]
    Sangwoo Hong, Heecheol Yang, Youngseok Yoon, Taehyun Cho, and Jungwoo Lee. 2021. Chebyshev Polynomial Codes: Task Entanglement-based Coding for Distributed Matrix Multiplication. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 4319--4327.
    [36]
    Kuang-Hua Huang and Jacob A Abraham. 1984. Algorithm-based fault tolerance for matrix operations. IEEE transactions on computers 100, 6 (1984), 518--528.
    [37]
    Longbo Huang, Sameer Pawar, Hao Zhang, and Kannan Ramchandran. 2012. Codes can reduce queueing delay in data centers. In 2012 IEEE International Symposium on Information Theory Proceedings. IEEE, 2766--2770.
    [38]
    Dror Irony, Sivan Toledo, and Alexander Tiskin. 2004. Communication lower bounds for distributed-memory matrix multiplication. J. Parallel and Distrib. Comput. 64, 9 (2004), 1017--1026.
    [39]
    Gauri Joshi, Yanpei Liu, and Emina Soljanin. 2014. On the delay-storage trade-off in content download from coded distributed storage systems. IEEE Journal on Selected Areas in Communications 32, 5 (2014), 989--997.
    [40]
    Gauri Joshi, Emina Soljanin, and Gregory Wornell. 2017. Efficient redundancy techniques for latency reduction in cloud systems. ACM Transactions on Modeling and Performance Evaluation of Computing Systems (TOMPECS) 2, 2 (2017), 1--30.
    [41]
    Anatoly Karatsuba and Yu Ofman. 1963. Multiplication of Multidigit Numbers on Automata. Soviet Physics-Doklady 7 (1963), 595--596.
    [42]
    Richard Koo and Sam Toueg. 1987. Checkpointing and rollback-recovery for distributed systems. IEEE Transactions on Software Engineering 1 (1987), 23--31.
    [43]
    Jérome Lacan and Jérome Fimes. 2004. Systematic MDS erasure codes based on Vandermonde matrices. IEEE Communications Letters 8, 9 (2004), 570--572.
    [44]
    Harashta Tatimma Larasati, Asep Muhamad Awaludin, Janghyun Ji, and Howon Kim. 2021. Quantum Circuit Design of Toom 3-Way Multiplication. Applied Sciences 11, 9 (2021), 3752.
    [45]
    Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran. 2017. Speeding up distributed machine learning using codes. IEEE Transactions on Information Theory 64, 3 (2017), 1514--1529.
    [46]
    Kangwook Lee, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran. 2017. Coded computation for multicore setups. In 2017 IEEE International Symposium on Information Theory (ISIT). IEEE, 2413--2417.
    [47]
    Kangwook Lee, Changho Suh, and Kannan Ramchandran. 2017. High-dimensional coded matrix multiplication. In 2017 IEEE International Symposium on Information Theory (ISIT). IEEE, 2418--2422.
    [48]
    KU Leuven, Reviewers Marco Lewandowsky, and Miha Stopar. 2020. D5.3 Final Report on Hardware-Optimized Schemes. FENTEC (2020).
    [49]
    S. Li, M. A. Maddah-Ali, and A. S. Avestimehr. 2015. Coded MapReduce. In 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton). 964--971.
    [50]
    Songze Li, Mohammad Ali Maddah-Ali, Qian Yu, and A Salman Avestimehr. 2017. A fundamental tradeoff between computation and communication in distributed computing. IEEE Transactions on Information Theory 64, 1 (2017), 109--128.
    [51]
    Songze Li, Sucha Supittayapornpong, Mohammad Ali Maddah-Ali, and Salman Avestimehr. 2017. Coded TeraSort. In 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 389--398.
    [52]
    Benjamin Lipshitz, Grey Ballard, James Demmel, and Oded Schwartz. 2012. Communication-avoiding parallel Strassen: Implementation and performance. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society Press, 101.
    [53]
    Qingshan Luo and John B Drake. 1995. A scalable parallel Strassen's matrix multiplication algorithm for distributed-memory computers. In Proceedings of the 1995 ACM symposium on Applied computing. ACM, 221--226.
    [54]
    Ankur Mallick, Malhar Chaudhari, Utsav Sheth, Ganesh Palanikumar, and Gauri Joshi. 2019. Rateless codes for near-perfect load balancing in distributed matrix-vector multiplication. Proceedings of the ACM on Measurement and Analysis of Computing Systems 3, 3 (2019), 1--40.
    [55]
    Ankur Mallick, Malhar Chaudhari, Utsav Sheth, Ganesh Palanikumar, and Gauri Joshi. 2022. Rateless codes for near-perfect load balancing in distributed matrix-vector multiplication. Commun. ACM 65, 5 (2022), 111--118.
    [56]
    William F McColl and Alexandre Tiskin. 1999. Memory-efficient matrix multiplication in the BSP model. Algorithmica 24, 3 (1999), 287--297.
    [57]
    Michael Moldaschl, Karl E Prikopa, and Wilfried N Gansterer. 2017. Fault tolerant communication-optimal 2.5D matrix multiplication. J. Parallel and Distrib. Comput. 104 (2017), 179--190.
    [58]
    Peter L Montgomery. 2005. Five, six, and seven-term Karatsuba-like formulae. IEEE Trans. Comput. 54, 3 (2005), 362--369.
    [59]
    Adam Moody, Greg Bronevetsky, Kathryn Mohror, and Bronis R De Supinski. 2010. Design, modeling, and evaluation of a scalable multi-level checkpointing system. In SC'10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1--11.
    [60]
    Duc Tri Nguyen and Kris Gaj. 2021. Fast NEON-based multiplication for lattice-based NIST Post-Quantum Cryptography finalists. In International Conference on Post-Quantum Cryptography. Springer, 234--254.
    [61]
    Roy Nissim and Oded Schwartz. 2023. Accelerating Distributed Matrix Multiplication with 4-Dimensional Polynomial Codes. In SIAM Conference on Applied and Computational Discrete Algorithms (ACDA23). SIAM, 134--146.
    [62]
    James S Plank, Kai Li, and Michael A Puening. 1998. Diskless checkpointing. IEEE Transactions on Parallel and Distributed Systems 9, 10 (1998), 972--986.
    [63]
    Dedy Septono Catur Putranto, Rini Wisnu Wardhani, Harashta Tatimma Larasati, and Howon Kim. 2023. Space and Time-Efficient Quantum Multiplier in Post Quantum Cryptography Era. IEEE Access 11 (2023), 21848--21862.
    [64]
    Amirhossein Reisizadeh, Saurav Prakash, Ramtin Pedarsani, and Amir Salman Avestimehr. 2019. Coded computation over heterogeneous clusters. IEEE Transactions on Information Theory 65, 7 (2019), 4227--4242.
    [65]
    Mahdi Sajadieh, Mohammad Dakhilalian, Hamid Mala, and Behnaz Omoomi. 2012. On construction of involutory MDS matrices from Vandermonde matrices in GF(2^q). Designs, Codes and Cryptography 64 (2012), 287--308.
    [66]
    Peter Sanders and Jop F Sibeyn. 2003. A bandwidth latency tradeoff for broadcast and reduction. Inform. Process. Lett. 86, 1 (2003), 33--38.
    [67]
    Albin Severinson, Alexandre Graell i Amat, and Eirik Rosnes. 2018. Block-diagonal and LT codes for distributed computing with straggling servers. IEEE Transactions on Communications 67, 3 (2018), 1739--1753.
    [68]
    Marc Snir, Robert W Wisniewski, Jacob A Abraham, Sarita V Adve, Saurabh Bagchi, Pavan Balaji, Jim Belak, Pradip Bose, Franck Cappello, Bill Carlson, Andrew A Chien, Paul Coteus, Nathan A DeBardeleben, Pedro C Diniz, Christian Engelmann, Mattan Erez, Saverio Fazzari, Al Geist, Rinku Gupta, Fred Johnson, Sriram Krishnamoorthy, Sven Leyffer, Dean Liberty, Subhasish Mitra, Todd Munson, Rob Schreiber, Jon Stearley, and Eric Van Hensbergen. 2014. Addressing failures in exascale computing. The International Journal of High Performance Computing Applications 28, 2 (2014), 129--173.
    [69]
    Kyungrak Son and Wan Choi. 2022. Distributed Matrix Multiplication Based on Frame Quantization for Straggler Mitigation. IEEE Transactions on Signal Processing 70 (2022).
    [70]
    Pedro Soto, Jun Li, and Xiaodi Fan. 2019. Dual Entangled Polynomial Code: Three-Dimensional Coding for Distributed Matrix Multiplication. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.). PMLR, 5937--5945.
    [71]
    Rashish Tandon, Qi Lei, Alexandros G Dimakis, and Nikos Karampatziakis. 2017. Gradient coding: Avoiding stragglers in distributed learning. In International Conference on Machine Learning. PMLR, 3368--3376.
    [72]
    Andrei L Toom. 1963. The complexity of a scheme of functional elements realizing the multiplication of integers. In Soviet Mathematics Doklady, Vol. 3. 714--716.
    [73]
    Shih-hsiung Tung. 1964. Harnack's inequality and theorems on matrix spaces. Proc. Amer. Math. Soc. 15, 3 (1964), 375--381.
    [74]
    Robert A Van de Geijn and Jerrell Watts. 1997. SUMMA: Scalable universal matrix multiplication algorithm. Concurrency-Practice and Experience 9, 4 (1997), 255--274.
    [75]
    Ashish Vulimiri, Philip Brighten Godfrey, Radhika Mittal, Justine Sherry, Sylvia Ratnasamy, and Scott Shenker. 2013. Low latency via redundancy. In Proceedings of the ninth ACM conference on Emerging networking experiments and technologies. 283--294.
    [76]
    Da Wang, Gauri Joshi, and Gregory Wornell. 2014. Efficient task replication for fast response times in parallel computation. In The 2014 ACM international conference on Measurement and modeling of computer systems. 599--600.
    [77]
    Da Wang, Gauri Joshi, and Gregory Wornell. 2015. Using straggler replication to reduce latency in large-scale parallel computing. ACM SIGMETRICS Performance Evaluation Review 43, 3 (2015), 7--11.
    [78]
    Sinong Wang, Jiashang Liu, and Ness Shroff. 2018. Coded sparse matrix multiplication. In International Conference on Machine Learning. PMLR, 5152--5160.
    [79]
    Yi-Min Wang, Yennun Huang, Kiem-Phong Vo, Pi-Yu Chung, and Chandra Kintala. 1995. Checkpointing and its applications. In Twenty-fifth International Symposium on Fault-Tolerant Computing. Digest of papers. IEEE, 22--31.
    [80]
    Paul B Yale. 2014. Geometry and symmetry. Courier Corporation.
    [81]
    C-Q Yang and Barton P Miller. 1988. Critical path analysis for the execution of parallel and distributed programs. In The 8th International Conference on Distributed Computing Systems. IEEE Computer Society, 366--367.
    [82]
    Qian Yu, Mohammad Maddah-Ali, and Salman Avestimehr. 2017. Polynomial Codes: an Optimal Design for High-Dimensional Coded Matrix Multiplication. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc.
    [83]
    Qian Yu, Mohammad Ali Maddah-Ali, and A. Salman Avestimehr. 2020. Straggler Mitigation in Distributed Matrix Multiplication: Fundamental Limits and Optimal Coding. IEEE Transactions on Information Theory 66, 3 (2020), 1920--1933.
    [84]
    Alberto Zanoni. 2009. Toom-Cook 8-way for long integers multiplication. In 2009 11th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing. IEEE, 54--57.
    [85]
    Alberto Zanoni. 2010. Iterative Toom-Cook methods for very unbalanced long integer multiplication. In Proceedings of the 2010 International Symposium on Symbolic and Algebraic Computation. 319--323.
    [86]
    Dan Zuras. 1993. On squaring and multiplying large integers. In Proceedings of IEEE 11th Symposium on Computer Arithmetic. IEEE, 260--271.

    Published In

    SPAA '24: Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures
    June 2024
    510 pages
    ISBN:9798400704161
    DOI:10.1145/3626183
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 June 2024

    Author Tags

    1. fault tolerance
    2. i/o complexity
    3. long integer multiplication
    4. parallel computing
    5. toom-cook

    Qualifiers

    • Research-article

    Funding Sources

    • Science Accelerator and by the Frontiers in Science initiative of the Ministry of Innovation, Science and Technology
    • European Research Council under the European Union's Horizon 2020 research and innovation programme
    • Israel Science Foundation
    • European Union's Horizon 2020 research and innovation programme

    Conference

    SPAA '24

    Acceptance Rates

    Overall Acceptance Rate 447 of 1,461 submissions, 31%
