Modular SIMD arithmetic in Mathemagix

Authors:

Joris van der Hoeven,

Grégoire Lecerf,

Guillaume Quintin

ACM Transactions on Mathematical Software (TOMS), Volume 43, Issue 1

Article No.: 5, Pages 1 - 37

https://doi.org/10.1145/2876503

Published: 29 August 2016

Abstract

Modular integer arithmetic occurs in many algorithms for computer algebra, cryptography, and error correcting codes. Although recent microprocessors typically offer a wide range of highly optimized arithmetic functions, modular integer operations still require dedicated implementations. In this article, we survey existing algorithms for modular integer arithmetic and present detailed vectorized counterparts. We also describe several applications, such as fast modular Fourier transforms and multiplication of integer polynomials and matrices. The vectorized algorithms have been implemented in C++ inside the free computer algebra and analysis system Mathemagix. The performance of our implementation is illustrated by various benchmarks.
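
To make the kind of operation discussed in the abstract concrete, the following is a minimal C++ sketch of word-size modular arithmetic. It is not taken from Mathemagix: the names `Modulus`, `make_modulus`, `reduce`, `mul_mod`, `add_mod`, `sub_mod`, and `add_mod_array` are hypothetical, and the sketch assumes a modulus p with 1 < p < 2^31 so that sums fit in 32 bits and products fit in 64 bits. Multiplication uses a Barrett-style reduction with a precomputed reciprocal, which replaces the hardware division by multiplications and shifts. The array routine uses a branch-free minimum, a form that compilers can often auto-vectorize; hand-written SIMD kernels are not reproduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper type (an assumption of this sketch, not a library API).
struct Modulus {
  std::uint32_t p;   // modulus, assumed to satisfy 1 < p < 2^31
  unsigned k;        // bit length of p: 2^(k-1) <= p < 2^k
  std::uint64_t mu;  // precomputed reciprocal floor(2^(2k) / p)
};

// One-time precomputation: the only place where a hardware division is used.
Modulus make_modulus(std::uint32_t p) {
  assert(p > 1 && p < (1u << 31));
  unsigned k = 0;
  while ((p >> k) != 0) ++k;
  const std::uint64_t mu = (std::uint64_t{1} << (2 * k)) / p;
  return Modulus{p, k, mu};
}

// Barrett-style reduction of x < 2^(2k). The quotient estimate q is at most
// 2 below the exact quotient, so two conditional subtractions suffice.
std::uint32_t reduce(std::uint64_t x, const Modulus& m) {
  const std::uint64_t q = ((x >> (m.k - 1)) * m.mu) >> (m.k + 1);
  std::uint64_t r = x - q * m.p;
  if (r >= m.p) r -= m.p;
  if (r >= m.p) r -= m.p;
  return static_cast<std::uint32_t>(r);
}

// Product of two residues a, b < p; the 64-bit product is below p^2 < 2^(2k).
std::uint32_t mul_mod(std::uint32_t a, std::uint32_t b, const Modulus& m) {
  return reduce(static_cast<std::uint64_t>(a) * b, m);
}

// Sum of two residues a, b < p; a + b < 2^32 because p < 2^31.
std::uint32_t add_mod(std::uint32_t a, std::uint32_t b, std::uint32_t p) {
  const std::uint32_t s = a + b;
  return s >= p ? s - p : s;
}

// Difference of two residues a, b < p.
std::uint32_t sub_mod(std::uint32_t a, std::uint32_t b, std::uint32_t p) {
  return a >= b ? a - b : a + (p - b);
}

// Element-wise modular sum over arrays. If s < p, then s - p wraps around to
// a value above 2^31, so the unsigned minimum selects the reduced result
// without a branch.
void add_mod_array(std::uint32_t* dst, const std::uint32_t* a,
                   const std::uint32_t* b, std::size_t n, std::uint32_t p) {
  for (std::size_t i = 0; i < n; ++i) {
    const std::uint32_t s = a[i] + b[i];
    dst[i] = std::min(s, s - p);
  }
}
```

A caller would build the `Modulus` once per modulus and reuse it across many multiplications, which is where the precomputed reciprocal pays off. The restriction to p < 2^31 is a simplification chosen for this sketch; other word sizes need different bounds.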

Index Terms

Modular SIMD arithmetic in Mathemagix
1. Mathematics of computing
  1. Mathematical analysis
    1. Numerical analysis
      1. Computations on matrices
      2. Computations on polynomials
  2. Mathematical software
    1. Mathematical software performance
2. Theory of computation
  1. Design and analysis of algorithms
    1. Parallel algorithms
      1. Vector / streaming algorithms

Published In

ACM Transactions on Mathematical Software Volume 43, Issue 1

March 2017

202 pages

ISSN:0098-3500

EISSN:1557-7295

DOI:10.1145/2987591

Editor:
Michael A. Heroux
Sandia National Laboratories, USA

Copyright © 2016 ACM.

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 August 2016

Accepted: 01 January 2016

Revised: 01 June 2015

Received: 01 July 2014

Published in TOMS Volume 43, Issue 1

Qualifiers

Research-article
Research
Refereed

Bibliometrics

Total Citations: 7
Total Downloads: 246
Downloads (Last 12 months): 52
Downloads (Last 6 weeks): 4

Reflects downloads up to 16 Feb 2025

Cited By

Takahashi D (2025). Parallel Implementation of Number-Theoretic Transform on GPU Clusters. Algorithms and Architectures for Parallel Processing, 204-218. Online publication date: 15-Feb-2025.
https://doi.org/10.1007/978-981-96-1542-1_12
van der Hoeven J, Lecerf G (2024). Fast interpolation of multivariate polynomials with sparse exponents. Journal of Complexity, 101922. Online publication date: Dec-2024.
https://doi.org/10.1016/j.jco.2024.101922
Berthomieu J, Graillat S, Lesnoff D, Mary T (2023). Modular Matrix Multiplication on GPU for Polynomial System Solving. ACM Communications in Computer Algebra 57:2, 35-38. Online publication date: 7-Aug-2023.
https://dl.acm.org/doi/10.1145/3614408.3614411
Jesus R, Oliveira e Silva T, Weiland M (2023). Vectorizing and distributing number-theoretic transform to count Goldbach partitions on Arm-based supercomputers. Concurrency and Computation: Practice and Experience 35:28. Online publication date: 14-Aug-2023.
https://doi.org/10.1002/cpe.7882
Boemer F, Kim S, Seifu G, D.M. de Souza F, Gopal V, Brenner M, Player R, Rohloff K (2021). Intel HEXL. Proceedings of the 9th on Workshop on Encrypted Computing & Applied Homomorphic Cryptography, 57-62. Online publication date: 15-Nov-2021.
https://dl.acm.org/doi/10.1145/3474366.3486926
van der Hoeven J, Monagan M (2021). Computing one billion roots using the tangent Graeffe method. ACM Communications in Computer Algebra 54:3, 65-85. Online publication date: 15-Mar-2021.
https://dl.acm.org/doi/10.1145/3457341.3457342
Doliskani J, Giorgi P, Lebreton R, Schost E (2018). Simultaneous Conversions with the Residue Number System Using Linear Algebra. ACM Transactions on Mathematical Software 44:3, 1-21. Online publication date: 3-Jan-2018.
https://dl.acm.org/doi/10.1145/3145573
