Modular SIMD arithmetic in Mathemagix

Published: 29 August 2016

Abstract

Modular integer arithmetic occurs in many algorithms for computer algebra, cryptography, and error correcting codes. Although recent microprocessors typically offer a wide range of highly optimized arithmetic functions, modular integer operations still require dedicated implementations. In this article, we survey existing algorithms for modular integer arithmetic and present detailed vectorized counterparts. We also describe several applications, such as fast modular Fourier transforms and multiplication of integer polynomials and matrices. The vectorized algorithms have been implemented in C++ inside the free computer algebra and analysis system Mathemagix. The performance of our implementation is illustrated by various benchmarks.
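
To make the notion of a dedicated modular operation concrete, the following C++ sketch shows one common branch-free formulation of modular addition and subtraction for a modulus below 2^31, applied element-wise to arrays. It is an illustration under stated assumptions, not the Mathemagix implementation: the names `add_mod`, `sub_mod`, and `add_mod_vec` are hypothetical, and the loop relies on the compiler's auto-vectorizer rather than on explicit SIMD intrinsics.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: (a + b) mod p for 0 <= a, b < p and p < 2^31.
// The sum fits in 32 bits. If a + b < p, the unsigned subtraction wraps
// around to a large value, so std::min keeps a + b; otherwise it keeps
// a + b - p. No branch is needed on the data.
inline std::uint32_t add_mod(std::uint32_t a, std::uint32_t b, std::uint32_t p) {
    assert(p < (1u << 31) && a < p && b < p);
    std::uint32_t s = a + b;
    std::uint32_t t = s - p;  // wraps modulo 2^32 when s < p
    return std::min(s, t);
}

// Hypothetical helper: (a - b) mod p under the same assumptions.
// If a < b, the difference wraps to a large value and adding p wraps it
// back into [0, p); otherwise d + p > d and std::min keeps d.
inline std::uint32_t sub_mod(std::uint32_t a, std::uint32_t b, std::uint32_t p) {
    assert(p < (1u << 31) && a < p && b < p);
    std::uint32_t d = a - b;  // wraps modulo 2^32 when a < b
    std::uint32_t t = d + p;
    return std::min(d, t);
}

// Element-wise modular sum of two equally sized arrays. The loop body has
// no data-dependent branch, which leaves room for compiler vectorization.
std::vector<std::uint32_t> add_mod_vec(const std::vector<std::uint32_t>& x,
                                       const std::vector<std::uint32_t>& y,
                                       std::uint32_t p) {
    assert(x.size() == y.size());
    std::vector<std::uint32_t> z(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        z[i] = add_mod(x[i], y[i], p);
    }
    return z;
}
```

For example, `add_mod(5, 6, 7)` yields 4 and `sub_mod(2, 5, 7)` yields 4. Modular multiplication is harder to treat this way because it needs a double-width product followed by a reduction step, for which techniques such as Barrett or Montgomery reduction are typically used.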

Published In

ACM Transactions on Mathematical Software, Volume 43, Issue 1
March 2017
202 pages
ISSN: 0098-3500
EISSN: 1557-7295
DOI: 10.1145/2987591
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 August 2016
Accepted: 01 January 2016
Revised: 01 June 2015
Received: 01 July 2014
Published in TOMS Volume 43, Issue 1

Author Tags

  1. Mathemagix
  2. Modular integer arithmetic
  3. fast Fourier transform
  4. integer product
  5. matrix product
  6. polynomial product

Qualifiers

  • Research-article
  • Research
  • Refereed
