DOI:10.1145/3546918.3546919
research-article
Open access

Automatic Array Transformation to Columnar Storage at Run Time

Published: 30 November 2022

Abstract

Today’s huge memories make it possible to store and process large data structures in memory instead of in a database. Accesses to this data should therefore be optimized, a task that is normally either relegated to runtimes and compilers or left to developers, who often lack knowledge of optimization strategies. Because arrays are often part of the language, developers frequently use them as the underlying storage mechanism, so optimizing arrays may be vital to the performance of data-intensive applications. While compilers can apply numerous optimizations to speed up accesses, it would also be beneficial to adapt the actual layout of the data in memory to improve cache utilization. However, runtimes and compilers typically do not perform such memory layout optimizations. In this work, we present an approach that dynamically transforms arrays of objects into a columnar memory layout, a storage layout frequently used in analytical applications that enables faster processing of read-intensive workloads. Integrated into a state-of-the-art JavaScript runtime, our approach can speed up queries for large workloads by up to 9x, with the initial transformation overhead amortized over time.
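
To make the layout change concrete, the following is a minimal hand-written sketch, not the implementation described in the paper. It converts an array of objects (row layout) into one typed array per property (columnar layout) and runs the same aggregate query on both. The `Order` type, its fields, and all function names are hypothetical and chosen only for illustration. The approach in the abstract performs this transformation automatically inside the runtime, whereas here it is done manually in user code.

```typescript
// Hypothetical record type for illustration; not taken from the paper.
interface Order {
  id: number;
  price: number;
  quantity: number;
}

// Columnar layout: one contiguous typed array per property.
interface OrderColumns {
  id: Int32Array;
  price: Float64Array;
  quantity: Int32Array;
  length: number;
}

// One-time transformation from row layout (array of objects) to columns.
// This cost is paid once and is amortized over the queries that follow.
function toColumns(rows: Order[]): OrderColumns {
  const n = rows.length;
  const cols: OrderColumns = {
    id: new Int32Array(n),
    price: new Float64Array(n),
    quantity: new Int32Array(n),
    length: n,
  };
  rows.forEach((row, i) => {
    cols.id[i] = row.id;
    cols.price[i] = row.price;
    cols.quantity[i] = row.quantity;
  });
  return cols;
}

// Row-layout query: visits every object even though only `price` is needed.
function totalPriceRows(rows: Order[]): number {
  let sum = 0;
  rows.forEach((row) => {
    sum += row.price;
  });
  return sum;
}

// Columnar query: scans a single contiguous array of prices.
function totalPriceColumns(cols: OrderColumns): number {
  let sum = 0;
  for (let i = 0; i < cols.length; i++) {
    sum += cols.price[i];
  }
  return sum;
}
```

Both queries return the same result. The columnar version reads only the `price` values, which lie next to each other in memory, so a read-intensive scan is likely to use the cache better than a traversal of separately allocated objects.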

Published In

MPLR '22: Proceedings of the 19th International Conference on Managed Programming Languages and Runtimes
September 2022
161 pages
ISBN:9781450396967
DOI:10.1145/3546918
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Array Storage
  2. Columnar Storage
  3. Dynamic Compilation
  4. Dynamic Language
  5. Program optimization

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

MPLR '22

Article Metrics

  • Total Citations: 0
  • Total Downloads: 355
  • Downloads (Last 12 months): 198
  • Downloads (Last 6 weeks): 38

Reflects downloads up to 12 Sep 2024
