research-article

Open access

Automatic Array Transformation to Columnar Storage at Run Time

Authors:

Sebastian Kloibhofer,

David Leopoldseder,

Daniele Bonetta,

Hanspeter Mössenböck

MPLR '22: Proceedings of the 19th International Conference on Managed Programming Languages and Runtimes

Pages 16 - 28

https://doi.org/10.1145/3546918.3546919

Published: 30 November 2022

Abstract

Today’s large memories make it possible to store and process large data structures in memory instead of in a database. Accesses to this data should therefore be optimized, a task that is normally either delegated to runtimes and compilers or left to developers, who often lack knowledge of optimization strategies. Because arrays are often part of the language, developers frequently use them as the underlying storage mechanism, so optimizing arrays can be vital for the performance of data-intensive applications. While compilers can apply numerous optimizations to speed up accesses, it would also be beneficial to adapt the actual layout of the data in memory to improve cache utilization. However, runtimes and compilers typically do not perform such memory layout optimizations. In this work, we present an approach that dynamically transforms arrays of objects into a columnar memory layout, a storage layout frequently used in analytical applications that enables faster processing of read-intensive workloads. Integrated into a state-of-the-art JavaScript runtime, our approach can speed up queries for large workloads by up to 9x, with the initial transformation overhead amortized over time.
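
To make the idea of a columnar layout concrete, the following is a minimal hand-written sketch and not the mechanism described in the paper, which performs the transformation automatically inside the runtime. The `Order` record, the `OrderColumns` type, and the helper functions are hypothetical names chosen for illustration. The sketch assumes that every property is numeric, so each column fits into a `Float64Array`.

```typescript
// Hypothetical record type, for illustration only (not taken from the paper).
interface Order {
  id: number;
  price: number;
  quantity: number;
}

// Columnar form: one dense, contiguous array per property.
interface OrderColumns {
  id: Float64Array;
  price: Float64Array;
  quantity: Float64Array;
  length: number;
}

// One-time transformation pass over the row-wise array.
// Its cost is paid once and can be amortized over later queries.
function toColumns(rows: Order[]): OrderColumns {
  const cols: OrderColumns = {
    id: new Float64Array(rows.length),
    price: new Float64Array(rows.length),
    quantity: new Float64Array(rows.length),
    length: rows.length,
  };
  rows.forEach((row, i) => {
    cols.id[i] = row.id;
    cols.price[i] = row.price;
    cols.quantity[i] = row.quantity;
  });
  return cols;
}

// Row-wise query: visits every object, although only `price` is needed.
function sumPriceRows(rows: Order[]): number {
  return rows.reduce((total, row) => total + row.price, 0);
}

// Columnar query: scans a single contiguous array of prices.
function sumPriceColumns(cols: OrderColumns): number {
  return cols.price.reduce((total, price) => total + price, 0);
}

const rows: Order[] = [
  { id: 1, price: 10, quantity: 2 },
  { id: 2, price: 20, quantity: 1 },
  { id: 3, price: 30, quantity: 4 },
];
const cols = toColumns(rows);
```

Both query functions return the same result for the same data. The difference is that the columnar version reads only the values of the queried property from one contiguous array, which is the access pattern that tends to benefit cache utilization in read-intensive workloads.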


Index Terms

Automatic Array Transformation to Columnar Storage at Run Time
1. Information systems
  1. Information storage systems
    1. Record storage systems
      1. Relational storage
        Column based storage
2. Software and its engineering
  1. Software notations and tools
    1. Compilers


Published In

MPLR '22: Proceedings of the 19th International Conference on Managed Programming Languages and Runtimes

September 2022

161 pages

ISBN:9781450396967

DOI:10.1145/3546918

Copyright © 2022 Owner/Author.

This work is licensed under a Creative Commons Attribution-ShareAlike International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Oracle

Conference

MPLR '22

MPLR '22: 19th International Conference on Managed Programming Languages and Runtimes

September 14 - 15, 2022

Brussels, Belgium
