research-article

Open access

Grouping Time Series for Efficient Columnar Storage

Authors:

Chenguang Fang,

Xiangdong Huang,

Jianmin WangAuthors Info & Claims

Proceedings of the ACM on Management of Data, Volume 1, Issue 1

Article No.: 23, Pages 1 - 26

https://doi.org/10.1145/3588703

Published: 30 May 2023 Publication History

Abstract

Columnar storage is now an industry standard design in most open-source or commercial time series database products, making them HTAP systems. The time column of a time series serves as the key for identifying the other value column, namely single-column storage scheme. When multiple time series share a similar set of timestamps, very likely in a module of multiple sensors, it is natural to group them together, i.e., one time column identifies multiple value columns in a single-group storage scheme. While multiple value columns sharing the same time column reduce the space cost of repeating timestamps, it may introduce extra space cost for recording null values. The reason is that time series may not be exactly aligned on each timestamp, owing to missing values, distinct data collection frequencies, unsynchronized clocks and so on. The columngroups storage scheme is thus to divide columns into multiple groups, within which the value columns share the same time column. Unfortunately, the problem of finding the optimal column groups for the minimum space cost is highly challenging, NP-hard according to our analysis. Thereby, we propose a heuristic algorithm for automatically grouping time series for efficient columnar storage. The column groups storage has been deployed in Apache IoTDB, an open-source time series database. The extensive performance analysis, over real-world data from our industrial partners, demonstrates that the proposed column groups achieve near optimal storage, more concise than the storage of single-column or single-group schemes. Interestingly, both the flushing and querying time costs of column groups are comparable to those of single-column or singlegroup, i.e., without incurring extra time cost.

Supplemental Material

MP4 File

Presentation video of SIGMOD 2023 paper: Grouping Time Series for Efficient Columnar Storage

Download
20.46 MB

References

[1]

Apache HBase. https://hbase.apache.org/.

[2]

Apache HBase Implementation. https://github.com/iotdbColumnGroup/HBase.

[3]

Apache IoTDB. http://iotdb.apache.org.

[4]

Appendix. https://iotdbcolumngroup.github.io/iotdbColumnGroup/appendix.pdf.

[5]

Code and Data. https://github.com/iotdbColumnGroup/iotdbColumnGroup.

[6]

InfluxDB. https://www.influxdata.com/.

[7]

OpenTSDB. http://opentsdb.net/.

[8]

Prometheus. https://prometheus.io.

[9]

Source Code. https://github.com/apache/iotdb/tree/research/auto-aligned.

[10]

TDengine. https://github.com/taosdata/TDengine/.

[11]

TDengine Implementation. https://github.com/iotdbColumnGroup/TDengine.

[12]

TimescaleDB. https://www.timescale.com.

[13]

Zomato Dataset. https://www.kaggle.com/himanshupoddar/zomato-bangalore-restaurants.

[14]

Arvind Arasu, Venkatesh Ganti, and Raghav Kaushik. 2006. Efficient Exact Set-Similarity Joins. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12--15, 2006. ACM, 918--929. http://dl.acm.org/citation.cfm?id=1164206

Digital Library

[15]

Esther M Arkin and Refael Hassin. 1998. On local search for weighted k-set packing. Mathematics of Operations Research 23, 3 (1998), 640--648.

[16]

Haoqiong Bian, Ying Yan, Wenbo Tao, Liang Jeff Chen, Yueguo Chen, Xiaoyong Du, and Thomas Moscibroda. 2017. Wide Table Layout Optimization based on Column Ordering and Duplication. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14--19, 2017. ACM, 299--314. https://doi.org/10.1145/3035918.3035930

Digital Library

[17]

Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, and Robert Gruber. 2006. Bigtable: A Distributed Storage System for Structured Data (Awarded Best Paper!). In 7th Symposium on Operating Systems Design and Implementation (OSDI '06), November 6--8, Seattle, WA, USA, Brian N. Bershad and Jeffrey C. Mogul (Eds.). USENIX Association, 205--218. http://www.usenix.org/events/osdi06/tech/chang.html

[18]

Surajit Chaudhuri, Venkatesh Ganti, and Raghav Kaushik. 2006. A Primitive Operator for Similarity Joins in Data Cleaning. In Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006, 3--8 April 2006, Atlanta, GA, USA. IEEE Computer Society, 5. https://doi.org/10.1109/ICDE.2006.9

Digital Library

[19]

Avrilia Floratou, Jignesh M. Patel, Eugene J. Shekita, and Sandeep Tata. 2011. Column-Oriented Storage Techniques for MapReduce. Proc. VLDB Endow. 4, 7 (2011), 419--429. https://doi.org/10.14778/1988776.1988778

Digital Library

[20]

Pranjal Gupta, Amine Mhedhbi, and Semih Salihoglu. 2021. Columnar Storage and List-based Processing for Graph Database Management Systems. Proc. VLDB Endow. 14, 11 (2021), 2491--2504. http://www.vldb.org/pvldb/vol14/p2491-gupta.pdf

Digital Library

[21]

Muon Ha and Yulia A. Shichkina. 2022. Translating a Distributed Relational Database to a Document Database. Data Sci. Eng. 7, 2 (2022), 136--155. https://doi.org/10.1007/s41019-022-00181--9

[22]

Shuai Han, Mingxia Liu, and Jian-Zhong Li. 2022. Efficient Partitioning Method for Optimizing the Compression on Array Data. J. Comput. Sci. Technol. 37, 5 (2022), 1049--1067. https://doi.org/10.1007/s11390-022--2371--7

Digital Library

[23]

Donghe Kang, Ruochen Jiang, and Spyros Blanas. 2021. Jigsaw: A Data Storage and Query Processing Engine for Irregular Table Partitioning. In SIGMOD '21: International Conference on Management of Data, Virtual Event, China, June 20--25, 2021. 898--911. https://doi.org/10.1145/3448016.3457547

Digital Library

[24]

Per-Åke Larson, Eric N. Hanson, and Susan L. Price. 2012. Columnar Storage in SQL Server 2012. IEEE Data Eng. Bull. 35, 1 (2012), 15--20. http://sites.computer.org/debull/A12mar/apollo.pdf

[25]

Yinan Mei, Shaoxu Song, Yunsu Lee, Jungho Park, Soo-Hyung Kim, and Sungmin Yi. 2020. Representing Temporal Attributes for Schema Matching. In KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23--27, 2020. ACM, 709--719. https://doi.org/10.1145/3394486.3403115

Digital Library

[26]

Jeffrey C. Mogul, Fred Douglis, Anja Feldmann, and Balachander Krishnamurthy. 1997. Potential Benefits of Delta Encoding and Data Compression for HTTP. In Proceedings of the ACM SIGCOMM 1997 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, September 14--18, 1997, Cannes, France. ACM, 181--194. https://doi.org/10.1145/263105.263162

Digital Library

[27]

M. Tamer Özsu. 2016. A survey of RDF data management systems. Frontiers Comput. Sci. 10, 3 (2016), 418--432. https://doi.org/10.1007/s11704-016--5554-y

Digital Library

[28]

Tuomas Pelkonen, Scott Franklin, Paul Cavallaro, Qi Huang, Justin Meza, Justin Teller, and Kaushik Veeraraghavan. 2015. Gorilla: A Fast, Scalable, In-Memory Time Series Database. Proc. VLDB Endow. 8, 12 (2015), 1816--1827. https://doi.org/10.14778/2824032.2824078

Digital Library

[29]

Horst Samulowitz, Chandra Reddy, Ashish Sabharwal, and Meinolf Sellmann. 2013. Snappy: A Simple Algorithm Portfolio. In Theory and Applications of Satisfiability Testing - SAT 2013 - 16th International Conference, Helsinki, Finland, July 8--12, 2013. Proceedings (Lecture Notes in Computer Science, Vol. 7962), Matti Järvisalo and Allen Van Gelder (Eds.). Springer, 422--428. https://doi.org/10.1007/978--3--642--39071--5_33

[30]

Dilbag Singh, Yavuz Selim Taspinar, Ramazan Kursun, Ilkay Cinar, Murat Koklu, Ilker Ali Ozkan, and Heung-No Lee. 2022. Classification and Analysis of Pistachio Species with Pre-Trained Deep Learning Models. Electronics 11, 7 (2022), 981.

[31]

Shaoxu Song, Yue Cao, and Jianmin Wang. 2016. Cleaning Timestamps with Temporal Constraints. Proc. VLDB Endow. 9, 10 (2016), 708--719. https://doi.org/10.14778/2977797.2977798

Digital Library

[32]

Shaoxu Song and Lei Chen. 2010. Efficient set-correlation operator inside databases. In Proceedings of the 19th ACM Conference on Information and Knowledge Management, CIKM 2010, Toronto, Ontario, Canada, October 26--30, 2010. ACM, 139--148. https://doi.org/10.1145/1871437.1871459

Digital Library

[33]

Chen Wang, Xiangdong Huang, Jialin Qiao, Tian Jiang, Lei Rui, Jinrui Zhang, Rong Kang, Julian Feinauer, Kevin Mcgrail, Peng Wang, Diaohan Luo, Jun Yuan, Jianmin Wang, and Jiaguang Sun. 2020. Apache IoTDB: Time-series database for Internet of Things. Proc. VLDB Endow. 13, 12 (2020), 2901--2904. https://doi.org/10.14778/3415478.3415504

Digital Library

[34]

Jiannan Wang, Guoliang Li, and Jianhua Feng. 2012. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20--24, 2012. ACM, 85--96. https://doi.org/10.1145/2213836.2213847

Digital Library

[35]

Zhiqi Wang, Jin Xue, and Zili Shao. 2021. Heracles: An Efficient Storage Model And Data Flushing For Performance Monitoring Timeseries. Proc. VLDB Endow. 14, 6 (2021), 1080--1092. https://doi.org/10.14778/3447689.3447710

Digital Library

[36]

Jiaye Wu, Peng Wang, Ningting Pan, Chen Wang, Wei Wang, and Jianmin Wang. 2019. KV-Match: A Subsequence Matching Approach Supporting Normalization and Time Warping. In 35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, China, April 8--11, 2019. IEEE, 866--877. https://doi.org/10.1109/ICDE.2019.00082

Cited By

Fang CChen ZSong SHuang XWang CWang J(2024)On Reducing Space Amplification with Multi-Column Compaction in Apache IoTDBProceedings of the VLDB Endowment10.14778/3681954.368197717:11(2974-2986)Online publication date: 30-Aug-2024
https://dl.acm.org/doi/10.14778/3681954.3681977
Xia TXiao JHuang YHu CSong SHuang XWang J(2024)Time series data encoding in Apache IoTDB: comparative analysis and recommendationThe VLDB Journal10.1007/s00778-024-00840-533:3(727-752)Online publication date: 12-Feb-2024
https://doi.org/10.1007/s00778-024-00840-5
Liang ZZhang TZheng BWang H(2024)Towards Real-Time Data Ingestion for Industrial Internet of ThingsDatabase Systems for Advanced Applications10.1007/978-981-97-5575-2_38(457-461)Online publication date: 2-Jul-2024
https://dl.acm.org/doi/10.1007/978-981-97-5575-2_38
Show More Cited By

Index Terms

Grouping Time Series for Efficient Columnar Storage
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Stream management

Recommendations

A flash-based decomposition storage model
DASFAA'12: Proceedings of the 17th international conference on Database Systems for Advanced Applications

The traditional HDD-based columnar storage is an important technology to improve the performance of query-intensive database. However, some features of HDD weaken the advantages of columnar storage. In this paper, we study the advantages of SSD over HDD ...
A Comparative Study of Row and Column Storage for Time Series Data
Spatial Data and Intelligence
Abstract
Over the past few decades, researchers have done massive research on data storage structures of relational databases. The existing research mainly focuses on the analysis and optimization of row and column storage structures in relational ...
Automatic Array Transformation to Columnar Storage at Run Time
MPLR '22: Proceedings of the 19th International Conference on Managed Programming Languages and Runtimes

Today’s huge memories make it possible to store and process large data structures in memory instead of in a database. Hence, accesses to this data should be optimized, which is normally relegated either to the runtimes and compilers or is left to the ...

Comments

Information & Contributors

Information

Published In

cover image Proceedings of the ACM on Management of Data

Proceedings of the ACM on Management of Data Volume 1, Issue 1

PACMMOD

May 2023

2807 pages

EISSN:2836-6573

DOI:10.1145/3603164

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Issue’s Table of Contents

Copyright © 2023 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 May 2023

Published in PACMMOD Volume 1, Issue 1

Author Tags

Qualifiers

Research-article

Funding Sources

National Key Research and Development Plan (2021YFB3300500)

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

4
Total Citations
View Citations
1,103
Total Downloads

Downloads (Last 12 months)715
Downloads (Last 6 weeks)52

Reflects downloads up to 10 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Fang CChen ZSong SHuang XWang CWang J(2024)On Reducing Space Amplification with Multi-Column Compaction in Apache IoTDBProceedings of the VLDB Endowment10.14778/3681954.368197717:11(2974-2986)Online publication date: 30-Aug-2024
https://dl.acm.org/doi/10.14778/3681954.3681977
Xia TXiao JHuang YHu CSong SHuang XWang J(2024)Time series data encoding in Apache IoTDB: comparative analysis and recommendationThe VLDB Journal10.1007/s00778-024-00840-533:3(727-752)Online publication date: 12-Feb-2024
https://doi.org/10.1007/s00778-024-00840-5
Liang ZZhang TZheng BWang H(2024)Towards Real-Time Data Ingestion for Industrial Internet of ThingsDatabase Systems for Advanced Applications10.1007/978-981-97-5575-2_38(457-461)Online publication date: 2-Jul-2024
https://dl.acm.org/doi/10.1007/978-981-97-5575-2_38
Wang CQiao JHuang XSong SHou HJiang TRui LWang JSun J(2023)Apache IoTDB: A Time Series Database for IoT ApplicationsProceedings of the ACM on Management of Data10.1145/35897751:2(1-27)Online publication date: 20-Jun-2023
https://dl.acm.org/doi/10.1145/3589775

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Media

Figures

Other

Tables

View Issue’s Table of Contents