Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article
Open access

Grouping Time Series for Efficient Columnar Storage

Published: 30 May 2023 Publication History

Abstract

Columnar storage is now an industry standard design in most open-source or commercial time series database products, making them HTAP systems. The time column of a time series serves as the key for identifying the other value column, namely single-column storage scheme. When multiple time series share a similar set of timestamps, very likely in a module of multiple sensors, it is natural to group them together, i.e., one time column identifies multiple value columns in a single-group storage scheme. While multiple value columns sharing the same time column reduce the space cost of repeating timestamps, it may introduce extra space cost for recording null values. The reason is that time series may not be exactly aligned on each timestamp, owing to missing values, distinct data collection frequencies, unsynchronized clocks and so on. The columngroups storage scheme is thus to divide columns into multiple groups, within which the value columns share the same time column. Unfortunately, the problem of finding the optimal column groups for the minimum space cost is highly challenging, NP-hard according to our analysis. Thereby, we propose a heuristic algorithm for automatically grouping time series for efficient columnar storage. The column groups storage has been deployed in Apache IoTDB, an open-source time series database. The extensive performance analysis, over real-world data from our industrial partners, demonstrates that the proposed column groups achieve near optimal storage, more concise than the storage of single-column or single-group schemes. Interestingly, both the flushing and querying time costs of column groups are comparable to those of single-column or singlegroup, i.e., without incurring extra time cost.

Supplemental Material

MP4 File
Presentation video of SIGMOD 2023 paper: Grouping Time Series for Efficient Columnar Storage

References

[1]
Apache HBase. https://hbase.apache.org/.
[2]
Apache HBase Implementation. https://github.com/iotdbColumnGroup/HBase.
[3]
Apache IoTDB. http://iotdb.apache.org.
[4]
Appendix. https://iotdbcolumngroup.github.io/iotdbColumnGroup/appendix.pdf.
[5]
Code and Data. https://github.com/iotdbColumnGroup/iotdbColumnGroup.
[6]
InfluxDB. https://www.influxdata.com/.
[7]
OpenTSDB. http://opentsdb.net/.
[8]
Prometheus. https://prometheus.io.
[9]
Source Code. https://github.com/apache/iotdb/tree/research/auto-aligned.
[10]
TDengine. https://github.com/taosdata/TDengine/.
[11]
TDengine Implementation. https://github.com/iotdbColumnGroup/TDengine.
[12]
TimescaleDB. https://www.timescale.com.
[13]
Zomato Dataset. https://www.kaggle.com/himanshupoddar/zomato-bangalore-restaurants.
[14]
Arvind Arasu, Venkatesh Ganti, and Raghav Kaushik. 2006. Efficient Exact Set-Similarity Joins. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12--15, 2006. ACM, 918--929. http://dl.acm.org/citation.cfm?id=1164206
[15]
Esther M Arkin and Refael Hassin. 1998. On local search for weighted k-set packing. Mathematics of Operations Research 23, 3 (1998), 640--648.
[16]
Haoqiong Bian, Ying Yan, Wenbo Tao, Liang Jeff Chen, Yueguo Chen, Xiaoyong Du, and Thomas Moscibroda. 2017. Wide Table Layout Optimization based on Column Ordering and Duplication. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14--19, 2017. ACM, 299--314. https://doi.org/10.1145/3035918.3035930
[17]
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, and Robert Gruber. 2006. Bigtable: A Distributed Storage System for Structured Data (Awarded Best Paper!). In 7th Symposium on Operating Systems Design and Implementation (OSDI '06), November 6--8, Seattle, WA, USA, Brian N. Bershad and Jeffrey C. Mogul (Eds.). USENIX Association, 205--218. http://www.usenix.org/events/osdi06/tech/chang.html
[18]
Surajit Chaudhuri, Venkatesh Ganti, and Raghav Kaushik. 2006. A Primitive Operator for Similarity Joins in Data Cleaning. In Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006, 3--8 April 2006, Atlanta, GA, USA. IEEE Computer Society, 5. https://doi.org/10.1109/ICDE.2006.9
[19]
Avrilia Floratou, Jignesh M. Patel, Eugene J. Shekita, and Sandeep Tata. 2011. Column-Oriented Storage Techniques for MapReduce. Proc. VLDB Endow. 4, 7 (2011), 419--429. https://doi.org/10.14778/1988776.1988778
[20]
Pranjal Gupta, Amine Mhedhbi, and Semih Salihoglu. 2021. Columnar Storage and List-based Processing for Graph Database Management Systems. Proc. VLDB Endow. 14, 11 (2021), 2491--2504. http://www.vldb.org/pvldb/vol14/p2491-gupta.pdf
[21]
Muon Ha and Yulia A. Shichkina. 2022. Translating a Distributed Relational Database to a Document Database. Data Sci. Eng. 7, 2 (2022), 136--155. https://doi.org/10.1007/s41019-022-00181--9
[22]
Shuai Han, Mingxia Liu, and Jian-Zhong Li. 2022. Efficient Partitioning Method for Optimizing the Compression on Array Data. J. Comput. Sci. Technol. 37, 5 (2022), 1049--1067. https://doi.org/10.1007/s11390-022--2371--7
[23]
Donghe Kang, Ruochen Jiang, and Spyros Blanas. 2021. Jigsaw: A Data Storage and Query Processing Engine for Irregular Table Partitioning. In SIGMOD '21: International Conference on Management of Data, Virtual Event, China, June 20--25, 2021. 898--911. https://doi.org/10.1145/3448016.3457547
[24]
Per-Åke Larson, Eric N. Hanson, and Susan L. Price. 2012. Columnar Storage in SQL Server 2012. IEEE Data Eng. Bull. 35, 1 (2012), 15--20. http://sites.computer.org/debull/A12mar/apollo.pdf
[25]
Yinan Mei, Shaoxu Song, Yunsu Lee, Jungho Park, Soo-Hyung Kim, and Sungmin Yi. 2020. Representing Temporal Attributes for Schema Matching. In KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23--27, 2020. ACM, 709--719. https://doi.org/10.1145/3394486.3403115
[26]
Jeffrey C. Mogul, Fred Douglis, Anja Feldmann, and Balachander Krishnamurthy. 1997. Potential Benefits of Delta Encoding and Data Compression for HTTP. In Proceedings of the ACM SIGCOMM 1997 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, September 14--18, 1997, Cannes, France. ACM, 181--194. https://doi.org/10.1145/263105.263162
[27]
M. Tamer Özsu. 2016. A survey of RDF data management systems. Frontiers Comput. Sci. 10, 3 (2016), 418--432. https://doi.org/10.1007/s11704-016--5554-y
[28]
Tuomas Pelkonen, Scott Franklin, Paul Cavallaro, Qi Huang, Justin Meza, Justin Teller, and Kaushik Veeraraghavan. 2015. Gorilla: A Fast, Scalable, In-Memory Time Series Database. Proc. VLDB Endow. 8, 12 (2015), 1816--1827. https://doi.org/10.14778/2824032.2824078
[29]
Horst Samulowitz, Chandra Reddy, Ashish Sabharwal, and Meinolf Sellmann. 2013. Snappy: A Simple Algorithm Portfolio. In Theory and Applications of Satisfiability Testing - SAT 2013 - 16th International Conference, Helsinki, Finland, July 8--12, 2013. Proceedings (Lecture Notes in Computer Science, Vol. 7962), Matti Järvisalo and Allen Van Gelder (Eds.). Springer, 422--428. https://doi.org/10.1007/978--3--642--39071--5_33
[30]
Dilbag Singh, Yavuz Selim Taspinar, Ramazan Kursun, Ilkay Cinar, Murat Koklu, Ilker Ali Ozkan, and Heung-No Lee. 2022. Classification and Analysis of Pistachio Species with Pre-Trained Deep Learning Models. Electronics 11, 7 (2022), 981.
[31]
Shaoxu Song, Yue Cao, and Jianmin Wang. 2016. Cleaning Timestamps with Temporal Constraints. Proc. VLDB Endow. 9, 10 (2016), 708--719. https://doi.org/10.14778/2977797.2977798
[32]
Shaoxu Song and Lei Chen. 2010. Efficient set-correlation operator inside databases. In Proceedings of the 19th ACM Conference on Information and Knowledge Management, CIKM 2010, Toronto, Ontario, Canada, October 26--30, 2010. ACM, 139--148. https://doi.org/10.1145/1871437.1871459
[33]
Chen Wang, Xiangdong Huang, Jialin Qiao, Tian Jiang, Lei Rui, Jinrui Zhang, Rong Kang, Julian Feinauer, Kevin Mcgrail, Peng Wang, Diaohan Luo, Jun Yuan, Jianmin Wang, and Jiaguang Sun. 2020. Apache IoTDB: Time-series database for Internet of Things. Proc. VLDB Endow. 13, 12 (2020), 2901--2904. https://doi.org/10.14778/3415478.3415504
[34]
Jiannan Wang, Guoliang Li, and Jianhua Feng. 2012. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20--24, 2012. ACM, 85--96. https://doi.org/10.1145/2213836.2213847
[35]
Zhiqi Wang, Jin Xue, and Zili Shao. 2021. Heracles: An Efficient Storage Model And Data Flushing For Performance Monitoring Timeseries. Proc. VLDB Endow. 14, 6 (2021), 1080--1092. https://doi.org/10.14778/3447689.3447710
[36]
Jiaye Wu, Peng Wang, Ningting Pan, Chen Wang, Wei Wang, and Jianmin Wang. 2019. KV-Match: A Subsequence Matching Approach Supporting Normalization and Time Warping. In 35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, China, April 8--11, 2019. IEEE, 866--877. https://doi.org/10.1109/ICDE.2019.00082

Cited By

View all
  • (2024)On Reducing Space Amplification with Multi-Column Compaction in Apache IoTDBProceedings of the VLDB Endowment10.14778/3681954.368197717:11(2974-2986)Online publication date: 30-Aug-2024
  • (2024)Time series data encoding in Apache IoTDB: comparative analysis and recommendationThe VLDB Journal10.1007/s00778-024-00840-533:3(727-752)Online publication date: 12-Feb-2024
  • (2024)Towards Real-Time Data Ingestion for Industrial Internet of ThingsDatabase Systems for Advanced Applications10.1007/978-981-97-5575-2_38(457-461)Online publication date: 2-Jul-2024
  • Show More Cited By

Index Terms

  1. Grouping Time Series for Efficient Columnar Storage

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image Proceedings of the ACM on Management of Data
    Proceedings of the ACM on Management of Data  Volume 1, Issue 1
    PACMMOD
    May 2023
    2807 pages
    EISSN:2836-6573
    DOI:10.1145/3603164
    Issue’s Table of Contents
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 30 May 2023
    Published in PACMMOD Volume 1, Issue 1

    Author Tags

    1. columnar storage
    2. grouping
    3. time series

    Qualifiers

    • Research-article

    Funding Sources

    • National Key Research and Development Plan (2021YFB3300500)

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)715
    • Downloads (Last 6 weeks)52
    Reflects downloads up to 10 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)On Reducing Space Amplification with Multi-Column Compaction in Apache IoTDBProceedings of the VLDB Endowment10.14778/3681954.368197717:11(2974-2986)Online publication date: 30-Aug-2024
    • (2024)Time series data encoding in Apache IoTDB: comparative analysis and recommendationThe VLDB Journal10.1007/s00778-024-00840-533:3(727-752)Online publication date: 12-Feb-2024
    • (2024)Towards Real-Time Data Ingestion for Industrial Internet of ThingsDatabase Systems for Advanced Applications10.1007/978-981-97-5575-2_38(457-461)Online publication date: 2-Jul-2024
    • (2023)Apache IoTDB: A Time Series Database for IoT ApplicationsProceedings of the ACM on Management of Data10.1145/35897751:2(1-27)Online publication date: 20-Jun-2023

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Get Access

    Login options

    Full Access

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media