Unlocking the Power of Numbers: Log Compression via Numeric Token Parsing

Authors:

Pinjia He

ASE '24: Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering

Pages 919 - 930

https://doi.org/10.1145/3691620.3695474

Published: 27 October 2024

Abstract

Parser-based log compressors have been widely explored in recent years because the explosive growth of log volumes makes the compression performance of general-purpose compressors unsatisfactory. These parser-based compressors preprocess logs by grouping them according to the parsing result and then feed the preprocessed files into a general-purpose compressor. However, parser-based compressors have limitations. First, the goals of parsing and compression are misaligned, so the inherent characteristics of logs are not fully utilized. Second, their performance depends on the sample logs and is therefore unstable. Third, they often incur long processing times. To address these limitations, we propose Denum, a simple, general log compressor with a high compression ratio and speed. The core insight is that a majority of the tokens in logs are numeric tokens (i.e., pure numbers, tokens with only numbers and special characters, and numeric variables), and compressing them effectively is critical for log compression. Specifically, Denum contains a Numeric Token Parsing module, which extracts all numeric tokens and applies tailored processing methods (e.g., storing the differences of incremental numbers such as timestamps), and a String Processing module, which processes the remaining log content without numbers. The processed files of the two modules are then fed to a general-purpose compressor, which outputs the final compression results. Denum has been evaluated on 16 log datasets, where it achieves an 8.7% to 434.7% higher average compression ratio and a 2.6× to 37.7× faster average compression speed (i.e., 26.2 MB/s) compared to the baselines. Moreover, integrating Denum's Numeric Token Parsing module into existing log compressors provides an 11.8% improvement in their average compression ratio and a 37% faster average compression speed.
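
The idea of separating numeric tokens from the rest of a log line and storing differences of incremental numbers can be illustrated with a minimal Python sketch. This is not Denum's implementation: the function names, the `<*>` placeholder, and the simplification of a numeric token to a plain run of digits are all assumptions made for illustration.

```python
import re

# Assumption: a "numeric token" is simplified here to a run of digits.
# The abstract's definition is broader (numbers mixed with special
# characters, numeric variables); this sketch does not cover those cases.
NUMERIC = re.compile(r"\d+")

def split_line(line):
    """Split a log line into a string template and its digit runs.

    Digit runs are kept as strings so that leading zeros survive.
    """
    numbers = NUMERIC.findall(line)
    template = NUMERIC.sub("<*>", line)  # hypothetical placeholder
    return template, numbers

def delta_encode(values):
    """Keep the first value, then store each difference to the previous value."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas

def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    values = []
    total = 0
    for d in deltas:
        total += d
        values.append(total)
    return values
```

For slowly increasing values such as timestamps, the differences are small and repetitive, which tends to suit a general-purpose compressor better than the raw values do.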

Index Terms

1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Availability
2. Software and its engineering
  1. Software notations and tools
    1. Software maintenance tools

Published In

ASE '24: Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering

October 2024

2587 pages

ISBN: 9798400712487

DOI: 10.1145/3691620

General Chair: Vladimir Filkov

Program Co-chairs: Baishakhi Ray (Columbia University, USA; AWS AI Lab), Minghui Zhou (Peking University, China)

Copyright © 2024 Copyright is held by the owner/author(s). Publication rights licensed to ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

Artifacts Available / v1.1

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China
Guangdong Basic and Applied Basic Research Foundation
Shenzhen Research Institute of Big Data Innovation Fund

Conference

ASE '24

ASE '24: 39th IEEE/ACM International Conference on Automated Software Engineering

October 27 - November 1, 2024

Sacramento, CA, USA

Acceptance Rates

Overall Acceptance Rate 82 of 337 submissions, 24%
