Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/3691620.3695474acmconferencesArticle/Chapter ViewAbstractPublication PagesaseConference Proceedingsconference-collections
research-article

Unlocking the Power of Numbers: Log Compression via Numeric Token Parsing

Published: 27 October 2024 Publication History

Abstract

Parser-based log compressors have been widely explored in recent years because the explosive growth of log volumes makes the compression performance of general-purpose compressors unsatisfactory. These parser-based compressors preprocess logs by grouping the logs based on the parsing result and then feed the preprocessed files into a general-purpose compressor. However, parser-based compressors have their limitations. First, the goals of parsing and compression are misaligned, so the inherent characteristics of logs were not fully utilized. In addition, the performance of parser-based compressors depends on the sample logs and thus it is very unstable. Moreover, parser-based compressors often incur a long processing time. To address these limitations, we propose Denum, a simple, general log compressor with high compression ratio and speed. The core insight is that a majority of the tokens in logs are numeric tokens (i.e. pure numbers, tokens with only numbers and special characters, and numeric variables) and effective compression of them is critical for log compression. Specifically, Denum contains a Numeric Token Parsing module, which extracts all numeric tokens and applies tailored processing methods (e.g. store the differences of incremental numbers like timestamps), and a String Processing module, which processes the remaining log content without numbers. The processed files of the two modules are then fed as input to a general-purpose compressor and it outputs the final compression results. Denum has been evaluated on 16 log datasets and it achieves an 8.7% -- 434.7% higher average compression ratio and 2.6× -- 37.7× faster average compression speed (i.e. 26.2 MB/S) compared to the baselines. Moreover, integrating Denum's Numeric Token Parsing module into existing log compressors can provide a 11.8% improvement in their average compression ratio and achieve 37% faster average compression speed.

References

[1]
2024-6-4. https://7-zip.org/sdk.html.
[2]
2024-6-4. https://cloud.google.com/stackdriver/pricing?hl=zh-cn.
[3]
2024-6-4. https://github.com/gaiusyu/Denum.
[4]
2024-6-4. https://git.savannah.gnu.org/cgit/gzip.git.
[5]
2024-6-4. https://ppmd-cffi.readthedocs.io/en/latest/index.html.
[6]
2024-6-4. https://sourceware.org/bzip2/.
[7]
Donald Adjeroh, Timothy Bell, and Amar Mukherjee. 2008. The Burrows-Wheeler Transform:: Data Compression, Suffix Arrays, and Pattern Matching. Springer Science & Business Media.
[8]
Boyuan Chen. 2020. Improving the Logging Practices in DevOps. (2020).
[9]
Boyuan Chen and Zhen Ming Jiang. 2021. A survey of software log instrumentation. ACM Computing Surveys (CSUR) 54, 4 (2021), 1--34.
[10]
Michael Chow, David Meisner, Jason Flinn, Daniel Peek, and Thomas F Wenisch. 2014. The mystery machine: End-to-end performance analysis of large-scale internet services. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 217--231.
[11]
Robert Christensen and Feifei Li. 2013. Adaptive log compression for massive log data. In SIGMOD Conference. 1283--1284.
[12]
John Cleary and Ian Witten. 1984. Data compression using adaptive coding and partial string matching. IEEE transactions on Communications 32, 4 (1984), 396--402.
[13]
Hetong Dai, Heng Li, Che-Shao Chen, Weiyi Shang, and Tse-Hsun Chen. 2020. Logram: Efficient Log Parsing Using n n-Gram Dictionaries. IEEE Transactions on Software Engineering 48, 3 (2020), 879--892.
[14]
Peter Deutsch. 1996. DEFLATE compressed data format specification version 1.3. Technical Report.
[15]
Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. 2017. Deeplog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the 2017 ACM SIGSAC conference on computer and communications security. 1285--1298.
[16]
Bo Feng, Chentao Wu, and Jie Li. 2016. MLC: An efficient multi-level log compression method for cloud backup systems. In 2016 IEEE Trustcom/BigDataSE/ISPA. IEEE, 1358--1365.
[17]
Qiang Fu, Jian-Guang Lou, Yi Wang, and Jiang Li. 2009. Execution anomaly detection in distributed systems through unstructured log analysis. In 2009 ninth IEEE international conference on data mining. IEEE, 149--158.
[18]
Ying Fu, Meng Yan, Jian Xu, Jianguo Li, Zhongxin Liu, Xiaohong Zhang, and Dan Yang. 2022. Investigating and improving log parsing in practice. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1566--1577.
[19]
Sanjay Ghemawat and Jeff Dean. 2014. Leveldb is a fast key-value storage library written at google that provides an ordered mapping from string keys to string values.
[20]
Sina Gholamian and Paul AS Ward. 2021. A comprehensive survey of logging in software: From logging statements automation to log mining and analysis. arXiv preprint arXiv:2110.12489 (2021).
[21]
Solomon Golomb. 1966. Run-length encodings (corresp.). IEEE transactions on information theory 12, 3 (1966), 399--401.
[22]
Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R Lyu. 2017. Drain: An online log parsing approach with fixed depth tree. In 2017 IEEE international conference on web services (ICWS). IEEE, 33--40.
[23]
Shilin He, Pinjia He, Zhuangbin Chen, Tianyi Yang, Yuxin Su, and Michael R Lyu. 2021. A survey on automated log analysis for reliability engineering. ACM computing surveys (CSUR) 54, 6 (2021), 1--37.
[24]
Shilin He, Jieming Zhu, Pinjia He, and Michael R Lyu. 2020. Loghub: A large collection of system log datasets towards automated log analytics. arXiv preprint arXiv:2008.06448 (2020).
[25]
David A Huffman. 1952. A method for the construction of minimum-redundancy codes. Proceedings of the IRE 40, 9 (1952), 1098--1101.
[26]
Yintong Huo, Yuxin Su, Cheryl Lee, and Michael R Lyu. 2023. Semparser: A semantic parser for log analytics. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 881--893.
[27]
Zhihan Jiang, Jinyang Liu, Zhuangbin Chen, Yichen Li, Junjie Huang, Yintong Huo, Pinjia He, Jiazhen Gu, and Michael R Lyu. 2023. Llmparser: A llm-based log parsing framework. arXiv preprint arXiv:2310.01796 (2023).
[28]
Van-Hoang Le and Hongyu Zhang. 2023. An Evaluation of Log Parsing with ChatGPT. arXiv preprint arXiv:2306.01590 (2023).
[29]
Van-Hoang Le and Hongyu Zhang. 2023. Log Parsing with Prompt-based Few-shot Learning. arXiv preprint arXiv:2302.07435 (2023).
[30]
Xiaoyun Li, Hongyu Zhang, Van-Hoang Le, and Pengfei Chen. 2024. Logshrink: Effective log compression by leveraging commonality and variability of log data. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1--12.
[31]
Hao Lin, Jingyu Zhou, Bin Yao, Minyi Guo, and Jie Li. 2015. Cowic: A columnwise independent compression for log stream analysis. In 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. IEEE, 21--30.
[32]
Jinyang Liu, Jieming Zhu, Shilin He, Pinjia He, Zibin Zheng, and Michael R Lyu. 2019. Logzip: Extracting hidden structures via iterative clustering for log compression. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 863--873.
[33]
Yudong Liu, Xu Zhang, Shilin He, Hongyu Zhang, Liqun Li, Yu Kang, Yong Xu, Minghua Ma, Qingwei Lin, Yingnong Dang, et al. 2022. Uniparser: A unified log parser for heterogeneous log data. In Proceedings of the ACM Web Conference 2022. 1893--1901.
[34]
Siyang Lu, BingBing Rao, Xiang Wei, Byungchul Tak, Long Wang, and Liqiang Wang. 2017. Log-based abnormal task detection and root cause analysis for spark. In 2017 IEEE International Conference on Web Services (ICWS). IEEE, 389--396.
[35]
Giovanni Manzini. 2001. An analysis of the Burrows---Wheeler transform. Journal of the ACM (JACM) 48, 3 (2001), 407--430.
[36]
Sasho Nedelkoski, Jasmin Bogatinovski, Alexander Acker, Jorge Cardoso, and Odej Kao. 2021. Self-supervised log parsing. In Machine Learning and Knowledge Discovery in Databases: Applied Data Science Track: European Conference, ECML PKDD 2020, Ghent, Belgium, September 14--18, 2020, Proceedings, Part IV. Springer, 122--138.
[37]
Jorma Rissanen and Glen G Langdon. 1979. Arithmetic coding. IBM Journal of research and development 23, 2 (1979), 149--162.
[38]
Kirk Rodrigues, Yu Luo, and Ding Yuan. 2021. {CLP}: Efficient and Scalable Search on Compressed Text Logs. In 15th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 21). 183--198.
[39]
Keiichi Shima. 2016. Length matters: Clustering system log messages using length of words. arXiv preprint arXiv:1611.03213 (2016).
[40]
Mark Slee, Aditya Agarwal, and Marc Kwiatkowski. 2007. Thrift: Scalable cross-language services implementation. Facebook white paper 5, 8 (2007), 127.
[41]
Xuheng Wang, Xu Zhang, Liqun Li, Shilin He, Hongyu Zhang, Yudong Liu, Lingling Zheng, Yu Kang, Qingwei Lin, Yingnong Dang, et al. 2022. SPINE: a scalable log parser with feedback guidance. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1198--1208.
[42]
Junyu Wei, Guangyan Zhang, Junchao Chen, Yang Wang, Weimin Zheng, Tingtao Sun, Jiesheng Wu, and Jiangwei Jiang. 2023. LogGrep: Fast and Cheap Cloud Log Storage by Exploiting both Static and Runtime Patterns. In Proceedings of the Eighteenth European Conference on Computer Systems. 452--468.
[43]
Junyu Wei, Guangyan Zhang, Yang Wang, Zhiwei Liu, Zhanyang Zhu, Junchao Chen, Tingtao Sun, and Qi Zhou. 2021. On the Feasibility of Parser-based Log Compression in {Large-Scale} Cloud Systems. In 19th USENIX Conference on File and Storage Technologies (FAST 21). 249--262.
[44]
Junjielong Xu, Ruichun Yang, Yintong Huo, Chengyu Zhang, and Pinjia He. 2023. Prompting for Automatic Log Template Extraction. arXiv preprint arXiv:2307.09950 (2023).
[45]
Kundi Yao, Mohammed Sayagh, Weiyi Shang, and Ahmed E Hassan. 2021. Improving state-of-the-art compression techniques for log management tools. IEEE Transactions on Software Engineering 48, 8 (2021), 2748--2760.
[46]
Siyu Yu, Pinjia He, Ningjiang Chen, and Yifan Wu. 2023. Brain: Log Parsing with Bidirectional Parallel Tree. IEEE Transactions on Services Computing (2023).
[47]
Siyu Yu, Yifan Wu, Zhijing Li, Pinjia He, Ningjiang Chen, and Changjian Liu. 2023. Log Parsing with Generalization Ability under New Log Types. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 425--437.
[48]
Jacob Ziv and Abraham Lempel. 1977. A universal algorithm for sequential data compression. IEEE Transactions on information theory 23, 3 (1977), 337--343.
[49]
Jacob Ziv and Abraham Lempel. 1978. Compression of individual sequences via variable-rate coding. IEEE transactions on Information Theory 24, 5 (1978), 530--536.

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
ASE '24: Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering
October 2024
2587 pages
ISBN:9798400712487
DOI:10.1145/3691620
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 October 2024

Check for updates

Badges

Author Tags

  1. data compression
  2. log compression
  3. log analysis

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
  • Guangdong Basic and Applied Basic Research Foundation
  • Shenzhen Research Institute of Big Data Innovation Fund

Conference

ASE '24
Sponsor:

Acceptance Rates

Overall Acceptance Rate 82 of 337 submissions, 24%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 74
    Total Downloads
  • Downloads (Last 12 months)74
  • Downloads (Last 6 weeks)17
Reflects downloads up to 11 Feb 2025

Other Metrics

Citations

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media