
Write Skew and Zipf Distribution: Evidence and Implications

Published: 08 June 2016

    Abstract

    Understanding workload characteristics is essential to storage systems design and performance optimization. With the emergence of flash memory as a new viable storage medium, the new design concern of flash endurance arises, necessitating a revisit of workload characteristics, in particular, of the write behavior. Inspired by Web caching studies where a Zipf-like access pattern is commonly found, we hypothesize that write count distribution at the block level may also follow Zipf’s Law. To validate this hypothesis, we study 48 block I/O traces collected from a wide variety of real and benchmark applications. Through extensive analysis, we demonstrate that the Zipf-like pattern indeed widely exists in write traffic provided its disguises are removed by statistical processing. This finding implies that write skew in a large class of applications could be analytically expressed and, thus, facilitates design tradeoff explorations adaptive to workload characteristics.
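
The hypothesis above says that when blocks are ranked by write count, the count of the block at rank r falls off roughly as 1/r^α. A minimal sketch of how one might check this on a trace follows. It assumes the trace has already been reduced to a flat list of written block addresses, and it uses an ordinary least-squares fit on the log-log rank/count curve. The function name `zipf_exponent` and the input format are hypothetical, and this simple fit is not the statistical processing the article applies; it only illustrates the rank-frequency idea.

```python
import math
from collections import Counter


def zipf_exponent(written_blocks):
    """Estimate alpha in count(rank) ~ C / rank**alpha.

    written_blocks: iterable of block addresses, one entry per write
    (a hypothetical, pre-parsed form of a block I/O trace).
    Uses least squares on log(rank) vs. log(count); this is an
    illustrative choice, not the method used in the article.
    """
    # Write count per block, sorted from hottest to coldest.
    counts = sorted(Counter(written_blocks).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct blocks to fit a slope")

    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # The fitted slope is -alpha for a Zipf-like curve.
    return -(sxy / sxx)


if __name__ == "__main__":
    # Synthetic trace: the block at rank r is written about 1000 / r times.
    trace = [r for r in range(1, 51) for _ in range(round(1000 / r))]
    print(f"estimated alpha: {zipf_exponent(trace):.3f}")
```

An estimate near 1 corresponds to the classic Zipf shape, while an estimate near 0 indicates writes spread evenly across blocks. A straight-line fit on log-log axes is known to be a weak test for power laws, so a result from this sketch should be read as a first look rather than as evidence.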

    Published In

    ACM Transactions on Storage  Volume 12, Issue 4
    August 2016
    213 pages
    ISSN:1553-3077
    EISSN:1553-3093
    DOI:10.1145/2940403

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 08 June 2016
    Accepted: 01 January 2016
    Revised: 01 October 2015
    Received: 01 November 2014
    Published in TOS Volume 12, Issue 4

    Author Tags

    1. Flash memory
    2. Zipf’s law
    3. workloads
    4. write traffic

    Qualifiers

    • Research-article
    • Research
    • Refereed
