Fast file existence checking in archiving systems

Published: 27 June 2011

Abstract

This article presents a new Fast Hash-based File Existence Checking (FHFEC) method for archiving systems. During the archiving process, there are many submissions which are actually unchanged files that do not need to be re-archived. In this system, instead of comparing the entire files, only digests of the files are compared. Strong cryptographic hash functions with a low probability of collision can be used as digests. We propose a fast algorithm to check if a certain hash, that is, a corresponding file, is already stored in the system. The algorithm is based on dividing the whole domain of hashes into equally sized regions, and on the existence of a pointer array, which has exactly one pointer for each region. Each pointer points to the location of the first stored hash from the corresponding region and has a null value if no hash from that region exists. The entire structure can be stored in random access memory or, alternatively, on a dedicated hard disk. A statistical performance analysis has been performed that shows that in certain cases FHFEC performs nearly optimally. Extensive simulations have confirmed these analytical results. The performance of FHFEC has been compared to the performance of a binary search (BIS) and B+tree, which are commonly used in file systems and databases for table indices. The results show that FHFEC significantly outperforms both of them.
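The lookup structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes an in-memory sorted list in place of the stored hashes, SHA-256 digests truncated to 32 bits, and 256 regions. The names `RegionIndex`, `digest`, `region_of`, `HASH_BITS`, and `REGION_BITS` are hypothetical and chosen only for this example.

```python
import hashlib

HASH_BITS = 32    # assumption: digests truncated to 32 bits to keep the demo small
REGION_BITS = 8   # assumption: the hash domain is split into 2**8 equal regions


def digest(data: bytes) -> int:
    """Return the top HASH_BITS bits of the SHA-256 digest as an integer."""
    full = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return full >> (256 - HASH_BITS)


def region_of(h: int) -> int:
    """Map a hash to the index of the equally sized region containing it."""
    return h >> (HASH_BITS - REGION_BITS)


class RegionIndex:
    """Sorted hash store plus one pointer per region (hypothetical sketch)."""

    def __init__(self, hashes):
        # Stored hashes are kept sorted, so each region is a contiguous run.
        self.hashes = sorted(set(hashes))
        # One pointer per region: position of the first stored hash from
        # that region, or None if no hash from the region is stored.
        self.pointers = [None] * (1 << REGION_BITS)
        for pos, h in enumerate(self.hashes):
            r = region_of(h)
            if self.pointers[r] is None:
                self.pointers[r] = pos

    def contains(self, h: int) -> bool:
        """Check whether hash h is stored, scanning only its own region."""
        r = region_of(h)
        pos = self.pointers[r]
        if pos is None:
            return False  # empty region: answered with one array access
        while pos < len(self.hashes) and region_of(self.hashes[pos]) == r:
            if self.hashes[pos] == h:
                return True
            if self.hashes[pos] > h:
                return False  # sorted order: h cannot appear later
            pos += 1
        return False


if __name__ == "__main__":
    archived = [b"report.txt contents", b"photo.jpg contents"]
    index = RegionIndex(digest(d) for d in archived)
    # A resubmitted, unchanged file produces the same digest and is found.
    print(index.contains(digest(b"report.txt contents")))
```

With many regions relative to the number of stored hashes, most regions hold zero or one entry, so a lookup costs one pointer read plus a very short scan. The sketch rebuilds the index from scratch and does not cover insertion or on-disk layout.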

    Published In

    ACM Transactions on Storage  Volume 7, Issue 1
    June 2011
    73 pages
    ISSN:1553-3077
    EISSN:1553-3093
    DOI:10.1145/1970343
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 June 2011
    Accepted: 01 July 2010
    Revised: 01 April 2010
    Received: 01 December 2009
    Published in TOS Volume 7, Issue 1

    Author Tags

    1. File systems management
    2. archiving
    3. files backup/recovery
    4. files sorting/searching
    5. hash-table
    6. performance evaluation

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Cited By

    • (2019) An efficient integer coding index algorithm for multi-scale time information management. Data & Knowledge Engineering. DOI: 10.1016/j.datak.2019.01.003. Online publication date: Feb-2019
    • (2014) Efficiently Representing Membership for Variable Large Data Sets. IEEE Transactions on Parallel and Distributed Systems 25:4 (960-970). DOI: 10.1109/TPDS.2013.66. Online publication date: 1-Apr-2014
    • (2014) Identifying Forensically Uninteresting Files Using a Large Corpus. Digital Forensics and Cyber Crime (86-101). DOI: 10.1007/978-3-319-14289-0_7. Online publication date: 23-Dec-2014
    • (2013) An Undirected Graph Traversal Based Grouping Prediction Method for Data De-duplication. Proceedings of the 2013 14th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (3-8). DOI: 10.1109/SNPD.2013.34. Online publication date: 1-Jul-2013
