Cache Design of SSD-Based Search Engine Architectures: An Experimental Study

Authors:

Xiaoguang Liu

ACM Transactions on Information Systems (TOIS), Volume 32, Issue 4

Article No.: 21, Pages 1 - 26

https://doi.org/10.1145/2661629

Published: 28 October 2014

Abstract

Caching is an important optimization in search engine architectures. Existing caching techniques for search engines are mostly biased towards reducing random accesses to disk, because random accesses are known to be much more expensive than sequential accesses on traditional magnetic hard disk drives (HDDs). Recently, the solid-state drive (SSD) has emerged as a new kind of secondary storage medium, and some search engines, such as Baidu, have already replaced HDDs entirely with SSDs in their infrastructure. One notable property of an SSD is that its random access latency is comparable to its sequential access latency. Therefore, replacing HDDs with SSDs in a search engine infrastructure may invalidate the cache management policies of existing search engines. In this article, we carry out a series of empirical experiments to study the impact of SSDs on search engine cache management. Based on the results, we give insights to practitioners and researchers on how to adapt the infrastructure and caching policies for SSD-based search engines.

Index Terms

1. Information systems
  1. Information retrieval
    1. Information retrieval query processing


Published In

ACM Transactions on Information Systems Volume 32, Issue 4

October 2014

198 pages

ISSN:1046-8188

EISSN:1558-2868

DOI:10.1145/2684820

Editor:
Maarten de Rijke
University of Amsterdam, The Netherlands

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 October 2014

Received: 01 September 2014

Accepted: 01 August 2014

Revised: 01 June 2014

Published in TOIS Volume 32, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Natural Science Foundation of China
Key Projects in the Tianjin Science&Technology Pillar Program (11ZCKFGX01100)
Research Grants Council, University Grants Committee, Hong Kong


Cited By

Kang S, Kim J, Lee G, Lee J, Seo J, Jung H, Song Y, Park Y (2023). ISP Agent: A Generalized In-storage-processing Workload Offloading Framework by Providing Multiple Optimization Opportunities. ACM Transactions on Architecture and Code Optimization 21:1 (1-24). Online publication date: 14-Nov-2023.
https://dl.acm.org/doi/10.1145/3632951

Yang K, Wang H, Tan Z, Zhang J (2022). NDANN: efficient SSD-based approximate nearest neighbor search through navigation. International Conference on Mechanisms and Robotics (ICMAR 2022) (63). Online publication date: 10-Nov-2022.
https://doi.org/10.1117/12.2652299

Liu X, Pan Y, Li Y, Wang G, Liu X (2022). An NVM SSD-based High Performance Query Processing Framework for Search Engines. IEEE Transactions on Knowledge and Data Engineering (1-1). Online publication date: 2022.
https://doi.org/10.1109/TKDE.2022.3160557

Zeng Y, Huang Y, Liu Z, Liu J (2022). Distributed and Decentralized Edge Caching in 5G Networks Using Non-Volatile Memory Systems. 2022 IEEE 42nd International Conference on Distributed Computing Systems (ICDCS) (425-435). Online publication date: Jul-2022.
https://doi.org/10.1109/ICDCS54860.2022.00048

Wang J, Lin C, Papakonstantinou Y, Swanson S (2021). Evaluating List Intersection on SSDs for Parallel I/O Skipping. 2021 IEEE 37th International Conference on Data Engineering (ICDE) (1823-1828). Online publication date: Apr-2021.
https://doi.org/10.1109/ICDE51399.2021.00161

Zhang R, Sun P, Tong J, Zang R, Qian H, Pan Y, Stones R, Wang G, Liu X, Li Y (2021). Three-level Compact Caching for Search Engines Based on Solid State Drives. 2021 IEEE 23rd Int Conf on High Performance Computing & Communications; 7th Int Conf on Data Science & Systems; 19th Int Conf on Smart City; 7th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys) (16-25). Online publication date: Dec-2021.
https://doi.org/10.1109/HPCC-DSS-SmartCity-DependSys53884.2021.00030

Wu J, Rohatgi S, Reddy Keesara S, Chhay J, Kuo K, Menon A, Parsons S, Urgaonkar B, Giles C (2021). Building an Accessible, Usable, Scalable, and Sustainable Service for Scholarly Big Data. 2021 IEEE International Conference on Big Data (Big Data) (141-152). Online publication date: 15-Dec-2021.
https://doi.org/10.1109/BigData52589.2021.9671612

Kucukyilmaz T (2021). Exploiting temporal changes in query submission behavior for improving the search engine result cache performance. Information Processing & Management 58:3 (102533). Online publication date: May-2021.
https://doi.org/10.1016/j.ipm.2021.102533

Barbalace A, Decky M, Picorel J, Bhatotia P (2020). blockNDP. Proceedings of the 21st International Middleware Conference Industrial Track (8-15). Online publication date: 7-Dec-2020.
https://dl.acm.org/doi/10.1145/3429357.3430519

Liu X, Pan Y, Li Y, Wang G, Liu X, d'Aquin M, Dietze S, Hauff C, Curry E, Cudre Mauroux P (2020). An NVM SSD-Optimized Query Processing Framework. Proceedings of the 29th ACM International Conference on Information & Knowledge Management (935-944). Online publication date: 19-Oct-2020.
https://dl.acm.org/doi/10.1145/3340531.3412010
