research-article
Open access

Comprehensive Characterization of an Open Source Document Search Engine

Published: 29 May 2019 Publication History
    Abstract

    This work performs a thorough characterization and analysis of the open source Lucene search library. The article describes in detail the architecture, functionality, and micro-architectural behavior of the search engine, and investigates prominent research issues in online document search. In particular, we study how intra-server index partitioning affects response time and throughput, explore the potential use of low-power servers for document search, and examine the sources of performance degradation and the causes of tail latencies. Some of our main conclusions are the following: (a) intra-server index partitioning can reduce tail latencies, but with diminishing benefits as incoming query traffic increases; (b) low-power servers, given enough partitioning, can provide the same average and tail response times as conventional high-performance servers; (c) index search is a CPU-intensive, cache-friendly application; and (d) C-states are the main culprits for performance degradation in document search.
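
    To make the idea of intra-server index partitioning concrete, the following is a minimal sketch rather than the article's implementation or Lucene's API. It assumes a toy in-memory inverted index, round-robin assignment of documents to partitions, and plain term-frequency scoring; the function names (build_partitions, search_partition, search) are hypothetical. Each partition is searched by its own worker, and the per-partition top-k lists are merged into a global top-k.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor
import heapq


def build_partitions(docs, num_partitions):
    """Assign documents round-robin and build one inverted index per partition."""
    partitions = [defaultdict(Counter) for _ in range(num_partitions)]
    for doc_id, text in docs.items():
        index = partitions[doc_id % num_partitions]
        for term in text.lower().split():
            index[term][doc_id] += 1  # term frequency of `term` in `doc_id`
    return partitions


def rank_key(hit):
    """Order hits by score, breaking ties in favour of the smaller document id."""
    doc_id, score = hit
    return (score, -doc_id)


def search_partition(index, query_terms, k):
    """Score one partition by summed term frequency (an assumed toy scoring)."""
    scores = Counter()
    for term in query_terms:
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return heapq.nlargest(k, scores.items(), key=rank_key)


def search(partitions, query, k=3):
    """Search all partitions concurrently, then merge the local top-k lists."""
    query_terms = query.lower().split()
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial = list(pool.map(
            lambda index: search_partition(index, query_terms, k), partitions))
    merged = [hit for hits in partial for hit in hits]
    return heapq.nlargest(k, merged, key=rank_key)


docs = {
    0: "lucene search library",
    1: "search engine search index",
    2: "low power servers",
    3: "index search partitioning",
}
single = search(build_partitions(docs, 1), "search index")
split = search(build_partitions(docs, 2), "search index")
print(single == split, split)
```

    Because every partition returns its own top-k, the merged result matches an unpartitioned search while each worker scans a smaller index. In CPython the global interpreter lock prevents threads from speeding up this CPU-bound loop, so the sketch shows only the structure of partitioned search, not the latency effects the article measures.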


    Cited By

    • (2023) Analyzing and Improving the Scalability of In-Memory Indices for Managed Search Engines. Proceedings of the 2023 ACM SIGPLAN International Symposium on Memory Management, 15-29. DOI: 10.1145/3591195.3595272. Online publication date: 6-Jun-2023
    • (2022) Web Page Ranking Using Web Mining Techniques. Mobile Information Systems 2022. DOI: 10.1155/2022/7519573. Online publication date: 1-Jan-2022
    • (2020) System Design of Cloud Search Engine Based on Rich Text Content. Mobile Networks and Applications. DOI: 10.1007/s11036-020-01676-3. Online publication date: 31-Oct-2020


    Published In

    ACM Transactions on Architecture and Code Optimization  Volume 16, Issue 2
    June 2019
    317 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/3325131
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 29 May 2019
    Accepted: 01 March 2019
    Revised: 01 March 2019
    Received: 01 May 2018
    Published in TACO Volume 16, Issue 2


    Author Tags

    1. Document search
    2. characterization
    3. evaluation
    4. experimentation
    5. index partitioning
    6. measurement
    7. parallel index search
    8. parallelism
    9. performance
    10. real hardware

    Qualifiers

    • Research-article
    • Research
    • Refereed
