DOI:10.1145/2433396.2433486
research-article

Maguro, a system for indexing and searching over very large text collections

Published: 04 February 2013

    Abstract

    Maguro is a system for efficiently searching very large collections of text content, of up to 1 trillion documents, at low cost. Search engines cover a wide range of content, from very dynamic content that is highly augmented with metadata to the tail content of the web. This long-tail distribution of content calls for different trade-offs in the design space to achieve good efficiency across the entire index range. Maguro is designed for the long tail of content, which is less dynamic and carries less metadata, and it targets very good cost efficiency. Maguro is part of the serving stack in Bing and allows us to scale the index significantly better.

    Published In

    WSDM '13: Proceedings of the sixth ACM international conference on Web search and data mining
    February 2013
    816 pages
    ISBN:9781450318693
    DOI:10.1145/2433396

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. index serving
    2. scalability

    Qualifiers

    • Research-article

    Conference

    WSDM 2013

    Acceptance Rates

    Overall Acceptance Rate 498 of 2,863 submissions, 17%

    Article Metrics

    • Downloads (Last 12 months): 7
    • Downloads (Last 6 weeks): 2
    Reflects downloads up to 10 Aug 2024

    Cited By

    • (2022) Scalability Challenges in Web Search Engines. Online publication date: 10-Mar-2022
    • (2021) The cosmos big data platform at Microsoft. Proceedings of the VLDB Endowment 14:12 (3148-3161). DOI:10.14778/3476311.3476390. Online publication date: 28-Oct-2021
    • (2020) Pipelined Query Processing Using Non-volatile Memory SSDs. Web and Big Data (457-472). DOI:10.1007/978-3-030-60290-1_35. Online publication date: 12-Aug-2020
    • (2019) A Hybrid BitFunnel and Partitioned Elias-Fano Inverted Index. The World Wide Web Conference (1153-1163). DOI:10.1145/3308558.3313553. Online publication date: 13-May-2019
    • (2018) List intersection for web search. Proceedings of the VLDB Endowment 12:1 (1-13). DOI:10.14778/3275536.3275537. Online publication date: 1-Sep-2018
    • (2018) Crawling, indexing, and retrieving moments in videogames. Proceedings of the 13th International Conference on the Foundations of Digital Games (1-10). DOI:10.1145/3235765.3235786. Online publication date: 7-Aug-2018
    • (2017) Small-Term Distribution for Disk-Based Search. Proceedings of the 2017 ACM Symposium on Document Engineering (49-58). DOI:10.1145/3103010.3103022. Online publication date: 31-Aug-2017
    • (2017) BitFunnel. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (605-614). DOI:10.1145/3077136.3080789. Online publication date: 7-Aug-2017
    • (2016) Prediction and Predictability for Search Query Acceleration. ACM Transactions on the Web 10:3 (1-28). DOI:10.1145/2943784. Online publication date: 16-Aug-2016
    • (2016) Fast First-Phase Candidate Generation for Cascading Rankers. Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval (295-304). DOI:10.1145/2911451.2911515. Online publication date: 7-Jul-2016
