
Fast candidate generation for real-time tweet search with bloom filter chains

Published: 05 August 2013

Abstract

The rise of social media and other forms of user-generated content has created the demand for real-time search: against a high-velocity stream of incoming documents, users desire a list of relevant results at the time the query is issued. In the context of real-time search on tweets, this work explores candidate generation in a two-stage retrieval architecture where an initial list of results is processed by a second-stage rescorer to produce the final output. We introduce Bloom filter chains, a novel extension of Bloom filters that can dynamically expand to efficiently represent an arbitrarily long and growing list of monotonically increasing integers with a constant false positive rate. Using a collection of Bloom filter chains, a novel approximate candidate generation algorithm called BWand is able to perform both conjunctive and disjunctive retrieval. Experiments show that our algorithm is many times faster than competitive baselines and that this increased performance does not require sacrificing end-to-end effectiveness. Our results empirically characterize the trade-off space defined by output quality, query evaluation speed, and memory footprint for this particular search architecture.
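
The abstract does not spell out how a Bloom filter chain is laid out, so the following is only a minimal Python sketch of one way such a structure could work, not the authors' implementation. It assumes a chain of fixed-capacity Bloom filters, each covering a contiguous range of ids; a new filter is appended once the current one is full, so each filter holds at most `capacity` ids and its false positive rate stays near the target no matter how long the list grows. The names `BloomFilter`, `BloomFilterChain`, and `conjunctive_candidates`, and the parameters `capacity` and `fp_rate`, are hypothetical.

```python
import bisect
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter over integers (illustrative only)."""

    def __init__(self, capacity, fp_rate):
        # Standard sizing: m = -n ln p / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(fp_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, value):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(str(value).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, value):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))


class BloomFilterChain:
    """Growing sequence of Bloom filters for strictly increasing ids (assumed layout)."""

    def __init__(self, capacity=64, fp_rate=0.01):
        self.capacity = capacity
        self.fp_rate = fp_rate
        self.filters = []      # one BloomFilter per contiguous id range
        self.first_ids = []    # smallest id stored in each filter
        self.count_in_last = 0
        self.last_id = None

    def append(self, doc_id):
        if self.last_id is not None and doc_id <= self.last_id:
            raise ValueError("ids must be strictly increasing")
        # Start a new filter when the chain is empty or the last filter is full.
        if not self.filters or self.count_in_last == self.capacity:
            self.filters.append(BloomFilter(self.capacity, self.fp_rate))
            self.first_ids.append(doc_id)
            self.count_in_last = 0
        self.filters[-1].add(doc_id)
        self.count_in_last += 1
        self.last_id = doc_id

    def __contains__(self, doc_id):
        # Binary-search for the only filter whose id range can hold doc_id.
        idx = bisect.bisect_right(self.first_ids, doc_id) - 1
        if idx < 0 or doc_id > self.last_id:
            return False
        return doc_id in self.filters[idx]


def conjunctive_candidates(postings, chains):
    """Approximate AND: ids from an explicit list that every chain may contain."""
    return [d for d in postings if all(d in chain for chain in chains)]
```

Because a Bloom filter never produces false negatives, the output of `conjunctive_candidates` is a superset of the exact intersection; in a two-stage architecture, the second-stage rescorer would presumably absorb the occasional false positive.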

    Published In

    ACM Transactions on Information Systems  Volume 31, Issue 3
    July 2013
    202 pages
    ISSN:1046-8188
    EISSN:1558-2868
    DOI:10.1145/2493175
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 05 August 2013
    Accepted: 01 March 2013
    Revised: 01 January 2013
    Received: 01 August 2012
    Published in TOIS Volume 31, Issue 3

    Author Tags

    1. scalability
    2. bloom filters
    3. efficiency
    4. top-k retrieval
    5. tweet search

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Article Metrics

    • Downloads (last 12 months): 9
    • Downloads (last 6 weeks): 0
    Reflects downloads up to 04 Oct 2024.

    Cited By

    • (2024) Efficient Approximate Maximum Inner Product Search Over Sparse Vectors. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (3961-3974). DOI: 10.1109/ICDE60146.2024.00303. Online publication date: 13-May-2024
    • (2023) An Approximate Algorithm for Maximum Inner Product Search over Streaming Sparse Vectors. ACM Transactions on Information Systems 42:2 (1-43). DOI: 10.1145/3609797. Online publication date: 8-Nov-2023
    • (2022) ReNeuIR: Reaching Efficiency in Neural Information Retrieval. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (3462-3465). DOI: 10.1145/3477495.3531704. Online publication date: 6-Jul-2022
    • (2021) Exploiting Intel optane persistent memory for full text search. Proceedings of the 2021 ACM SIGPLAN International Symposium on Memory Management (80-93). DOI: 10.1145/3459898.3463906. Online publication date: 22-Jun-2021
    • (2020) Read as needed. Proceedings of the 18th USENIX Conference on File and Storage Technologies (59-74). DOI: 10.5555/3386691.3386698. Online publication date: 24-Feb-2020
    • (2019) A content search method for security topics in microblog based on deep reinforcement learning. World Wide Web. DOI: 10.1007/s11280-019-00697-7. Online publication date: 28-Jul-2019
    • (2018) Content-Aware Partial Compression for Textual Big Data Analysis in Hadoop. IEEE Transactions on Big Data 4:4 (459-472). DOI: 10.1109/TBDATA.2017.2721431. Online publication date: 1-Dec-2018
    • (2018) Augmented keyword search on spatial entity databases. The VLDB Journal — The International Journal on Very Large Data Bases 27:2 (225-244). DOI: 10.1007/s00778-018-0497-6. Online publication date: 1-Apr-2018
    • (2017) Processing Long Queries Against Short Text. ACM Transactions on Information Systems 35:3 (1-27). DOI: 10.1145/3052772. Online publication date: 12-May-2017
    • (2017) Compact Indexing and Judicious Searching for Billion-Scale Microblog Retrieval. ACM Transactions on Information Systems 35:3 (1-24). DOI: 10.1145/3052771. Online publication date: 12-May-2017
