Compact Indexing and Judicious Searching for Billion-Scale Microblog Retrieval

Published: 12 May 2017

Abstract

In this article, we study the problem of efficient top-k disjunctive query processing over a huge microblog dataset. For compact indexing, we categorize keywords into rare terms and common terms based on inverse document frequency (idf) and propose a tailored block-oriented organization to reduce memory consumption. For fast searching, we classify queries into three types based on term category and design an efficient search algorithm for each type. We conducted extensive experiments on a billion-scale Twitter dataset and examined performance with both simple and more advanced ranking functions. The results showed that, with a much smaller index, our search algorithms achieve a 2--3 times speedup over state-of-the-art solutions in both ranking scenarios.
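The term and query categorization described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the idf formula `log(N / df)`, the threshold value, the treatment of unseen terms as rare, and the labels `rare-only`, `common-only`, and `mixed` for the three query types are all assumptions made for illustration, since the abstract does not define them.

```python
import math
from typing import Dict, Iterable


def idf(num_docs: int, doc_freq: int) -> float:
    """Plain inverse document frequency, log(N / df). Assumed formula."""
    return math.log(num_docs / doc_freq)


def term_category(term: str, doc_freqs: Dict[str, int],
                  num_docs: int, idf_threshold: float) -> str:
    """Label a term 'rare' or 'common' by comparing its idf to a threshold."""
    df = doc_freqs.get(term, 0)
    if df == 0:
        # Assumption: a term that never occurs is treated as rare.
        return "rare"
    return "rare" if idf(num_docs, df) >= idf_threshold else "common"


def query_type(terms: Iterable[str], doc_freqs: Dict[str, int],
               num_docs: int, idf_threshold: float) -> str:
    """Classify a disjunctive query by the categories of its terms.

    The three labels are hypothetical names for the three query types.
    """
    categories = {
        term_category(t, doc_freqs, num_docs, idf_threshold) for t in terms
    }
    if not categories:
        raise ValueError("query must contain at least one term")
    if categories == {"rare"}:
        return "rare-only"
    if categories == {"common"}:
        return "common-only"
    return "mixed"


# Toy collection statistics (made up for the example).
NUM_DOCS = 1000
DOC_FREQS = {"the": 900, "earthquake": 5}
THRESHOLD = 3.0  # hypothetical idf cut-off

print(query_type(["the", "earthquake"], DOC_FREQS, NUM_DOCS, THRESHOLD))
```

A search engine built this way could dispatch each query to a routine specialized for its type, which is the general idea behind designing one search algorithm per query type.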

    Published In

    ACM Transactions on Information Systems  Volume 35, Issue 3
    July 2017
    410 pages
    ISSN:1046-8188
    EISSN:1558-2868
    DOI:10.1145/3026478

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 May 2017
    Accepted: 01 December 2016
    Revised: 01 August 2016
    Received: 01 April 2016
    Published in TOIS Volume 35, Issue 3

    Author Tags

    1. Top-k
    2. billion-scale
    3. disjunctive keyword search
    4. microblog

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • National Natural Science Foundation of China
    • Jiangsu Collaborative Innovation Center on Atmospheric Environment and Equipment Technology
    • Fundamental Research Funds for the Central Universities
    • Priority Academic Program Development of Jiangsu Higher Education Institutions

    Cited By

    • (2022) Unsupervised Entity Resolution With Blocking and Graph Algorithms. IEEE Transactions on Knowledge and Data Engineering 34:3 (1501-1515). DOI: 10.1109/TKDE.2020.2991063. Online publication date: 1-Mar-2022
    • (2021) Exploiting Real-time Search Engine Queries for Earthquake Detection: A Summary of Results. ACM Transactions on Information Systems 39:3 (1-32). DOI: 10.1145/3453842. Online publication date: 22-May-2021
    • (2019) Security topics related microblogs search based on deep convolutional neural networks. Neurocomputing. DOI: 10.1016/j.neucom.2018.09.105. Online publication date: Jul-2019
    • (2019) Embedding and predicting the event at early stage. World Wide Web 22:3 (1055-1074). DOI: 10.1007/s11280-018-0545-6. Online publication date: 21-May-2019
    • (2018) Video logo removal detection based on sparse representation. Multimedia Tools and Applications 77:22 (29303-29322). DOI: 10.5555/3288251.3288317. Online publication date: 1-Nov-2018
    • (2018) A Graph-Theoretic Fusion Framework for Unsupervised Entity Resolution. 2018 IEEE 34th International Conference on Data Engineering (ICDE) (713-724). DOI: 10.1109/ICDE.2018.00070. Online publication date: Apr-2018
    • (2017) Tree based searching approaches for integrated vehicle dispatching and container allocation in a transshipment hub. Expert Systems with Applications: An International Journal 74:C (139-150). DOI: 10.1016/j.eswa.2017.01.003. Online publication date: 15-May-2017