DOI: 10.1145/1739041.1739047
research-article

Beyond pages: supporting efficient, scalable entity search with dual-inversion index

Published: 22 March 2010

Abstract

Entity search, a significant departure from page-based retrieval, finds data, i.e., entities, embedded in documents directly and holistically across the whole collection. This paper aims to distill and abstract the essential computation requirements of entity search. From the dual views of reasoning (entity as input and entity as output), we propose a dual-inversion framework, with two indexing and partition schemes, for efficient and scalable query processing. We systematically evaluate our framework using a prototype over a 3TB real Web corpus with 150M pages and over 20 extracted entity types. Our experiments in two concrete application settings show that our techniques achieve, on average, a speed-up of 2 to 4 orders of magnitude over the keyword-based baseline, with reasonable space overhead.
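
The abstract does not spell out the index layouts, so the following is only a toy sketch of what the two views could look like, not the paper's implementation. It assumes a tiny hand-made corpus, a fixed proximity window, and hypothetical function names. In the "entity as input" view, entity types are indexed by document alongside keywords and joined at query time; in the "entity as output" view, postings are pre-grouped by entity instance so a query reduces to a lookup.

```python
from collections import defaultdict

# Toy corpus (made up for illustration): each document is a list of
# (token, entity_type) pairs; entity_type is None for plain keywords.
DOCS = {
    "d1": [("amazon", None), ("support", None), ("800-201-7575", "phone")],
    "d2": [("call", None), ("amazon", None), ("at", None), ("800-201-7575", "phone")],
    "d3": [("ebay", None), ("hotline", None), ("888-749-3229", "phone")],
}


def build_doc_inverted(docs):
    """Entity as input: keywords and entity types both get document-keyed postings."""
    keyword_idx = defaultdict(list)  # keyword -> [(doc, pos)]
    type_idx = defaultdict(list)     # entity type -> [(doc, pos, instance)]
    for doc in sorted(docs):
        for pos, (token, etype) in enumerate(docs[doc]):
            if etype is None:
                keyword_idx[token].append((doc, pos))
            else:
                type_idx[etype].append((doc, pos, token))
    return keyword_idx, type_idx


def search_doc_inverted(keyword_idx, type_idx, keyword, etype, window):
    """Join keyword and entity-type postings on document, then count per instance."""
    by_doc = defaultdict(list)
    for doc, pos in keyword_idx.get(keyword, []):
        by_doc[doc].append(pos)
    support = defaultdict(set)  # instance -> documents where it co-occurs
    for doc, epos, instance in type_idx.get(etype, []):
        if any(abs(epos - kpos) <= window for kpos in by_doc.get(doc, [])):
            support[instance].add(doc)
    return {inst: len(ds) for inst, ds in support.items()}


def build_entity_inverted(docs, window):
    """Entity as output: (keyword, type) -> instance -> supporting documents."""
    idx = defaultdict(lambda: defaultdict(set))
    for doc, tokens in docs.items():
        for epos, (instance, etype) in enumerate(tokens):
            if etype is None:
                continue
            lo = max(0, epos - window)
            hi = min(len(tokens), epos + window + 1)
            for kpos in range(lo, hi):
                keyword, ktype = tokens[kpos]
                if ktype is None:
                    idx[(keyword, etype)][instance].add(doc)
    return idx


def search_entity_inverted(idx, keyword, etype):
    """Look up pre-grouped postings; no query-time join is needed."""
    postings = idx.get((keyword, etype), {})
    return {inst: len(ds) for inst, ds in postings.items()}


if __name__ == "__main__":
    WINDOW = 3
    kw_idx, ty_idx = build_doc_inverted(DOCS)
    ent_idx = build_entity_inverted(DOCS, WINDOW)
    print(search_doc_inverted(kw_idx, ty_idx, "amazon", "phone", WINDOW))
    print(search_entity_inverted(ent_idx, "amazon", "phone"))
```

Both searches return the same per-entity document counts on this corpus. The trade-off the sketch is meant to suggest is query-time joining versus precomputed, larger postings; how the paper actually balances the two is not stated in the abstract.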

Published In

EDBT '10: Proceedings of the 13th International Conference on Extending Database Technology
March 2010
741 pages
ISBN:9781605589459
DOI:10.1145/1739041

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

EDBT/ICDT '10 joint conference
March 22 - 26, 2010
Lausanne, Switzerland

Acceptance Rates

Overall Acceptance Rate 7 of 10 submissions, 70%

Cited By

  • (2019) Neural embedding-based indices for semantic search. Information Processing and Management: an International Journal 56:3 (733-755). DOI: 10.1016/j.ipm.2018.10.015. Online publication date: 1-May-2019
  • (2018) Term-Based Models for Entity Ranking. Entity-Oriented Search (57-99). DOI: 10.1007/978-3-319-93935-3_3. Online publication date: 3-Oct-2018
  • (2018) Introduction. Entity-Oriented Search (1-23). DOI: 10.1007/978-3-319-93935-3_1. Online publication date: 3-Oct-2018
  • (2015) DB-IR integration using tight-coupling in the Odysseus DBMS. World Wide Web 18:3 (491-520). DOI: 10.1007/s11280-013-0264-y. Online publication date: 1-May-2015
  • (2014) Post-analysis of Keyword-Based Search Results Using Entity Mining, Linked Data, and Link Analysis at Query Time. Proceedings of the 2014 IEEE International Conference on Semantic Computing (36-43). DOI: 10.1109/ICSC.2014.11. Online publication date: 16-Jun-2014
  • (2014) Exploratory Professional Search through Semantic Post-Analysis of Search Results. Professional Search in the Modern World (166-192). DOI: 10.1007/978-3-319-12511-4_9. Online publication date: 2014
  • (2012) BOSS: context-enhanced search for biomedical objects. BMC Medical Informatics and Decision Making 12:S1. DOI: 10.1186/1472-6947-12-S1-S7. Online publication date: 30-Apr-2012
  • (2012) Compressed data structures for annotated web search. Proceedings of the 21st international conference on World Wide Web (121-130). DOI: 10.1145/2187836.2187854. Online publication date: 16-Apr-2012
  • (2012) Finding Facet Content on Web by Position Inverted Index. Proceedings of the 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (1699-1703). DOI: 10.1109/HPCC.2012.253. Online publication date: 25-Jun-2012
  • (2011) BOSS. Proceedings of the ACM fifth international workshop on Data and text mining in biomedical informatics (19-26). DOI: 10.1145/2064696.2064702. Online publication date: 24-Oct-2011
