Answering table queries on the web using column keywords

Published: 01 June 2012

Abstract

We present the design of a structured search engine that returns a multi-column table in response to a query consisting of keywords describing each of its columns. We answer such queries by exploiting the millions of tables on the Web, because these are much richer sources of structured knowledge than free-format text. However, a corpus of tables harvested from arbitrary HTML web pages presents huge challenges of diversity and redundancy not seen in centrally edited knowledge bases. We concentrate on one concrete task in this paper: given a set of Web tables T1, ..., Tn and a query Q with q sets of keywords Q1, ..., Qq, decide for each Ti whether it is relevant to Q and, if so, identify the mapping between the columns of Ti and the query columns. We represent this task as a graphical model that jointly maps all tables by incorporating diverse sources of clues spanning matches in different parts of the table, corpus-wide co-occurrence statistics, and content overlap across table columns. We define a novel query segmentation model for matching keywords to table columns, and a robust mechanism for exploiting content overlap across table columns. We design efficient inference algorithms based on bipartite matching and constrained graph cuts to solve the joint labeling task. Experiments on a workload of 59 queries over a corpus of 25 million web tables show a significant boost in accuracy over baseline IR methods.
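
To make the task statement concrete, the following is a minimal sketch of the per-table decision only: given query keyword sets Q1, ..., Qq and the headers of a single table Ti, pick a one-to-one column mapping and decide relevance. It is not the paper's graphical model. The Jaccard header score, the `min_score` threshold, the "relevant if any column is mapped" rule, and the function names are all assumptions made for illustration, and a brute-force search over assignments stands in for a proper bipartite matching algorithm.

```python
from itertools import permutations


def tokens(text):
    """Lowercase whitespace tokenization (a deliberately crude stand-in)."""
    return set(text.lower().split())


def overlap_score(query_keywords, header):
    """Jaccard overlap between one query column's keywords and one header."""
    q, h = tokens(query_keywords), tokens(header)
    if not q or not h:
        return 0.0
    return len(q & h) / len(q | h)


def map_columns(query_columns, table_headers, min_score=0.2):
    """Return (relevant, mapping) for a single table.

    mapping[i] is the index of the table column assigned to query column i,
    or None if that query column is left unmapped. Each table column is used
    at most once, as in a bipartite matching.
    """
    n_q = len(query_columns)
    # One None slot per query column lets any query column stay unmapped.
    slots = list(range(len(table_headers))) + [None] * n_q
    best_total, best = -1.0, (None,) * n_q
    for perm in permutations(slots, n_q):
        total = 0.0
        for i, j in enumerate(perm):
            if j is None:
                continue
            s = overlap_score(query_columns[i], table_headers[j])
            # Penalize weak pairs so that leaving a column unmapped wins.
            total += s if s >= min_score else -1.0
        if total > best_total:
            best_total, best = total, perm
    mapping = list(best)
    # Hypothetical relevance rule: at least one query column found a match.
    relevant = any(j is not None for j in mapping)
    return relevant, mapping


query = ["country name", "currency"]
print(map_columns(query, ["Rank", "Country", "Currency code", "Population"]))
print(map_columns(query, ["Player", "Team", "Goals"]))
```

The brute-force loop is exponential in the number of columns and is only reasonable for the handful of columns in a toy example. The abstract's joint model additionally draws on table content, corpus-wide co-occurrence statistics, and overlap across tables, none of which this sketch attempts.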

Published In

Proceedings of the VLDB Endowment  Volume 5, Issue 10
June 2012
180 pages

Publisher

VLDB Endowment
