DOI: 10.1145/1935826.1935868

Joint training for open-domain extraction on the web: exploiting overlap when supervision is limited

Published: 09 February 2011

Abstract

We consider the problem of jointly training structured models for extraction from multiple web sources whose records partially overlap in content. This has important applications in open-domain extraction, e.g., a user materializing a table of interest from multiple relevant unstructured sources, or a site like Freebase augmenting an incomplete relation by extracting more rows from web sources. Such applications require extraction over arbitrary domains, so one cannot use a pre-trained extractor or demand a huge labeled dataset. We propose to overcome this lack of supervision by using content overlap across the related web sources. Existing methods of exploiting overlap have been developed under settings that do not generalize easily to the scale and diversity of overlap seen on web sources.
We present an agreement-based learning framework that jointly trains the models by biasing them to agree on the agreement regions, i.e., shared text segments. We present alternatives within our framework that trade off tractability, robustness to noise, and the extent of agreement enforced, and propose a scheme for partitioning agreement regions that leads to efficient training while maximizing overall accuracy. Further, we present a principled scheme to discover low-noise agreement regions in unlabeled data across multiple sources.
Through extensive experiments over 58 different extraction domains, we establish that our framework provides significant boosts over uncoupled training, and outperforms alternatives such as collective inference, staged training, and multi-view learning.
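
The abstract does not state the training objective, so the following is only a minimal sketch of what "biasing models to agree on shared text segments" could look like, not the authors' formulation. It assumes each source's model exposes a marginal label distribution for a shared segment, treats the two models as independent, and rewards the log-probability that both assign the same label. The names `agreement_log_prob`, `joint_objective`, and `weight` are hypothetical.

```python
import math

def agreement_log_prob(p1, p2, floor=1e-12):
    """Log-probability that two independent models give a shared segment the same label.

    p1, p2: dicts mapping label -> marginal probability (assumed inputs).
    """
    labels = set(p1) | set(p2)
    agree = sum(p1.get(y, 0.0) * p2.get(y, 0.0) for y in labels)
    # Floor the value so that total disagreement does not produce log(0).
    return math.log(max(agree, floor))

def joint_objective(labeled_logliks, region_marginals, weight=1.0):
    """Sum of per-source labeled log-likelihoods plus a weighted agreement reward.

    labeled_logliks: one log-likelihood per source, on its own labeled data.
    region_marginals: one (p1, p2) pair per agreement region.
    """
    agreement = sum(agreement_log_prob(p1, p2) for p1, p2 in region_marginals)
    return sum(labeled_logliks) + weight * agreement

# Two models that mostly agree on a segment score higher than two that conflict.
aligned = ({"Title": 0.9, "Other": 0.1}, {"Title": 0.9, "Other": 0.1})
conflicting = ({"Title": 0.9, "Other": 0.1}, {"Title": 0.1, "Other": 0.9})
print(joint_objective([-3.0, -2.5], [aligned]) > joint_objective([-3.0, -2.5], [conflicting]))
```

In this sketch, raising `weight` enforces agreement more strongly at the cost of trusting possibly noisy shared segments, which is one simple way to picture the trade-off the abstract describes.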

Supplementary Material

JPG File (wsdm2011_gupta_jto_01.jpg)
MP4 File (wsdm2011_gupta_jto_01.mp4)

Published In

WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining
February 2011
870 pages
ISBN:9781450304931
DOI:10.1145/1935826
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. collective training
  2. graphical models
  3. information extraction

Qualifiers

  • Research-article

Acceptance Rates

WSDM '11 Paper Acceptance Rate 83 of 372 submissions, 22%;
Overall Acceptance Rate 498 of 2,863 submissions, 17%
