Joint training for open-domain extraction on the web: exploiting overlap when supervision is limited

Authors:

Sunita Sarawagi

WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining

Pages 217 - 226

https://doi.org/10.1145/1935826.1935868

Published: 09 February 2011

Abstract

We consider the problem of jointly training structured models for extraction from multiple web sources whose records enjoy partial content overlap. This has important applications in open-domain extraction, e.g., a user materializing a table of interest from multiple relevant unstructured sources, or a site like Freebase augmenting an incomplete relation by extracting more rows from web sources. Such applications require extraction over arbitrary domains, so one cannot use a pre-trained extractor or demand a huge labeled dataset. We propose to overcome this lack of supervision by using content overlap across the related web sources. Existing methods of exploiting overlap were developed under settings that do not generalize easily to the scale and diversity of overlap seen on web sources.

We present an agreement-based learning framework that jointly trains the models by biasing them to agree on the agreement regions, i.e., shared text segments. We present alternatives within our framework that trade off tractability, robustness to noise, and extent of agreement enforced, and propose a scheme of partitioning agreement regions that leads to efficient training while maximizing overall accuracy. Further, we present a principled scheme to discover low-noise agreement regions in unlabeled data across multiple sources.
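
To make the idea of "biasing models to agree" concrete, the following is a minimal sketch, not the paper's actual objective. It assumes each source's model exposes a marginal label distribution for a shared text segment, and it uses one common form of agreement reward: the log-probability that labels drawn independently from the two models coincide. The function names, the `weight` parameter, and the toy labels and numbers are all hypothetical.

```python
import math

def agreement_log_prob(p_a, p_b):
    """Log-probability that two models pick the same label for one shared segment.

    p_a and p_b map label -> marginal probability under each source's model.
    """
    shared = set(p_a) & set(p_b)
    overlap = sum(p_a[y] * p_b[y] for y in shared)
    return math.log(overlap) if overlap > 0 else float("-inf")

def joint_objective(labeled_log_liks, agreement_pairs, weight=1.0):
    """Sum of per-source labeled log-likelihoods plus a weighted agreement reward."""
    agreement = sum(agreement_log_prob(a, b) for a, b in agreement_pairs)
    return sum(labeled_log_liks) + weight * agreement

# Toy marginals for one shared segment (hypothetical labels and numbers).
agree = agreement_log_prob({"Title": 0.9, "Other": 0.1}, {"Title": 0.8, "Other": 0.2})
disagree = agreement_log_prob({"Title": 0.9, "Other": 0.1}, {"Title": 0.2, "Other": 0.8})
```

In this sketch the reward is higher when both models place their mass on the same label, so maximizing the joint objective pulls the models toward consistent labelings of shared segments while the labeled terms keep each model anchored to its own supervision.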

Through extensive experiments over 58 different extraction domains, we establish that our framework provides significant boosts over uncoupled training, and scores over alternatives such as collective inference, staged training, and multi-view learning.

Index Terms

Joint training for open-domain extraction on the web: exploiting overlap when supervision is limited
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
  2. Machine learning
    1. Learning paradigms
    2. Machine learning approaches
      1. Markov decision processes
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Machine learning theory
      1. Markov decision processes

Published In

WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining

February 2011

870 pages

ISBN:9781450304931

DOI:10.1145/1935826

General Chair: Irwin King (CUHK, Hong Kong)

Program Chairs: Wolfgang Nejdl (L3S and University of Hannover, Germany), Hang Li (Microsoft Research Asia, China)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

WSDM'11

WSDM'11: Fourth ACM International Conference on Web Search and Data Mining

February 9 - 12, 2011

Hong Kong, China

Acceptance Rates

WSDM '11 Paper Acceptance Rate 83 of 372 submissions, 22%;

Overall Acceptance Rate 498 of 2,863 submissions, 17%

Cited By

Yuliana O, Chang C (2019). DCADE: divide and conquer alignment with dynamic encoding for full page data extraction. Applied Intelligence. Online publication date: 22-Jul-2019.
https://doi.org/10.1007/s10489-019-01499-0

Han X, Sun L, Singh S, Markovitch S (2017). Distant supervision via prototype-based global representation learning. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pp. 3443-3449. Online publication date: 4-Feb-2017.
https://dl.acm.org/doi/10.5555/3298023.3298069

Han X, Sun L (2016). Global distant supervision for relation extraction. Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 2950-2956. Online publication date: 12-Feb-2016.
https://dl.acm.org/doi/10.5555/3016100.3016315

Zheng V, Chang K, Mukhopadhyay S, Zhai C, Bertino E, Crestani F, Mostafa J, Tang J, Si L, Zhou X, Chang Y, Li Y, Sondhi P (2016). Regularizing Structured Classifier with Conditional Probabilistic Constraints for Semi-supervised Learning. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 1029-1038. Online publication date: 24-Oct-2016.
https://dl.acm.org/doi/10.1145/2983323.2983860

Yanxing Di, Wei Song, Hanshi Wang, Lizhen Liu (2016). Research on open domain Named entity recognition based on Chinese query logs. 2016 IEEE Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC), pp. 40-44. Online publication date: Oct-2016.
https://doi.org/10.1109/IMCEC.2016.7867109

Leng B, Guo S, Zhang X, Xiong Z (2015). 3D object retrieval with stacked local convolutional autoencoder. Signal Processing 112:C, pp. 119-128. Online publication date: 1-Jul-2015.
https://dl.acm.org/doi/10.1016/j.sigpro.2014.09.005

Paşca M, Chung C, Broder A, Shim K, Suel T (2014). Acquisition of open-domain classes via intersective semantics. Proceedings of the 23rd international conference on World wide web, pp. 551-562. Online publication date: 7-Apr-2014.
https://dl.acm.org/doi/10.1145/2566486.2567966

Kondreddi S, Triantafillou P, Weikum G (2014). Combining information extraction and human computing for crowdsourced knowledge acquisition. 2014 IEEE 30th International Conference on Data Engineering, pp. 988-999. Online publication date: Mar-2014.
https://doi.org/10.1109/ICDE.2014.6816717

Chakrabarti S, Ramakrishnan G, Ramamritham K, Sarawagi S, Sudarshan S (2013). Data-based research at IIT Bombay. ACM SIGMOD Record 42:1, pp. 38-43. Online publication date: 1-May-2013.
https://dl.acm.org/doi/10.1145/2481528.2481536

Suchanek F, Weikum G, Ross K, Srivastava D, Papadias D (2013). Knowledge harvesting in the big-data era. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pp. 933-938. Online publication date: 22-Jun-2013.
https://dl.acm.org/doi/10.1145/2463676.2463724
