DOI: 10.1145/1458082.1458253
Research article · CIKM Conference Proceedings

Intra-document structural frequency features for semi-supervised domain adaptation

Published: 26 October 2008

Abstract

In this work we address a gap often encountered by researchers: they have few or no labeled examples from their desired target domain, yet have access to large amounts of labeled data from related but distinct source domains, with seemingly no way to transfer knowledge from one to the other. Experimentally, we focus on extracting protein mentions from academic publications in biology, where the source domain data are abstracts labeled with protein mentions and the target domain data are wholly unlabeled captions. We mine the large number of such full-text articles freely available on the Internet to supplement the limited amount of annotated data. By exploiting the explicit and implicit common structure of the different subsections of these documents, including the unlabeled full text, we generate robust features that are insensitive to changes in the marginal and conditional distributions of classes and data across domains. We supplement these domain-insensitive features with automatically obtained high-confidence positive and negative predictions on the target domain to learn extractors that generalize well from one section of a document to another. Finally, lacking labeled target test data, we employ comparative user preference studies to evaluate the relative performance of the proposed methods with respect to existing baselines.
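The structural frequency idea can be made concrete with a small sketch. The snippet below is a minimal illustration, not the authors' implementation: the section names, probability thresholds, and function names are assumptions made for this example. It computes, for a candidate token, how frequently that token appears in each section of the same full-text article (a feature that depends on document-internal usage rather than on which section the labels came from), and it shows one simple way to harvest high-confidence positive and negative predictions on the unlabeled target domain for use as additional training data.

from typing import Dict, List, Tuple

# Assumed section split of one full-text article; the paper's exact
# sectioning may differ.
SECTIONS = ["abstract", "body", "captions"]

def structural_frequency_features(token: str,
                                  doc_sections: Dict[str, List[str]]) -> Dict[str, float]:
    """Relative frequency of `token` in each section of a single document.

    These per-section frequencies describe how a string is used across a
    document's parts, so they change comparatively little when training
    moves from labeled abstracts to unlabeled captions.
    """
    feats = {}
    for sec in SECTIONS:
        words = doc_sections.get(sec, [])
        count = sum(1 for w in words if w.lower() == token.lower())
        feats[f"freq_{sec}"] = count / max(len(words), 1)
    return feats

def high_confidence_examples(scored: List[Tuple[str, float]],
                             pos_thresh: float = 0.95,
                             neg_thresh: float = 0.05):
    """Split (token, predicted-protein-probability) pairs into confident
    positives and negatives to augment the source-domain training set."""
    positives = [t for t, p in scored if p >= pos_thresh]
    negatives = [t for t, p in scored if p <= neg_thresh]
    return positives, negatives

if __name__ == "__main__":
    doc = {
        "abstract": "the p53 protein regulates apoptosis in tumor cells".split(),
        "body": "we measured p53 and actin expression in cell lines".split(),
        "captions": "figure 1 p53 localization in hela cells".split(),
    }
    print(structural_frequency_features("p53", doc))
    print(high_confidence_examples([("p53", 0.98), ("figure", 0.01), ("actin", 0.60)]))

In the paper the resulting features feed a sequence extractor trained on the labeled source sections; the per-token scoring and fixed thresholds above stand in for whatever confidence criterion is actually used to pick pseudo-labeled target examples.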


Cited By

  • Exploiting Structural Consistencies with Stacked Conditional Random Fields. Mathematical Methodologies in Pattern Recognition and Machine Learning, pp. 111-125, 2012. DOI: 10.1007/978-1-4614-5076-4_8
  • Co-regularization based semi-supervised domain adaptation. Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 1, pp. 478-486, 2010. DOI: 10.5555/2997189.2997243
  • Frustratingly easy semi-supervised domain adaptation. Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing, pp. 53-59, 2010. DOI: 10.5555/1870526.1870534
  • Selection of Effective Sentences from a Corpus to Improve the Accuracy of Identification of Protein Names. IPSJ Transactions on Bioinformatics, 2, pp. 93-100, 2009. DOI: 10.2197/ipsjtbio.2.93


      Published In

      CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management
      October 2008
      1562 pages
      ISBN:9781595939913
      DOI:10.1145/1458082


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. ir_content::structured
      2. km_information_extraction
      3. km_statistical_techniques
      4. km_text_mining
      5. meta data
      6. semi structured
      7. social tagging

      Qualifiers

      • Research-article

      Conference

CIKM '08: Conference on Information and Knowledge Management
October 26 - 30, 2008
Napa Valley, California, USA

      Acceptance Rates

      Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
