DOI: 10.1145/1458082.1458253
Research article · CIKM Conference Proceedings

Intra-document structural frequency features for semi-supervised domain adaptation

Published: 26 October 2008

Abstract

In this work we address a gap often encountered by researchers: they have few or no labeled examples from their desired target domain, yet have access to large amounts of labeled data from related but distinct source domains, with seemingly no way to transfer knowledge from one to the other. Experimentally, we focus on extracting protein mentions from academic publications in biology, where the source domain data are abstracts labeled with protein mentions and the target domain data are wholly unlabeled captions. We mine the large number of such full-text articles freely available on the Internet to supplement the limited amount of annotated data. By exploiting the explicit and implicit common structure of the different subsections of these documents, including the unlabeled full text, we generate robust features that are insensitive to changes in the marginal and conditional distributions of classes and data across domains. We supplement these domain-insensitive features with automatically obtained high-confidence positive and negative predictions on the target domain to learn extractors that generalize well from one section of a document to another. Finally, lacking labeled target test data, we employ comparative user preference studies to evaluate the relative performance of the proposed methods with respect to existing baselines.
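The structural frequency idea can be made concrete with a small sketch. The snippet below is a minimal illustration, not the authors' implementation: the section names, probability thresholds, and function names are assumptions made for this example. It computes, for a candidate token, how frequently that token appears in each section of the same full-text article (a feature that depends on document-internal usage rather than on which section the labels came from), and it shows one simple way to harvest high-confidence positive and negative predictions on the unlabeled target domain for use as additional training data.

from typing import Dict, List, Tuple

# Assumed section split of one full-text article; the paper's exact
# sectioning may differ.
SECTIONS = ["abstract", "body", "captions"]

def structural_frequency_features(token: str,
                                  doc_sections: Dict[str, List[str]]) -> Dict[str, float]:
    """Relative frequency of `token` in each section of a single document.

    These per-section frequencies describe how a string is used across a
    document's parts, so they change comparatively little when training
    moves from labeled abstracts to unlabeled captions.
    """
    feats = {}
    for sec in SECTIONS:
        words = doc_sections.get(sec, [])
        count = sum(1 for w in words if w.lower() == token.lower())
        feats[f"freq_{sec}"] = count / max(len(words), 1)
    return feats

def high_confidence_examples(scored: List[Tuple[str, float]],
                             pos_thresh: float = 0.95,
                             neg_thresh: float = 0.05):
    """Split (token, predicted-protein-probability) pairs into confident
    positives and negatives to augment the source-domain training set."""
    positives = [t for t, p in scored if p >= pos_thresh]
    negatives = [t for t, p in scored if p <= neg_thresh]
    return positives, negatives

if __name__ == "__main__":
    doc = {
        "abstract": "the p53 protein regulates apoptosis in tumor cells".split(),
        "body": "we measured p53 and actin expression in cell lines".split(),
        "captions": "figure 1 p53 localization in hela cells".split(),
    }
    print(structural_frequency_features("p53", doc))
    print(high_confidence_examples([("p53", 0.98), ("figure", 0.01), ("actin", 0.60)]))

In the paper the resulting features feed a sequence extractor trained on the labeled source sections; the per-token scoring and fixed thresholds above stand in for whatever confidence criterion is actually used to pick pseudo-labeled target examples.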


Cited By

  • Exploiting Structural Consistencies with Stacked Conditional Random Fields. Mathematical Methodologies in Pattern Recognition and Machine Learning, pp. 111-125, 2012. DOI: 10.1007/978-1-4614-5076-4_8
  • Co-regularization based semi-supervised domain adaptation. Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 1, pp. 478-486, 2010. DOI: 10.5555/2997189.2997243
  • Frustratingly easy semi-supervised domain adaptation. Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing, pp. 53-59, 2010. DOI: 10.5555/1870526.1870534
  • Selection of Effective Sentences from a Corpus to Improve the Accuracy of Identification of Protein Names. IPSJ Transactions on Bioinformatics, 2, pp. 93-100, 2009. DOI: 10.2197/ipsjtbio.2.93


      Published In

      CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management
      October 2008
      1562 pages
      ISBN:9781595939913
      DOI:10.1145/1458082


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. ir_content::structured
      2. km_information_extraction
      3. km_statistical_techniques
      4. km_text_mining
      5. meta data
      6. semi structured
      7. social tagging

      Qualifiers

      • Research-article

      Conference

CIKM '08: Conference on Information and Knowledge Management
October 26 - 30, 2008
Napa Valley, California, USA

      Acceptance Rates

      Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
