DOI: 10.1145/1645953.1645959
research-article

An empirical study on using hidden markov model for search interface segmentation

Published: 02 November 2009

Abstract

    This paper describes a hidden Markov model (HMM) based approach to search interface segmentation. Automatic processing of search interfaces is essential for accessing the invisible contents of the deep Web. It requires automatic segmentation, i.e., the task of grouping related components of an interface together. While a human can easily discern the logical relationships among interface components, machine processing of an interface is difficult. In this paper, we propose an approach to segmentation that leverages the probabilistic nature of the interface design process. The design process involves choosing components based on the underlying database query requirements and organizing them into suitable patterns. We simulate this process by creating an "artificial designer" in the form of a 2-layered HMM. The learned HMM acquires the implicit design knowledge required for segmentation. We empirically study the effectiveness of the approach across several representative domains of the deep Web. In terms of segmentation accuracy, the HMM-based approach outperforms an existing state-of-the-art approach by at least 10% in most cases. Furthermore, our cross-domain investigation shows that a single HMM trained on data with varied and frequent design patterns can accurately segment interfaces from multiple domains.
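
    To make the idea of HMM-based segmentation concrete, the following is a minimal sketch of Viterbi decoding over a sequence of interface components. It is not the paper's 2-layered model: it uses a single layer, and the state names (`begin`, `inside`), the component types (`text`, `textbox`, `select`), and all probabilities are hypothetical values chosen for illustration rather than learned from data.

```python
import math

# Hypothetical hidden states: whether a component opens a segment or continues one.
STATES = ("begin", "inside")

# Hypothetical, hand-picked parameters (a trained model would estimate these).
START = {"begin": 0.9, "inside": 0.1}
TRANS = {
    "begin": {"begin": 0.2, "inside": 0.8},
    "inside": {"begin": 0.5, "inside": 0.5},
}
EMIT = {
    "begin": {"text": 0.8, "textbox": 0.1, "select": 0.1},
    "inside": {"text": 0.1, "textbox": 0.5, "select": 0.4},
}


def viterbi(observations):
    """Return the most likely hidden-state sequence for the observed components."""
    if not observations:
        return []
    # scores[s] is the best log-probability of any state path ending in s.
    scores = {
        s: math.log(START[s]) + math.log(EMIT[s][observations[0]]) for s in STATES
    }
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for reaching s at this step.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][obs])
            )
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def segment(observations):
    """Group component indices into segments; a 'begin' state opens a new segment."""
    segments = []
    for index, state in enumerate(viterbi(observations)):
        if state == "begin" or not segments:
            segments.append([index])
        else:
            segments[-1].append(index)
    return segments
```

    Under these assumed parameters, a form read as a label, a textbox, a second label, and two selection lists would be split wherever the decoder assigns the `begin` state. Log-probabilities are used so that long component sequences do not underflow.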



      Published In

      CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management
      November 2009
      2162 pages
      ISBN:9781605585123
      DOI:10.1145/1645953
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. design pattern
      2. information extraction
      3. search interfaces
      4. segmentation

      Qualifiers

      • Research-article

      Acceptance Rates

      Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

      Cited By

      • (2021) Dependency-aware Form Understanding. 2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE), 139-149. DOI: 10.1109/ISSRE52982.2021.00026. Online publication date: Oct-2021
      • (2017) Heuristics-Based Schema Extraction for Deep Web Query Interfaces. 2017 IEEE International Conference on Information Reuse and Integration (IRI), 389-396. DOI: 10.1109/IRI.2017.80. Online publication date: Aug-2017
      • (2017) VR-Tree. Journal of Intelligent Information Systems 49:3, 367-390. DOI: 10.1007/s10844-017-0449-4. Online publication date: 1-Dec-2017
      • (2013) Web object identification for web automation and meta-search. Proceedings of the 3rd International Conference on Web Intelligence, Mining and Semantics, 1-12. DOI: 10.1145/2479787.2479798. Online publication date: 12-Jun-2013
      • (2013) Understanding query interfaces by statistical parsing. ACM Transactions on the Web 7:2, 1-22. DOI: 10.1145/2460383.2460387. Online publication date: 29-May-2013
      • (2013) Web-based closed-domain data extraction on online advertisements. Information Systems 38:2, 183-197. DOI: 10.1016/j.is.2012.07.006. Online publication date: 1-Apr-2013
      • (2013) The ontological key. The VLDB Journal - The International Journal on Very Large Data Bases 22:5, 615-640. DOI: 10.1007/s00778-013-0323-0. Online publication date: 1-Oct-2013
      • (2012) Vision-Based Label Extraction and Matching. Advanced Materials Research 459, 155-160. DOI: 10.4028/www.scientific.net/AMR.459.155. Online publication date: Jan-2012
      • (2012) Learning to discover complex mappings from web forms to ontologies. Proceedings of the 21st ACM international conference on Information and knowledge management, 1253-1262. DOI: 10.1145/2396761.2398427. Online publication date: 29-Oct-2012
      • (2012) OPAL. Proceedings of the 21st International Conference on World Wide Web, 353-356. DOI: 10.1145/2187980.2188047. Online publication date: 16-Apr-2012
