
HW-STALKER: A Machine Learning-Based Approach to Transform Hidden Web Data to XML

  • Conference paper
Database and Expert Systems Applications (DEXA 2004)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3180)

Abstract

In this paper, we propose an algorithm called HW-Transform that uses machine learning to transform hidden web data into XML format by extending STALKER to handle hyperlinked hidden web pages. One of the key features of our approach is that we identify key attributes of query results and transform them into XML attributes. These key attributes facilitate applications such as change detection and data integration by efficiently identifying related or identical results. Based on the proposed algorithm, we have implemented a prototype system called HW-STALKER in Java. Our experiments demonstrate that HW-Transform shows acceptable performance for transforming query results to XML.
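
To make the role of key attributes concrete, here is a minimal sketch in Python. It is not the HW-Transform algorithm and does no wrapper induction; it assumes the query results have already been extracted into records. The function names (`results_to_xml`, `changed_keys`), the field names (`id`, `title`, `price`), and the sample data are all hypothetical. The sketch only illustrates the idea stated in the abstract: key fields become XML attributes, and those attributes then let two result sets be matched for change detection.

```python
import xml.etree.ElementTree as ET


def results_to_xml(results, key_fields, root_tag="results", item_tag="result"):
    """Build an XML tree: key fields become attributes, other fields child elements."""
    root = ET.Element(root_tag)
    for record in results:
        item = ET.SubElement(root, item_tag)
        for name, value in record.items():
            if name in key_fields:
                item.set(name, str(value))  # key attribute -> XML attribute
            else:
                ET.SubElement(item, name).text = str(value)
    return root


def changed_keys(old_root, new_root, key_fields):
    """Return key tuples of results found in both trees whose non-key content differs."""
    def index(root):
        # Map each result's key tuple to its non-key content.
        return {
            tuple(item.get(k) for k in key_fields):
                {child.tag: child.text for child in item}
            for item in root
        }

    old, new = index(old_root), index(new_root)
    return sorted(k for k in old.keys() & new.keys() if old[k] != new[k])


# Hypothetical query results captured at two points in time.
old = results_to_xml([{"id": "42", "title": "Widget", "price": "9.99"}], ["id"])
new = results_to_xml([{"id": "42", "title": "Widget", "price": "8.49"}], ["id"])

xml_text = ET.tostring(old, encoding="unicode")
changed = changed_keys(old, new, ["id"])
```

Because identity lives in the attributes, matching two snapshots reduces to a dictionary lookup on the key tuple rather than a comparison of full subtrees.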

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kovalev, V., Bhowmick, S.S., Madria, S. (2004). HW-STALKER: A Machine Learning-Based Approach to Transform Hidden Web Data to XML. In: Galindo, F., Takizawa, M., Traunmüller, R. (eds) Database and Expert Systems Applications. DEXA 2004. Lecture Notes in Computer Science, vol 3180. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30075-5_90

  • DOI: https://doi.org/10.1007/978-3-540-30075-5_90

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-22936-0

  • Online ISBN: 978-3-540-30075-5

  • eBook Packages: Springer Book Archive
