HW-STALKER: A Machine Learning-Based Approach to Transform Hidden Web Data to XML

Kovalev, Vladimir; Bhowmick, Sourav S.; Madria, Sanjay

doi:10.1007/978-3-540-30075-5_90

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3180)

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

In this paper, we propose an algorithm called HW-Transform for transforming hidden web data to XML format using machine learning by extending STALKER to handle hyperlinked hidden web pages. One of the key features of our approach is that we identify and transform key attributes of query results into XML attributes. These key attributes facilitate applications such as change detection and data integration by efficiently identifying related or identical results. Based on the proposed algorithm, we have implemented a prototype system called HW-STALKER using Java. Our experiments demonstrate that HW-Transform shows acceptable performance for transforming query results to XML.
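
To illustrate the idea of mapping key attributes of query results to XML attributes, the following is a minimal sketch in Python and not the authors' HW-Transform algorithm or their Java prototype. It assumes the query results have already been extracted into dictionaries; the names `results_to_xml`, `changed_results`, the `isbn` key, and the sample records are all hypothetical. It also assumes field names are valid XML tag names. Key fields become XML attributes, the remaining fields become child elements, and a simple change detector matches old and new results by their key attributes.

```python
import xml.etree.ElementTree as ET


def results_to_xml(records, key_fields, root_tag="results", item_tag="result"):
    """Build an XML tree in which key fields are attributes and the
    remaining fields are child elements (illustrative layout only)."""
    root = ET.Element(root_tag)
    for record in records:
        item = ET.SubElement(root, item_tag)
        for name, value in record.items():
            if name in key_fields:
                item.set(name, str(value))        # key attribute -> XML attribute
            else:
                child = ET.SubElement(item, name)  # other data -> child element
                child.text = str(value)
    return root


def key_of(item, key_fields):
    """Return the tuple of key-attribute values that identifies a result."""
    return tuple(item.get(name) for name in key_fields)


def changed_results(old_root, new_root, key_fields):
    """Return keys of results present in both trees whose non-key content differs."""
    old_by_key = {key_of(item, key_fields): item for item in old_root}
    changed = []
    for item in new_root:
        old_item = old_by_key.get(key_of(item, key_fields))
        if old_item is None:
            continue  # new result, not a change to an existing one
        old_fields = {child.tag: child.text for child in old_item}
        new_fields = {child.tag: child.text for child in item}
        if old_fields != new_fields:
            changed.append(key_of(item, key_fields))
    return changed


# Hypothetical query results from two visits to the same hidden web source.
old = results_to_xml([{"isbn": "111", "title": "A", "price": "10"}], ["isbn"])
new = results_to_xml(
    [
        {"isbn": "111", "title": "A", "price": "12"},
        {"isbn": "222", "title": "B", "price": "5"},
    ],
    ["isbn"],
)
print(ET.tostring(new, encoding="unicode"))
print(changed_results(old, new, ["isbn"]))
```

Because identical results share the same key attributes, a consumer can match them with a dictionary lookup rather than comparing full result contents, which is the kind of benefit the abstract describes for change detection and data integration.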

Author information

Authors and Affiliations

School of Computer Engineering, Nanyang Technological University, Singapore
Vladimir Kovalev & Sourav S. Bhowmick
Department of Computer Science, University of Missouri-Rolla, Rolla, MO, 65409, USA
Sanjay Madria

Editor information

Editors and Affiliations

University of Zaragoza, Ciudad Universitaria, Plaza San Francisco, 50009, Zaragoza
Fernando Galindo
Seikei University, Japan
Makoto Takizawa
Institute of Informatics in Business and Government, University of Linz, Altenbergerstr. 69, 4040, Linz, Austria
Roland Traunmüller

Cite this paper

Kovalev, V., Bhowmick, S.S., Madria, S. (2004). HW-STALKER: A Machine Learning-Based Approach to Transform Hidden Web Data to XML. In: Galindo, F., Takizawa, M., Traunmüller, R. (eds) Database and Expert Systems Applications. DEXA 2004. Lecture Notes in Computer Science, vol 3180. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30075-5_90

DOI: https://doi.org/10.1007/978-3-540-30075-5_90
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22936-0
Online ISBN: 978-3-540-30075-5
eBook Packages: Springer Book Archive
