Abstract
In this paper, we propose an algorithm called HW-Transform that transforms hidden web data to XML format using machine learning, by extending STALKER to handle hyperlinked hidden web pages. One of the key features of our approach is that we identify key attributes of query results and transform them into XML attributes. These key attributes facilitate applications such as change detection and data integration by efficiently identifying related or identical results. Based on the proposed algorithm, we have implemented a prototype system called HW-STALKER in Java. Our experiments demonstrate that HW-Transform shows acceptable performance for transforming query results to XML.
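To make the idea of key attributes concrete, the following is a minimal sketch of what emitting one query result as XML might look like, with key fields written as XML attributes and the remaining fields as child elements. It is not the HW-Transform algorithm itself, which the abstract does not detail. The class name `KeyAttributeXmlSketch`, the method `toXml`, the element name `result`, and the sample fields are all hypothetical, and the sketch assumes the key fields have already been identified by some earlier step.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: not the paper's implementation.
class KeyAttributeXmlSketch {

    // Escape the characters that are special in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    // Serialize one extracted record. Fields named in `keys` become XML
    // attributes of the record element; all other fields become child elements.
    // Field names are assumed to be valid XML names.
    static String toXml(String tag, Map<String, String> fields, Set<String> keys) {
        StringBuilder attrs = new StringBuilder();
        StringBuilder children = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            String name = e.getKey();
            String value = escape(e.getValue());
            if (keys.contains(name)) {
                attrs.append(' ').append(name).append("=\"").append(value).append('"');
            } else {
                children.append('<').append(name).append('>')
                        .append(value)
                        .append("</").append(name).append('>');
            }
        }
        return "<" + tag + attrs + ">" + children + "</" + tag + ">";
    }

    public static void main(String[] args) {
        // Sample record with made-up values; LinkedHashMap keeps field order stable.
        Map<String, String> result = new LinkedHashMap<>();
        result.put("id", "A-17");
        result.put("title", "Used car");
        result.put("price", "4200");
        System.out.println(toXml("result", result, Set.of("id")));
    }
}
```

With the key exposed as an attribute, two versions of a result set could be matched on `id` alone before comparing the remaining fields, which is the kind of shortcut that change detection and data integration can benefit from.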
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
Cite this paper
Kovalev, V., Bhowmick, S.S., Madria, S. (2004). HW-STALKER: A Machine Learning-Based Approach to Transform Hidden Web Data to XML. In: Galindo, F., Takizawa, M., Traunmüller, R. (eds) Database and Expert Systems Applications. DEXA 2004. Lecture Notes in Computer Science, vol 3180. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30075-5_90
DOI: https://doi.org/10.1007/978-3-540-30075-5_90
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22936-0
Online ISBN: 978-3-540-30075-5
eBook Packages: Springer Book Archive