research-article

Link-based hidden attribute discovery for objects on Web

Authors:

Ariel Fuxman

EDBT/ICDT '11: Proceedings of the 14th International Conference on Extending Database Technology

Pages 473 - 484

https://doi.org/10.1145/1951365.1951421

Published: 21 March 2011

Abstract

Information extraction from the Web is of growing importance. Objects on the Web are often associated with many attributes that describe them, and it is essential to extract these attributes and map them to their corresponding objects. However, much attribute information about an object is hidden in dynamic user interaction and does not appear on the Web page that describes the object. Existing information extraction approaches draw only on the object's own Web page, so much of this attribute information is lost. In this paper, we study dynamic user interaction on exploratory search Web sites and propose a novel link-based approach to discover attributes and map them to objects. We build an exploratory search model for such Web sites, and based on the model we propose algorithms for identifying and clustering related Web pages and for mining the relationships among them. With the unsupervised method in our approach, we are able to discover hidden attributes that are not explicitly shown on object Web pages. We test our approach on two online shopping Web sites and achieve high precision and recall: for entirely crawled Web sites, precision and recall are 98% and 97%, respectively; for randomly crawled (sampled) Web sites, they are 98% and 80%, respectively.
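The abstract reports precision and recall but does not spell out how they are computed. As a minimal sketch, assuming the evaluation compares a set of discovered (object, attribute) pairs against a hand-labeled gold set, the two measures could be computed as below. The function name `precision_recall` and the toy camera data are hypothetical and do not come from the paper.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of (object, attribute) pairs.

    Assumption: a predicted pair counts as correct only if the identical
    pair appears in the gold set. The paper may use a different matching rule.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against division by zero when either set is empty.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: four gold pairs, four predicted pairs, three in common.
gold = {
    ("camera-1", "brand:Canon"),
    ("camera-1", "megapixels:12"),
    ("camera-2", "brand:Nikon"),
    ("camera-2", "megapixels:10"),
}
predicted = {
    ("camera-1", "brand:Canon"),
    ("camera-1", "megapixels:12"),
    ("camera-2", "brand:Nikon"),
    ("camera-2", "color:red"),
}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under this reading, a sampled crawl would be expected to lower recall more than precision, because pages that were never fetched cannot contribute attributes, while the pairs that are found remain mostly correct. This matches the direction of the reported figures but is an interpretation, not a claim made in the abstract.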

Cited By

Hao Wei, Chen X, Chao Wang (2012). User behavior analyses based on network data stream scenario. 2012 IEEE 14th International Conference on Communication Technology, pp. 1017-1021. Online publication date: Nov-2012.
https://doi.org/10.1109/ICCT.2012.6511348

Zhu Y, Yin G, Li X, Wang H, Shi D, Yuan L (2011). Exploiting attribute redundancy for web entity data extraction. Proceedings of the 13th international conference on Asia-pacific digital libraries: for cultural heritage, knowledge dissemination, and future creation, pp. 98-107. Online publication date: 24-Oct-2011.
https://dl.acm.org/doi/10.5555/2075271.2075289

Zhu Y, Yin G, Li X, Wang H, Shi D, Yuan L (2011). Exploiting Attribute Redundancy for Web Entity Data Extraction. Digital Libraries: For Cultural Heritage, Knowledge Dissemination, and Future Creation, pp. 98-107. Online publication date: 2011.
https://doi.org/10.1007/978-3-642-24826-9_15

Published In

EDBT/ICDT '11: Proceedings of the 14th International Conference on Extending Database Technology

March 2011

587 pages

ISBN:9781450305280

DOI:10.1145/1951365

Editors:
Anastasia Ailamaki, EPFL, Switzerland
Sihem Amer-Yahia, Yahoo! Research
Jignesh Patel, University of Wisconsin-Madison
Tore Risch, Uppsala University, Sweden
Pierre Senellart, Télécom ParisTech, France
Julia Stoyanovich, University of Pennsylvania

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Microsoft Research

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

EDBT/ICDT '11 joint conference

March 21 - 24, 2011

Uppsala, Sweden

Sponsor: Microsoft Research

Acceptance Rates

Overall Acceptance Rate 7 of 10 submissions, 70%

Article Metrics

Total Citations: 3
Total Downloads: 219
Downloads (Last 12 months): 1
Downloads (Last 6 weeks): 0

Reflects downloads up to 23 Feb 2025
