DOI:10.1145/1951365.1951421
research-article

Link-based hidden attribute discovery for objects on Web

Published: 21 March 2011

Abstract

Information extraction from the Web is of growing importance. Objects on the Web are often associated with many attributes that describe them, and it is essential to extract these attributes and map them to the corresponding objects. However, much of the attribute information about an object is hidden in dynamic user interaction and does not appear on the Web page that describes the object. Existing information extraction approaches draw on the object Web page only, so much of this attribute information is lost. In this paper, we study dynamic user interaction on exploratory search Web sites and propose a novel link-based approach to discover attributes and map them to objects. We build an exploratory search model for such Web sites and, based on the model, propose algorithms for identifying and clustering related Web pages and for mining the relationships among them. Using the unsupervised method in our approach, we can discover hidden attributes that are not explicitly shown on object Web pages. We test our approach on two online shopping Web sites and achieve high precision and recall: for entirely crawled Web sites, precision and recall are 98% and 97% respectively; for randomly crawled (sampled) Web sites, they are 98% and 80% respectively.
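The link-based idea and the precision/recall evaluation described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: it assumes, purely for illustration, that a query link encodes attribute-value pairs in its URL query string and that every object page listed on the resulting page inherits those pairs. The names `discover_attributes`, `precision_recall`, the `shop.example` URLs, and the gold labels are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse, parse_qsl


def discover_attributes(result_pages):
    """Map each object URL to the (attribute, value) pairs of query links that list it.

    result_pages: dict of query-link URL -> list of object-page URLs on that result page.
    Assumption: the query string of the link names the attribute and its value.
    """
    attributes = defaultdict(set)
    for query_url, object_urls in result_pages.items():
        pairs = parse_qsl(urlparse(query_url).query)
        for object_url in object_urls:
            attributes[object_url].update(pairs)
    return dict(attributes)


def precision_recall(predicted, gold):
    """Compare predicted and gold (object, attribute, value) assignments."""
    pred = {(obj, pair) for obj, pairs in predicted.items() for pair in pairs}
    true = {(obj, pair) for obj, pairs in gold.items() for pair in pairs}
    correct = len(pred & true)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(true) if true else 0.0
    return precision, recall


# Hypothetical partial crawl: item 3 is never reached, so its attribute is missed.
crawl = {
    "http://shop.example/search?brand=Acme": ["/item/1", "/item/2"],
    "http://shop.example/search?color=red": ["/item/2"],
    "http://shop.example/search?brand=Acme&color=blue": ["/item/1"],
}
gold = {
    "/item/1": {("brand", "Acme"), ("color", "blue")},
    "/item/2": {("brand", "Acme"), ("color", "red")},
    "/item/3": {("brand", "Zed")},
}

found = discover_attributes(crawl)
precision, recall = precision_recall(found, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy setting a partial crawl leaves precision untouched but lowers recall, which is at least consistent with the gap the abstract reports between entirely crawled and sampled Web sites.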


Cited By

  • (2012) User behavior analyses based on network data stream scenario. 2012 IEEE 14th International Conference on Communication Technology, 1017-1021. DOI:10.1109/ICCT.2012.6511348. Online publication date: Nov-2012
  • (2011) Exploiting attribute redundancy for web entity data extraction. Proceedings of the 13th international conference on Asia-pacific digital libraries: for cultural heritage, knowledge dissemination, and future creation, 98-107. DOI:10.5555/2075271.2075289. Online publication date: 24-Oct-2011
  • (2011) Exploiting Attribute Redundancy for Web Entity Data Extraction. Digital Libraries: For Cultural Heritage, Knowledge Dissemination, and Future Creation, 98-107. DOI:10.1007/978-3-642-24826-9_15. Online publication date: 2011

Published In

EDBT/ICDT '11: Proceedings of the 14th International Conference on Extending Database Technology
March 2011
587 pages
ISBN:9781450305280
DOI:10.1145/1951365
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

  • Microsoft Research

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. attribute discovery
  2. attribute labeling
  3. exploratory search
  4. information extraction
  5. query link

Qualifiers

  • Research-article

Conference

EDBT/ICDT '11
Sponsor:
  • Microsoft Research
EDBT/ICDT '11: EDBT/ICDT '11 joint conference
March 21 - 24, 2011
Uppsala, Sweden

Acceptance Rates

Overall Acceptance Rate 7 of 10 submissions, 70%
