research-article

Specification and discovery of web patterns

Authors:

Kang Zhang

Information Sciences—Informatics and Computer Science, Intelligent Systems, Applications: An International Journal, Volume 328, Issue C

Pages 528 - 545

https://doi.org/10.1016/j.ins.2015.08.052

Published: 20 January 2016

Abstract

Finding useful information on the Web becomes increasingly difficult as the volume of Web data grows rapidly. To facilitate effective Web browsing, Web designers usually display the same type of information with a consistent layout (referred to as a Web pattern). Discovering Web patterns can benefit many applications, such as structured data extraction. This paper presents a generic framework, based on graph grammars, for discovering Web patterns and recognizing their instances (i.e., structured data). In our framework, a Web pattern is visually yet formally specified as a graph grammar, which is automatically induced by a grammar induction engine. The key feature of the engine is that it converts the problem of (2-dimensional) graph grammar induction into one of (1-dimensional) string induction. Based on the induced pattern, matching instances are recognized in Web pages through a graph parsing process. We have evaluated the framework on twenty-one e-commerce Web sites, and the results are promising, with a high F1-score.
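
For readers unfamiliar with the metric, the F1-score mentioned in the abstract is the harmonic mean of precision and recall. The following is a minimal sketch of how it could be computed for a record-extraction task. The function name `precision_recall_f1` and the record identifiers are hypothetical and do not come from the paper, whose evaluation procedure is not described on this page.

```python
def precision_recall_f1(extracted, expected):
    """Return (precision, recall, F1) for extracted vs. expected records.

    Both arguments are collections of hashable record identifiers.
    This is an illustrative helper, not the paper's evaluation code.
    """
    extracted, expected = set(extracted), set(expected)
    true_positives = len(extracted & expected)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0

    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: four true records on a page, four extracted,
# one of which (an ad banner) is a false positive.
expected = {"item-1", "item-2", "item-3", "item-4"}
extracted = {"item-1", "item-2", "item-3", "ad-banner"}

p, r, f1 = precision_recall_f1(extracted, expected)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Because F1 is a harmonic mean, it is high only when both precision and recall are high, which is why it is a common single-number summary for extraction quality.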

Published In

Information Sciences: an International Journal Volume 328, Issue C

January 2016

595 pages

ISSN:0020-0255

Copyright © Elsevier Inc.

Publisher

Elsevier Science Inc.

United States

Cited By

Liu Y, Yang F, Liu J (2023) Graph Grammar Formalism with Multigranularity for Spatial Graphs, Journal of Logic, Language and Information 32:5 (809-827), doi:10.1007/s10849-023-09406-0. Online publication date: 1-Dec-2023
https://dl.acm.org/doi/10.1007/s10849-023-09406-0
Zou Y, Zeng X, Zhu Y (2022) A general parsing algorithm with context matching for context-sensitive graph grammars, Multimedia Tools and Applications 81:1 (273-297), doi:10.1007/s11042-021-11076-8. Online publication date: 1-Jan-2022
https://dl.acm.org/doi/10.1007/s11042-021-11076-8
Wang X, Liu Y, Li J, Zhang K, Kerren A, Klein K, Li Y (2018) Generating Tractable Designs by Transforming Shape Grammars to Graph Grammars, Proceedings of the 11th International Symposium on Visual Information Communication and Interaction (41-48), doi:10.1145/3231622.3231637. Online publication date: 13-Aug-2018
https://dl.acm.org/doi/10.1145/3231622.3231637
Liu Y, Zeng X, Zhang K, Balakrishnan P, McMahan R (2018) Quantitative spatial semantics in a graph grammar formalism, Proceedings of the 3rd International Workshop on Interactive and Spatial Computing (1-7), doi:10.1145/3191801.3191803. Online publication date: 12-Apr-2018
https://dl.acm.org/doi/10.1145/3191801.3191803
