research-article

Specification and discovery of web patterns

Published: 20 January 2016

Abstract

Finding useful information on the Web becomes increasingly difficult as the volume of Web data grows rapidly. To facilitate effective browsing, Web designers usually display the same type of information with a consistent layout (referred to as a Web pattern). Discovering Web patterns can benefit many applications, such as structured data extraction. This paper presents a generic framework, based on graph grammars, for discovering Web patterns and recognizing their instances (i.e., structured data). In our framework, a Web pattern is visually yet formally specified as a graph grammar, which is automatically induced by a grammar induction engine. The engine works by converting the problem of (2-dimensional) graph grammar induction into (1-dimensional) string induction. Based on the induced pattern, matching instances are recognized in Web pages through a graph parsing process. We have evaluated the framework on twenty-one e-commerce Web sites. The evaluation results are promising, with a high F1-score.
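
As a rough illustration of what reducing pattern discovery to a 1-dimensional string problem can look like, the sketch below assumes the visual blocks of a page have already been linearized into a sequence of type symbols (for example in reading order) and searches that sequence for its longest back-to-back repetition. It is not the paper's induction engine: the function name `find_repeating_unit`, the symbol names, and the tandem-repeat heuristic are all assumptions made for this example only.

```python
from typing import List, Optional, Tuple


def find_repeating_unit(
    symbols: List[str], min_repeats: int = 2
) -> Optional[Tuple[int, List[str], int]]:
    """Return (start, unit, repeats) for the tandem repeat that covers the
    most symbols, or None if nothing repeats at least `min_repeats` times.

    Hypothetical helper for illustration; not taken from the paper.
    """
    best = None  # (covered_symbols, start, unit, repeats)
    n = len(symbols)
    for start in range(n):
        # A unit of length `size` must fit `min_repeats` times after `start`.
        for size in range(1, (n - start) // min_repeats + 1):
            unit = symbols[start:start + size]
            repeats = 1
            pos = start + size
            # Count how many consecutive copies of `unit` follow.
            while symbols[pos:pos + size] == unit:
                repeats += 1
                pos += size
            if repeats >= min_repeats:
                covered = repeats * size
                if best is None or covered > best[0]:
                    best = (covered, start, unit, repeats)
    if best is None:
        return None
    _, start, unit, repeats = best
    return start, unit, repeats


# Assumed linearization of a product listing page into block-type symbols.
page = ["nav", "img", "title", "price",
        "img", "title", "price",
        "img", "title", "price", "footer"]
result = find_repeating_unit(page)
print(result)
```

Here the repeated unit plays the role of a candidate pattern and each repetition is one instance (one product record). A real system would still need the spatial relations between blocks, which a flat string discards; the paper handles these through graph grammars.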

Published In

Information Sciences: an International Journal  Volume 328, Issue C
January 2016
595 pages

Publisher

Elsevier Science Inc.

United States

Author Tags

  1. Graph grammar induction
  2. Spatial graph grammar
  3. Web patterns
