Abstract
This paper shows how visual information can be used to identify false positives among the entities returned by a state-of-the-art web information extraction algorithm, and hence to further improve extraction results. The proposed validation method is unsupervised and can be integrated into most web information extraction systems effortlessly, without any impact on existing processes, system robustness, or maintenance. Instead of relying on visual patterns, we focus on identifying visual outliers, i.e., entities that visually differ from the norm. In the context of web information extraction, we show that visual outliers tend to be erroneously extracted entities. To validate our method, we post-processed the entities obtained by Boilerpipe, which is known as the best overall main content extraction algorithm for web documents. We show that our validation method improves Boilerpipe’s initial precision by more than 10%, while the \(F_1\) score increases by at least 3% in all relevant cases.
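The general idea of flagging visual outliers can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify the outlier detector, so the frequency-based rarity score, the function name `visual_outliers`, the `threshold` parameter, and the sample style values below are all assumptions made for illustration. Each extracted entity is represented by a dictionary of computed style properties, and an entity is flagged when its values are, on average, rare among the extracted set.

```python
from collections import Counter


def visual_outliers(entities, threshold=0.5):
    """Return indices of entities whose style values are mostly rare.

    entities:  list of dicts mapping a CSS property name to its computed value.
    threshold: hypothetical cut-off on the mean rarity score (between 0 and 1).
    """
    if not entities:
        return []
    properties = sorted({prop for entity in entities for prop in entity})
    if not properties:
        return []
    # For every property, count how often each value occurs across entities.
    counts = {p: Counter(e.get(p) for e in entities) for p in properties}
    total = len(entities)
    flagged = []
    for index, entity in enumerate(entities):
        # Rarity of a value = 1 - its relative frequency among all entities.
        rarity = [1 - counts[p][entity.get(p)] / total for p in properties]
        score = sum(rarity) / len(rarity)
        if score > threshold:
            flagged.append(index)
    return flagged


# Illustrative data (assumed): four entities share one style, one differs.
body_style = {"font-size": "16px", "color": "rgb(0, 0, 0)", "font-weight": "400"}
odd_style = {"font-size": "11px", "color": "rgb(153, 153, 153)", "font-weight": "700"}
entities = [dict(body_style) for _ in range(4)] + [odd_style]

print(visual_outliers(entities))  # [4]
```

In this toy setting, the flagged entity would be treated as a likely false positive and removed from the extractor's output, which is how a post-processing step of this kind could raise precision without modifying the extractor itself.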
Notes
- 1.
In the literature, visual web information extraction may refer to the use of a graphical user interface (GUI) that allows the user to generate wrappers. This is not the intended meaning here as we refer to the visual formatting of documents.
- 2.
Retained properties are the following: background-color; border-bottom-color; border-bottom-style; border-bottom-width; border-left-color; border-left-style; border-left-width; border-right-color; border-right-style; border-right-width; border-top-color; border-top-left-radius; border-top-right-radius; border-top-style; border-top-width; color; font-size; font-style; font-weight; margin-bottom; margin-left; margin-right; margin-top; outline-color; padding-bottom; padding-left; padding-right; padding-top; position; text-align; text-decoration; visibility.
- 11.
Most developer tools included in browsers, such as Firebug for Firefox or Chrome DevTools, give access to the computed style properties of DOM nodes.
Acknowledgements
The authors gratefully acknowledge the financial support of the Natural Sciences and Engineering Research Council of Canada (NSERC).
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Potvin, B., Villemaire, R. (2018). When Different Is Wrong: Visual Unsupervised Validation for Web Information Extraction. In: Perner, P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2018. Lecture Notes in Computer Science, vol 10935. Springer, Cham. https://doi.org/10.1007/978-3-319-96133-0_10
DOI: https://doi.org/10.1007/978-3-319-96133-0_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-96132-3
Online ISBN: 978-3-319-96133-0
eBook Packages: Computer Science, Computer Science (R0)