Leonhardt J, Anand A and Khosla M. Boilerplate Removal using a Neural Sequence Labeling Model. Companion Proceedings of the Web Conference 2020. (226-229).
Uzun E. A Novel Web Scraping Approach Using the Additional Information Obtained From Web Pages. IEEE Access. 10.1109/ACCESS.2020.2984503. 8. (61726-61740).
Jiang Z, Yin H, Wu Y, Lyu Y, Min G and Zhang X.
(2019). Constructing Novel Block Layouts for Webpage Analysis. ACM Transactions on Internet Technology. 19:3. (1-18). Online publication date: 31-Aug-2019.
Alarte J, Silva J and Tamarit S.
(2019). What Web Template Extractor Should I Use? A Benchmarking and Comparison for Five Template Extractors. ACM Transactions on the Web. 13:2. (1-19). Online publication date: 31-May-2019.
Chen Y and Yao Z.
(2019). Multi-layer Filtering Webpage Classification Method Based on SVM. Human Centered Computing. 10.1007/978-3-030-37429-7_56. (554-559).
Vogels T, Ganea O and Eickhoff C.
(2018). Web2Text: Deep Structured Boilerplate Removal. Advances in Information Retrieval. 10.1007/978-3-319-76941-7_13. (167-179).
Uçar E, Uzun E and Tüfekci P.
(2017). A novel algorithm for extracting the user reviews from web pages. Journal of Information Science. 43:5. (696-712). Online publication date: 1-Oct-2017.
Omari A, Kimelfeld B, Yahav E and Shoham S. Lossless Separation of Web Pages into Layout Code and Data. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (1805-1814).
Yuan P, Li Y, Jin H and Liu L. Self-Adaptive Extracting Academic Entities from World Wide Web. Proceedings of the 2015 IEEE Conference on Collaboration and Internet Computing (CIC). (270-277).
Madaan A and Chu W.
(2015). In-depth querying of web-based medical documents. International Journal of Computational Science and Engineering. 11:3. (284-296). Online publication date: 1-Oct-2015.
AL-Ghuribi S and Alshomrani S.
(2015). Bi-languages Mining Algorithm for Extraction Useful Web Contents (BiLEx). Arabian Journal for Science and Engineering. 10.1007/s13369-014-1530-8. 40:2. (501-518). Online publication date: 1-Feb-2015.
Freire de Amorim E. HTML Segmentation for Different Types of Web Pages. The Evolution of the Internet in the Business Sector. 10.4018/978-1-4666-7262-8.ch005. (98-119).
Wang J, Wu J, Zhang Y and He G. Content Information Extraction of Theme Web Pages Based on Tag Information. Proceedings of the 2014 Seventh International Symposium on Computational Intelligence and Design - Volume 01. (501-504).
Uzun E, Serdar Güner E, Kılıçaslan Y, Yerlikaya T and Agun H.
(2014). An effective and efficient Web content extractor for optimizing the crawling process. Software—Practice & Experience. 44:10. (1181-1199). Online publication date: 1-Oct-2014.
Soska K and Christin N. Automatically detecting vulnerable websites before they turn malicious. Proceedings of the 23rd USENIX conference on Security Symposium. (625-640).
Kurmi R and Jain P.
(2014). Text summarization using enhanced MMR technique. 2014 International Conference on Computer Communication and Informatics (ICCCI). 10.1109/ICCCI.2014.6921769. 978-1-4799-2352-6. (1-5).
Gao B and Fan Q.
(2014). Multiple Template Detection Based on Segments. Advances in Data Mining. Applications and Theoretical Aspects. 10.1007/978-3-319-08976-8_3. (24-38).
Fan Q, Yan C, Huang L and Huang L.
(2014). Discovering Informative Contents of Web Pages. Web-Age Information Management. 10.1007/978-3-319-08010-9_20. (180-191).
Hachenberg C and Gottron T. Locality sensitive hashing for scalable structural classification and clustering of web documents. Proceedings of the 22nd ACM international conference on Information & Knowledge Management. (359-368).
Schäfer R and Bildhauer F.
(2013). Web Corpus Construction. Synthesis Lectures on Human Language Technologies. 10.2200/S00508ED1V01Y201305HLT022. 6:4. (1-145). Online publication date: 19-Jul-2013.
Uzun E, Agun H and Yerlikaya T.
(2013). A hybrid approach for extracting informative content from web pages. Information Processing and Management: an International Journal. 49:4. (928-944). Online publication date: 1-Jul-2013.
Geraci F and Maggini M.
(2013). A Fast Method for Web Template Extraction via a Multi-sequence Alignment Approach. Knowledge Discovery, Knowledge Engineering and Knowledge Management. 10.1007/978-3-642-37186-8_11. (172-184).
Hu F, Li M, Zhang Y, Peng T and Lei Y.
(2013). A Non-Template Approach to Purify Web Pages Based on Word Density. Proceedings of the International Conference on Information Engineering and Applications (IEA) 2012. 10.1007/978-1-4471-4847-0_27. (221-228).
Ly P, Pedrinaci C and Domingue J. Automated information extraction from web APIs documentation. Proceedings of the 13th international conference on Web Information Systems Engineering. (497-511).
Pappas N, Katsimpras G and Stamatatos E. Extracting informative textual parts from web pages containing user-generated content. Proceedings of the 12th International Conference on Knowledge Management and Knowledge Technologies. (1-8).
Uzun E, Agun H and Yerlikaya T.
(2012). Web content extraction by using decision tree learning. 2012 20th Signal Processing and Communications Applications Conference (SIU). 10.1109/SIU.2012.6204476. 978-1-4673-0056-8. (1-4).
D’souza R, Kulkarni A and Mirza I.
(2012). Automatic Link Generation for Search Engine Optimization. International Journal of Information and Education Technology. 10.7763/IJIET.2012.V2.163. (401-403).
Mukund S, Indurkhya N and Sundaresan N. Segmenting eBay item descriptions into coherent sections. Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data. (1-8).
Seo J, Diaz F, Gabrilovich E, Josifovski V and Pang B. Generalized link suggestions via web site clustering. Proceedings of the 20th international conference on World wide web. (77-86).
Kang J, Yang J and Choi J.
(2010). Repetition-based web page segmentation by detecting tag patterns for small-screen devices. IEEE Transactions on Consumer Electronics. 56:2. (980-986). Online publication date: 1-May-2010.
Kohlschütter C, Fankhauser P and Nejdl W. Boilerplate detection using shallow text features. Proceedings of the third ACM international conference on Web search and data mining. (441-450).
Tsuruta M and Masuyama S.
(2010). An Extraction Method of an Informative DOM Node from a Web Page by Using Layout Information. Transactions of the Japanese Society for Artificial Intelligence. 10.1527/tjsai.25.742. 25. (742-756).
Guo W, Kim Y and Kang B.
(2010). Webpage Segments Classification with Incremental Knowledge Acquisition. U- and E-Service, Science and Technology. 10.1007/978-3-642-17644-9_9. (79-87).
Román P, Dell R and Velásquez J.
(2010). Advanced Techniques in Web Data Pre-processing and Cleaning. Advanced Techniques in Web Intelligence - I. 10.1007/978-3-642-14461-5_2. (19-48).
Vineel G. Web page DOM node characterization and its application to page segmentation. Proceedings of the 3rd IEEE international conference on Internet multimedia services architecture and applications. (325-330).
Vineel G.
(2009). Web page DOM node characterization and its application to page segmentation. 2009 3rd International Conference on Internet Multimedia Services Architecture and Application (IMSAA). 10.1109/IMSAA.2009.5439444. 978-1-4244-4792-3. (1-6).
Kohlschütter C and Nejdl W. A densitometric approach to web page segmentation. Proceedings of the 17th ACM conference on Information and knowledge management. (1173-1182).
Wang Y, Fang B, Cheng X, Guo L and Xu H.
(2008). Incremental Web Page Template Detection by Text Segments. 2008 IEEE International Workshop on Semantic Computing and Systems (WSCS). 10.1109/WSCS.2008.17. (174-180).
Lidong Bing, Yexin Wang, Yan Zhang and Hui Wang.
(2008). Primary content extraction with Mountain Model. 2008 8th IEEE International Conference on Computer and Information Technology (CIT). 10.1109/CIT.2008.4594722. 978-1-4244-2357-6. (479-484).
Wang Y, Fang B, Cheng X, Guo L and Xu H. Incremental web page template detection. Proceedings of the 17th international conference on World Wide Web. (1247-1248).
Chakrabarti D, Kumar R and Punera K. A graph-theoretic approach to webpage segmentation. Proceedings of the 17th international conference on World Wide Web. (377-386).
Punera K and Ghosh J. Enhanced hierarchical classification via isotonic smoothing. Proceedings of the 17th international conference on World Wide Web. (151-160).
Gottron T. Clustering template based web documents. Proceedings of the IR research, 30th European conference on Advances in information retrieval. (40-51).
Urvoy T, Chauveau E, Filoche P and Lavergne T.
(2008). Tracking Web spam with HTML style similarities. ACM Transactions on the Web. 2:1. (1-28). Online publication date: 1-Feb-2008.