research-article

iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling

Published: 21 June 2015

Abstract

Researchers in the Digital Humanities and journalists need to monitor, collect and analyze fresh online content on current events, such as the Ebola outbreak or the Ukraine crisis, on demand. However, existing focused crawling approaches consider only topical aspects and ignore temporal ones, and therefore cannot produce Web collections that are both thematically coherent and fresh. Social Media in particular provide a rich source of fresh content that state-of-the-art focused crawlers do not use. In this paper we address the problem of collecting fresh and relevant Web and Social Web content for a topic of interest by seamlessly integrating Web and Social Media crawling in a novel focused crawler. The crawler collects Web and Social Media content in a single system and exploits the stream of fresh Social Media content to guide the crawl.
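One way to picture how freshness can guide a focused crawler is a frontier queue whose priority mixes topical relevance with recency. The sketch below is illustrative only and is not the iCrawl implementation: the `Frontier` class, the exponential-decay `freshness` function, the 24-hour half-life and the `alpha` weight are all assumptions introduced here. URLs found in a Social Media stream would enter the queue with the age of the post that mentioned them.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedUrl:
    # heapq is a min-heap, so the priority is stored negated.
    neg_priority: float
    url: str = field(compare=False)


def freshness(age_hours: float, half_life_hours: float = 24.0) -> float:
    """Assumed decay model: 1.0 for new content, 0.5 after one half-life."""
    return 0.5 ** (max(age_hours, 0.0) / half_life_hours)


def priority(relevance: float, age_hours: float, alpha: float = 0.7) -> float:
    """Hypothetical linear mix of relevance and freshness, both in [0, 1]."""
    return alpha * relevance + (1.0 - alpha) * freshness(age_hours)


class Frontier:
    """Crawl frontier that hands out the highest-priority unseen URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def add(self, url: str, relevance: float, age_hours: float) -> None:
        if url in self._seen:  # never queue the same URL twice
            return
        self._seen.add(url)
        heapq.heappush(self._heap, QueuedUrl(-priority(relevance, age_hours), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap).url

    def __len__(self) -> int:
        return len(self._heap)


# Example: a link shared an hour ago outranks an older, slightly more relevant page.
frontier = Frontier()
frontier.add("http://example.org/old-background", relevance=0.9, age_hours=240.0)
frontier.add("http://example.org/tweeted-today", relevance=0.7, age_hours=1.0)
frontier.add("http://example.org/off-topic", relevance=0.1, age_hours=1.0)
order = [frontier.pop() for _ in range(len(frontier))]
```

With these assumed weights, the recently shared link is fetched first, the older but highly relevant page second, and the off-topic page last. Raising `alpha` shifts the balance toward topical relevance; lowering it favors fresh content.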

Published In

JCDL '15: Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries
June 2015
324 pages
ISBN: 9781450335942
DOI: 10.1145/2756406
  • General Chairs: Paul Logasa Bogen, Suzie Allard, Holly Mercer, Micah Beck
  • Program Chairs: Sally Jo Cunningham, Dion Goh, Geneva Henry

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. focused crawling
  2. social media
  3. web archives
  4. web crawling

Conference

JCDL '15: 15th ACM/IEEE-CS Joint Conference on Digital Libraries
June 21-25, 2015
Knoxville, Tennessee, USA

Acceptance Rates

JCDL '15 paper acceptance rate: 18 of 60 submissions (30%)
Overall acceptance rate: 415 of 1,482 submissions (28%)
