research-article

Weakly Supervised Extraction of Computer Security Events from Twitter

Authors:

Tom Mitchell

WWW '15: Proceedings of the 24th International Conference on World Wide Web

Pages 896 - 905

https://doi.org/10.1145/2736277.2741083

Published: 18 May 2015

Abstract

Twitter contains a wealth of timely information; however, staying on top of breaking events requires that an information analyst constantly scan many sources, leading to information overload. For example, a user might wish to be made aware whenever an infectious disease outbreak takes place, when a new smartphone is announced, or when a distributed Denial of Service (DoS) attack might affect an organization's network connectivity. There are many possible event categories an analyst may wish to track, making it impossible to anticipate all those of interest in advance. We therefore propose a weakly supervised approach, in which extractors for new categories of events are easy to define and train by specifying a small number of seed examples. We cast seed-based event extraction as a learning problem where only positive and unlabeled data are available. Rather than assuming unlabeled instances are negative, as is common in previous work, we propose a learning objective that regularizes the label distribution towards a user-provided expectation. In experiments on real-world data, our approach greatly outperforms heuristic negatives, which are used in most previous work. Significant performance gains are also demonstrated over two novel and competitive baselines: semi-supervised EM and one-class support-vector machines. We investigate three security-related events breaking on Twitter: DoS attacks, data breaches, and account hijacking. A demonstration of security events extracted by our system is available at: http://kb1.cse.ohio-state.edu:8123/events/hacked
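The learning setup described in the abstract can be illustrated with a small sketch. It is not the paper's implementation: it assumes a plain logistic regression, a KL-divergence penalty that pulls the mean predicted positive rate on unlabeled data towards a user-provided target, and synthetic two-dimensional data in place of tweets. All function names, parameter names, and default values below are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def make_toy_data(seed=0):
    """Synthetic stand-in for tweets: 50 positive seeds, 1000 unlabeled points (10% positive)."""
    rng = np.random.default_rng(seed)
    X_pos = rng.normal(loc=2.0, scale=1.0, size=(50, 2))        # seed examples
    unl_pos = rng.normal(loc=2.0, scale=1.0, size=(100, 2))     # hidden positives
    unl_neg = rng.normal(loc=-2.0, scale=1.0, size=(900, 2))    # hidden negatives
    X_unl = np.vstack([unl_pos, unl_neg])
    y_unl = np.concatenate([np.ones(100), np.zeros(900)])       # used for evaluation only
    return X_pos, X_unl, y_unl


def train_pu_expectation_reg(X_pos, X_unl, target_pos_rate=0.1,
                             lam_u=1.0, lam_l2=0.01, lr=0.05, steps=3000):
    """Gradient descent on an assumed objective:

        -mean log p(y=1 | x_seed)
        + lam_u * KL(target_pos_rate || mean p(y=1 | x_unlabeled))
        + lam_l2 * ||w||^2

    Unlabeled points are never treated as negatives; they only enter
    through the expected label proportion.
    """
    w = np.zeros(X_pos.shape[1])
    b = 0.0
    eps = 1e-9
    t = target_pos_rate
    for _ in range(steps):
        p_pos = sigmoid(X_pos @ w + b)
        p_unl = sigmoid(X_unl @ w + b)
        # Gradient of the seed log-likelihood term with respect to the logits.
        g_pos = (p_pos - 1.0) / len(X_pos)
        # Mean predicted positive rate on the unlabeled pool.
        p_hat = np.clip(p_unl.mean(), eps, 1.0 - eps)
        # d KL(t || p_hat) / d p_hat, then chain rule through the sigmoid.
        dkl = -t / p_hat + (1.0 - t) / (1.0 - p_hat)
        g_unl = lam_u * dkl * p_unl * (1.0 - p_unl) / len(X_unl)
        w -= lr * (X_pos.T @ g_pos + X_unl.T @ g_unl + 2.0 * lam_l2 * w)
        b -= lr * (g_pos.sum() + g_unl.sum())
    return w, b


def predict_proba(X, w, b):
    return sigmoid(X @ w + b)


if __name__ == "__main__":
    X_pos, X_unl, y_unl = make_toy_data(seed=0)
    w, b = train_pu_expectation_reg(X_pos, X_unl, target_pos_rate=0.1)
    probs = predict_proba(X_unl, w, b)
    print("mean predicted positive rate:", probs.mean())
    print("accuracy on hidden labels:", ((probs > 0.5) == (y_unl == 1)).mean())
```

Without the expectation term, a model trained only on positive seeds has no reason not to label everything positive. The penalty on the label distribution supplies that missing constraint without committing to any individual unlabeled example being negative.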



Index Terms

Weakly Supervised Extraction of Computer Security Events from Twitter
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
2. Information systems
  1. Information systems applications
    1. Data mining



Information

Published In


WWW '15: Proceedings of the 24th International Conference on World Wide Web

May 2015

1460 pages

ISBN:9781450334693

General Chairs:
Aldo Gangemi, National Research Council, Italy & Paris 13 University-CNRS, France
Stefano Leonardi, Sapienza University of Rome, Italy
Alessandro Panconesi, Sapienza University of Rome, Italy

Copyright © 2015 Copyright is held by the International World Wide Web Conference Committee (IW3C2).

Sponsors

IW3C2: International World Wide Web Conference Committee

In-Cooperation

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

International World Wide Web Conferences Steering Committee

Republic and Canton of Geneva, Switzerland



Qualifiers

Research-article

Funding Sources

Department of Defense
DARPA

Conference

WWW '15

Sponsor:

IW3C2

WWW '15: 24th International World Wide Web Conference

May 18 - 22, 2015

Florence, Italy

Acceptance Rates

WWW '15 Paper Acceptance Rate: 131 of 929 submissions, 14%

Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%


Bibliometrics

Total Citations: 83
Total Downloads: 942
Downloads (Last 12 months): 34
Downloads (Last 6 weeks): 4

Reflects downloads up to 16 Jan 2025

Cited By

Lee S, Mujammami A, Kim K (2025). Leveraging Social Networks for Cyber Threat Intelligence: Analyzing Attack Trends and TTPs in the Arab World. IEEE Access 13, 5679-5693. Online publication date: 2025.
https://doi.org/10.1109/ACCESS.2024.3508025
Shafee S, Bessani A, Ferreira P (2025). Evaluation of LLM-based chatbots for OSINT-based Cyber Threat Awareness. Expert Systems with Applications 261, 125509. Online publication date: Feb-2025.
https://doi.org/10.1016/j.eswa.2024.125509
Liu Z, Zhao H, Wang X, Wang S, Li J, Zhong Y (2024). PU-KBS: A Robust Positive and Unlabeled Learning Framework With Key Band Selection for One-Class Hyperspectral Image Classification. IEEE Transactions on Geoscience and Remote Sensing 62, 1-15. Online publication date: 2024.
https://doi.org/10.1109/TGRS.2024.3397989
Alsubaei F (2023). Detection of Inappropriate Tweets Linked to Fake Accounts on Twitter. Applied Sciences 13:5, 3013. Online publication date: 26-Feb-2023.
https://doi.org/10.3390/app13053013
Kajal, Uttam Kumar Singh, Dr. Nikhat Akhtar, Satendra Kumar Vishwakarma, Niranjan Kumar, Dr. Yusuf Perwej (2023). A Potent Technique for Identifying Fake Accounts on Social Platforms. International Journal of Scientific Research in Computer Science, Engineering and Information Technology, 308-324. Online publication date: 1-Aug-2023.
https://doi.org/10.32628/CSEIT2390425
Hammouchi H, Nejjari N, Mezzour G, Ghogho M, Benbrahim H (2023). STRisk: A Socio-Technical Approach to Assess Hacking Breaches Risk. IEEE Transactions on Dependable and Secure Computing 20:2, 1074-1087. Online publication date: 1-Mar-2023.
https://doi.org/10.1109/TDSC.2022.3149208
Marinho R, Holanda R (2023). Automated Emerging Cyber Threat Identification and Profiling Based on Natural Language Processing. IEEE Access 11, 58915-58936. Online publication date: 2023.
https://doi.org/10.1109/ACCESS.2023.3260020
Tang M, Guo Y, Bai Q, Zhang H (2023). Trigger-free cybersecurity event detection based on contrastive learning. The Journal of Supercomputing 79:18, 20984-21007. Online publication date: 18-Jun-2023.
https://doi.org/10.1007/s11227-023-05454-2
Cui B, Li J, Hou W (2023). ATDG: An Automatic Cyber Threat Intelligence Extraction Model of DPCNN and BIGRU Combined with Attention Mechanism. Web Information Systems Engineering – WISE 2023, 189-204. Online publication date: 21-Oct-2023.
https://doi.org/10.1007/978-981-99-7254-8_15
Ma Y, Chen Y, Wang Y, Yu J, Li Y, Lu J, Wang Y (2023). The Advancement of Knowledge Graphs in Cybersecurity: A Comprehensive Overview. Computational and Experimental Simulations in Engineering, 65-103. Online publication date: 1-Dec-2023.
https://doi.org/10.1007/978-3-031-42987-3_6
