DOI: 10.1145/2736277.2741083
research-article

Weakly Supervised Extraction of Computer Security Events from Twitter

Published: 18 May 2015

Abstract

Twitter contains a wealth of timely information; however, staying on top of breaking events requires that an information analyst constantly scan many sources, leading to information overload. For example, a user might wish to be made aware whenever an infectious disease outbreak takes place, when a new smartphone is announced, or when a distributed Denial of Service (DoS) attack might affect an organization's network connectivity. There are many possible event categories an analyst may wish to track, making it impossible to anticipate all those of interest in advance. We therefore propose a weakly supervised approach in which extractors for new categories of events are easy to define and train by specifying a small number of seed examples. We cast seed-based event extraction as a learning problem where only positive and unlabeled data are available. Rather than assuming unlabeled instances are negative, as is common in previous work, we propose a learning objective which regularizes the label distribution towards a user-provided expectation. In experiments on real-world data, our approach greatly outperforms the heuristic negatives used in most previous work. Significant performance gains are also demonstrated over two novel and competitive baselines: semi-supervised EM and one-class support-vector machines. We investigate three security-related events breaking on Twitter: DoS attacks, data breaches, and account hijacking. A demonstration of security events extracted by our system is available at: http://kb1.cse.ohio-state.edu:8123/events/hacked
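
The idea of regularizing the label distribution towards a user-provided expectation can be illustrated with a small sketch. The code below is not the paper's implementation: it assumes a plain logistic regression model, a KL-divergence penalty between the target positive rate and the mean predicted probability on the unlabeled data, and a hand-made toy dataset. The function names, hyperparameters (`lam_u`, `l2`, `lr`, `steps`), and data are all hypothetical choices made for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, x):
    """P(event | x) under a logistic regression model with weights w."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def train(positives, unlabeled, target, lam_u=10.0, l2=0.01, lr=0.1, steps=2000):
    """Gradient descent on
        mean_pos[-log p(x)] + lam_u * KL(target || q) + (l2 / 2) * |w|^2,
    where q is the mean predicted probability over the unlabeled examples.
    Unlabeled examples are never treated as negatives."""
    dim = len(positives[0])
    w = [0.0] * dim
    for _ in range(steps):
        grad = [l2 * wi for wi in w]
        # Seed (positive) examples: gradient of the mean negative log-likelihood.
        for x in positives:
            p = predict(w, x)
            for j in range(dim):
                grad[j] -= (1.0 - p) * x[j] / len(positives)
        # Unlabeled examples: only their average prediction is constrained.
        probs = [predict(w, x) for x in unlabeled]
        q = min(max(sum(probs) / len(probs), 1e-9), 1.0 - 1e-9)
        dkl_dq = -target / q + (1.0 - target) / (1.0 - q)
        for x, p in zip(unlabeled, probs):
            coef = lam_u * dkl_dq * p * (1.0 - p) / len(unlabeled)
            for j in range(dim):
                grad[j] += coef * x[j]
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


# Toy data (an assumption for illustration): each example is [bias, feature].
# Seeds have a high feature value; 4 of the 40 unlabeled examples resemble them.
positives = [[1.0, 1.5 + 0.1 * i] for i in range(10)]
unlabeled = ([[1.0, -3.0 + 0.05 * i] for i in range(36)]
             + [[1.0, 1.6 + 0.2 * i] for i in range(4)])

w = train(positives, unlabeled, target=0.1)
mean_p = sum(predict(w, x) for x in unlabeled) / len(unlabeled)
print("weights:", w)
print("mean predicted probability on unlabeled data:", round(mean_p, 3))
```

In this sketch the `target` argument plays the role of the user-provided expectation: the model is free to decide which unlabeled examples are positive, as long as the overall predicted positive rate stays close to that value.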

Published In

WWW '15: Proceedings of the 24th International Conference on World Wide Web
May 2015
1460 pages
ISBN: 9781450334693

Sponsors

  • IW3C2: International World Wide Web Conference Committee

Publisher

International World Wide Web Conferences Steering Committee

Republic and Canton of Geneva, Switzerland

Author Tags

  1. information extraction
  2. text mining

Qualifiers

  • Research-article

Funding Sources

  • Department of Defense
  • DARPA

Conference

WWW '15
Sponsor:
  • IW3C2

Acceptance Rates

WWW '15 Paper Acceptance Rate 131 of 929 submissions, 14%;
Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Article Metrics

  • Downloads (last 12 months): 34
  • Downloads (last 6 weeks): 4
Reflects downloads up to 16 Jan 2025.

Cited By

  • (2025) Leveraging Social Networks for Cyber Threat Intelligence: Analyzing Attack Trends and TTPs in the Arab World. IEEE Access 13 (5679-5693). DOI: 10.1109/ACCESS.2024.3508025. Online publication date: 2025
  • (2025) Evaluation of LLM-based chatbots for OSINT-based Cyber Threat Awareness. Expert Systems with Applications 261 (125509). DOI: 10.1016/j.eswa.2024.125509. Online publication date: Feb-2025
  • (2024) PU-KBS: A Robust Positive and Unlabeled Learning Framework With Key Band Selection for One-Class Hyperspectral Image Classification. IEEE Transactions on Geoscience and Remote Sensing 62 (1-15). DOI: 10.1109/TGRS.2024.3397989. Online publication date: 2024
  • (2023) Detection of Inappropriate Tweets Linked to Fake Accounts on Twitter. Applied Sciences 13:5 (3013). DOI: 10.3390/app13053013. Online publication date: 26-Feb-2023
  • (2023) A Potent Technique for Identifying Fake Accounts on Social Platforms. International Journal of Scientific Research in Computer Science, Engineering and Information Technology (308-324). DOI: 10.32628/CSEIT2390425. Online publication date: 1-Aug-2023
  • (2023) STRisk: A Socio-Technical Approach to Assess Hacking Breaches Risk. IEEE Transactions on Dependable and Secure Computing 20:2 (1074-1087). DOI: 10.1109/TDSC.2022.3149208. Online publication date: 1-Mar-2023
  • (2023) Automated Emerging Cyber Threat Identification and Profiling Based on Natural Language Processing. IEEE Access 11 (58915-58936). DOI: 10.1109/ACCESS.2023.3260020. Online publication date: 2023
  • (2023) Trigger-free cybersecurity event detection based on contrastive learning. The Journal of Supercomputing 79:18 (20984-21007). DOI: 10.1007/s11227-023-05454-2. Online publication date: 18-Jun-2023
  • (2023) ATDG: An Automatic Cyber Threat Intelligence Extraction Model of DPCNN and BIGRU Combined with Attention Mechanism. Web Information Systems Engineering – WISE 2023 (189-204). DOI: 10.1007/978-981-99-7254-8_15. Online publication date: 21-Oct-2023
  • (2023) The Advancement of Knowledge Graphs in Cybersecurity: A Comprehensive Overview. Computational and Experimental Simulations in Engineering (65-103). DOI: 10.1007/978-3-031-42987-3_6. Online publication date: 1-Dec-2023
