Identifying Products in Online Cybercrime Marketplaces: A Dataset for Fine-grained Domain Adaptation

Durrett, Greg; Kummerfeld, Jonathan K.; Berg-Kirkpatrick, Taylor; Portnoff, Rebecca S.; Afroz, Sadia; McCoy, Damon; Levchenko, Kirill; Paxson, Vern

doi:10.18653/v1/D17-1275

arXiv:1708.09609 (cs)

[Submitted on 31 Aug 2017]

Abstract: One weakness of machine-learned NLP models is that they typically perform poorly on out-of-domain data. In this work, we study the task of identifying products being bought and sold in online cybercrime forums, which exhibits particularly challenging cross-domain effects. We formulate a task that represents a hybrid of slot-filling information extraction and named entity recognition and annotate data from four different forums. Each of these forums constitutes its own "fine-grained domain" in that the forums cover different market sectors with different properties, even though all forums are in the broad domain of cybercrime. We characterize these domain differences in the context of a learning-based system: supervised models see decreased accuracy when applied to new forums, and standard techniques for semi-supervised learning and domain adaptation have limited effectiveness on this data, which suggests the need to improve these techniques. We release a dataset of 1,938 annotated posts from across the four forums.
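
The cross-forum comparison described in the abstract (train on one forum, evaluate on each forum) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the forum names, the toy posts, the lexicon-memorizing tagger, and the function names are hypothetical and are not the paper's data, models, or results. For brevity the in-forum cells reuse the training posts, whereas a real experiment would use held-out splits.

```python
from collections import Counter

# Hypothetical toy posts: (forum, tokens, labels), where label 1 marks a
# product token. Forum names and text are invented for illustration only.
POSTS = [
    ("forum_a", ["selling", "fresh", "accounts"], [0, 0, 1]),
    ("forum_a", ["buying", "accounts", "cheap"], [0, 1, 0]),
    ("forum_b", ["wts", "crypter", "fud"], [0, 1, 0]),
    ("forum_b", ["need", "crypter", "asap"], [0, 1, 0]),
]


def by_forum(posts, forum):
    """Select the posts that belong to one forum."""
    return [post for post in posts if post[0] == forum]


def train_lexicon(posts):
    """Toy tagger: memorize tokens labeled as products more often than not."""
    counts = Counter()
    for _, tokens, labels in posts:
        for token, label in zip(tokens, labels):
            counts[token] += 1 if label else -1
    return {token for token, count in counts.items() if count > 0}


def token_f1(lexicon, posts):
    """Token-level F1 of the lexicon tagger against the gold labels."""
    tp = fp = fn = 0
    for _, tokens, labels in posts:
        for token, label in zip(tokens, labels):
            predicted = token in lexicon
            if predicted and label:
                tp += 1
            elif predicted:
                fp += 1
            elif label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cross_forum_matrix(posts):
    """Train on each source forum and score on every target forum."""
    forums = sorted({post[0] for post in posts})
    return {
        (source, target): token_f1(
            train_lexicon(by_forum(posts, source)), by_forum(posts, target)
        )
        for source in forums
        for target in forums
    }


if __name__ == "__main__":
    for (source, target), score in cross_forum_matrix(POSTS).items():
        print(f"train={source} test={target} F1={score:.2f}")
```

On this toy data the memorized vocabulary from one forum does not cover the other forum's products, so the off-diagonal cells score zero. This is only a caricature of the domain gap; the numbers say nothing about the released dataset.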

Comments:	To appear at EMNLP 2017
Subjects:	Computation and Language (cs.CL)
ACM classes:	I.2.7
Cite as:	arXiv:1708.09609 [cs.CL]
	(or arXiv:1708.09609v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.1708.09609
Journal reference:	EMNLP (2017) 2598-2607
Related DOI:	https://doi.org/10.18653/v1/D17-1275

Submission history

From: Jonathan K Kummerfeld [view email]
[v1] Thu, 31 Aug 2017 08:18:12 UTC (65 KB)
