CONDA: a CONtextual Dual-Annotated dataset for in-game toxicity understanding and detection

Weld, Henry; Huang, Guanghao; Lee, Jean; Zhang, Tongshu; Wang, Kunze; Guo, Xinghong; Long, Siqu; Poon, Josiah; Han, Soyeon Caren

arXiv:2106.06213 (cs)

[Submitted on 11 Jun 2021 (v1), last revised 23 Jul 2021 (this version, v2)]

Title: CONDA: a CONtextual Dual-Annotated dataset for in-game toxicity understanding and detection

Authors: Henry Weld, Guanghao Huang, Jean Lee, Tongshu Zhang, Kunze Wang, Xinghong Guo, Siqu Long, Josiah Poon, Soyeon Caren Han

Abstract: Traditional toxicity detection models have focused on the single utterance level without deeper understanding of context. We introduce CONDA, a new dataset for in-game toxic language detection enabling joint intent classification and slot filling analysis, which is the core task of Natural Language Understanding (NLU). The dataset consists of 45K utterances from 12K conversations from the chat logs of 1.9K completed Dota 2 matches. We propose a robust dual semantic-level toxicity framework, which handles utterance and token-level patterns, and rich contextual chatting history. Accompanying the dataset is a thorough in-game toxicity analysis, which provides comprehensive understanding of context at utterance, token, and dual levels. Inspired by NLU, we also apply its metrics to the toxicity detection tasks for assessing toxicity and game-specific aspects. We evaluate strong NLU models on CONDA, providing fine-grained results for different intent classes and slot classes. Furthermore, we examine the coverage of toxicity nature in our dataset by comparing it with other toxicity datasets.
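To make the dual annotation concrete, the following is a minimal Python sketch of an utterance carrying one utterance-level (intent) label plus one token-level (slot) label per token, together with two metrics commonly used in NLU evaluation: intent accuracy and joint accuracy, where a prediction counts only if the intent and every slot match. The class name, function names, label strings, and example utterances are all hypothetical and chosen for illustration; they are not the label set, data, or evaluation code of CONDA.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnnotatedUtterance:
    """One chat utterance with dual-level labels (hypothetical schema)."""
    tokens: List[str]
    slots: List[str]  # one token-level label per token
    intent: str       # one utterance-level label

    def __post_init__(self) -> None:
        if len(self.tokens) != len(self.slots):
            raise ValueError("each token needs exactly one slot label")


def intent_accuracy(gold: List[AnnotatedUtterance],
                    pred: List[AnnotatedUtterance]) -> float:
    """Fraction of utterances whose intent label is predicted correctly."""
    return sum(g.intent == p.intent for g, p in zip(gold, pred)) / len(gold)


def joint_accuracy(gold: List[AnnotatedUtterance],
                   pred: List[AnnotatedUtterance]) -> float:
    """Fraction of utterances where the intent AND all slots are correct."""
    hits = sum(g.intent == p.intent and g.slots == p.slots
               for g, p in zip(gold, pred))
    return hits / len(gold)


# Illustrative labels only; not the dataset's actual classes.
gold = [
    AnnotatedUtterance(["gg", "wp"], ["SLANG", "SLANG"], "OTHER"),
    AnnotatedUtterance(["you", "are", "trash"],
                       ["PRONOUN", "OTHER", "TOXIC"], "TOXIC"),
]
pred = [
    AnnotatedUtterance(["gg", "wp"], ["SLANG", "SLANG"], "OTHER"),
    AnnotatedUtterance(["you", "are", "trash"],
                       ["PRONOUN", "OTHER", "OTHER"], "TOXIC"),
]

print(intent_accuracy(gold, pred))  # both intents match
print(joint_accuracy(gold, pred))   # second utterance has one wrong slot
```

In this toy example the second prediction has the correct intent but mislabels one token, so it counts toward intent accuracy and not toward joint accuracy. This is the kind of gap a dual-level annotation can expose.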

Comments:	Accepted by ACL-IJCNLP 2021
Subjects:	Computation and Language (cs.CL)
ACM classes:	I.2.7
Cite as:	arXiv:2106.06213 [cs.CL]
	(or arXiv:2106.06213v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2106.06213

Submission history

From: Henry Weld
[v1] Fri, 11 Jun 2021 07:42:12 UTC (935 KB)
[v2] Fri, 23 Jul 2021 03:29:12 UTC (937 KB)