A Multi-Dialect, Multi-Genre Corpus of Informal Written Arabic

Ryan Cotterell, Chris Callison-Burch


Abstract
This paper presents a multi-dialect, multi-genre, human-annotated corpus of dialectal Arabic. We collected utterances in five Arabic dialects: Levantine, Gulf, Egyptian, Iraqi and Maghrebi. We scraped newspaper websites for user commentary and Twitter for two distinct types of dialectal content. To the best of the authors’ knowledge, this is the most diverse corpus of dialectal Arabic in both the source of the content and the number of dialects. Every utterance in the corpus was annotated by humans on Amazon’s Mechanical Turk; this stands in contrast to Al-Sabbagh and Girju (2012), where only a small subset was human-annotated in order to train a classifier to automatically annotate the remainder of the corpus. We discuss the annotation methodology and the performance of the individual workers. We extend the Arabic dialect identification task to the Iraqi and Maghrebi dialects and improve on the results of Zaidan and Callison-Burch (2011a) for Levantine, Gulf and Egyptian.
Anthology ID:
L14-1510
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
241–245
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/641_Paper.pdf
Cite (ACL):
Ryan Cotterell and Chris Callison-Burch. 2014. A Multi-Dialect, Multi-Genre Corpus of Informal Written Arabic. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 241–245, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
A Multi-Dialect, Multi-Genre Corpus of Informal Written Arabic (Cotterell & Callison-Burch, LREC 2014)
PDF:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/641_Paper.pdf