We are happy to announce the release of the Sarcasm SIGN corpus: a parallel corpus of sarcastic tweets and their non-sarcastic interpretations, as created by human experts (3000 tweets annotated by their authors with the hashtag #sarcasm, 5 human translations per tweet). The corpus was created as part of the paper: Sarcasm SIGN: Interpreting Sarcasm with Sentiment Based Monolingual Machine Translation, Lotem Peled and Roi Reichart, ACL 2017 (https://arxiv.org/pdf/1704.06836.pdf). The corpus and the project details can be found at: https://github.com/lotemp/SarcasmSIGN
Category: Monolingual
ELRA resources – 15 new corpora (written) & 7 updated corpora
We are happy to announce that a new set of 15 Written Corpora is now available in our catalogue. Arabic-English, Arabic-French, Chinese-English and Chinese-French Written Parallel Corpora: This set of 15 written corpora was produced by ELDA within PEA TRAD, a project supported by the French Ministry of Defence (DGA). Available resources are listed below.
ELRA-W0098 TRAD Arabic-French Newspaper Parallel corpus – Test set 1 – ISLRN: 922-732-502-473-8
This is a parallel corpus of 10,000 words in Arabic and 4 reference translations in French. The…
[Resource] Corpus ‘Australia 2015/2016’
The corpus ‘Australia 2015/2016’ includes all articles from major Australian newspapers published from August 2015 to July 2016 that include the key term ‘Australia’ or ‘Australian(s)’ in the title. Altogether, the corpus contains over 7 million tokens in almost 13,000 articles from 18 newspapers. The corpus thus reflects one year of print media coverage of topics directly relevant to Australia.
Corpus of Historical American English
Below is Mark Davies’ announcement of COHA: We are pleased to announce the release of the 400-million-word Corpus of Historical American English (1810-2009). The corpus has been funded by a generous grant from the US National Endowment for the Humanities, and it is freely available at http://corpus.byu.edu/coha/. COHA is the largest structured corpus of historical English, and it contains more than 100,000 texts from fiction, popular magazines, newspapers, and non-fiction books, with the same genre balance decade by decade from the 1810s to the 2000s.