Sarcasm SIGN: Sarcasm interpretation corpus

We are happy to announce the release of the Sarcasm SIGN corpus: a parallel corpus of sarcastic tweets and their non-sarcastic interpretations, as created by human experts (3,000 tweets marked by their authors with the hashtag #sarcasm, with 5 human translations per tweet). The corpus was created as part of the paper “Sarcasm SIGN: Interpreting Sarcasm with Sentiment Based Monolingual Machine Translation” by Lotem Peled and Roi Reichart, ACL 2017. The corpus and the project details can be found at: Sarcasm SIGN dataset, a parallel corpus of sarcastic tweets and their non-sarcastic…

ELRA resources – 15 new corpora (written) & 7 updated corpora

We are happy to announce that a new set of 15 written corpora is now available in our catalogue. Arabic-English, Arabic-French, Chinese-English and Chinese-French Written Parallel Corpora: this set of 15 written corpora was produced by ELDA within PEA TRAD, a project supported by the French Ministry of Defence (DGA). Available resources are listed below. ELRA-W0098 TRAD Arabic-French Newspaper Parallel corpus – Test set 1 – ISLRN: 922-732-502-473-8. This is a parallel corpus of 10,000 words in Arabic and 4 reference translations in French. The…

[Resource] Corpus ‘Australia 2015/2016’

The corpus ‘Australia 2015/2016’ includes all articles from major Australian newspapers published from August 2015 to July 2016 that include the key term ‘Australia’ or ‘Australian(s)’ in the title. Altogether, the corpus contains over 7 million tokens in almost 13,000 articles from 18 newspapers. The corpus thus reflects one year of print media coverage of topics directly relevant to Australia. Both the corpus and a list of word frequencies derived from it are available for download.

Corpus of Historical American English

I am publishing here Mark Davies’ announcement of COHA: We are pleased to announce the release of the 400-million-word Corpus of Historical American English (1810-2009). The corpus has been funded by a generous grant from the US National Endowment for the Humanities, and it is freely available. COHA is the largest structured corpus of historical English, and it contains more than 100,000 texts from fiction, popular magazines, newspapers, and non-fiction books, with the same genre balance decade by decade from the 1810s to the 2000s.