Sarcasm Detection on Czech and English Twitter


Tomáš Hercig and Ivan Habernal
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers (2014)

Research topics:

Sentiment Analysis

Abstract

This paper presents a machine learning approach to sarcasm detection on Twitter in two languages – English and Czech. Although there has been some research in sarcasm detection in languages other than English (e.g., Dutch, Italian, and Brazilian Portuguese), our work is the first attempt at sarcasm detection in the Czech language. We created a large Czech Twitter corpus consisting of 7,000 manually-labeled tweets and provide it to the community. We evaluate two classifiers with various combinations of features on both the Czech and English datasets. Furthermore, we tackle the issues of rich Czech morphology by examining different preprocessing techniques. Experiments show that our language-independent approach significantly outperforms adapted state-of-the-art methods in English (F-measure 0.947) and also represents a strong baseline for further research in Czech (F-measure 0.582).

BibTeX

@InProceedings{ptavcek-habernal-hong:2014:Coling,
  author    = {Pt\'{a}\v{c}ek, Tom\'{a}\v{s} and Habernal, Ivan and Hong, Jun},
  title     = {Sarcasm Detection on Czech and English Twitter},
  booktitle = {Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers},
  month     = {August},
  year      = {2014},
  address   = {Dublin, Ireland},
  publisher = {Dublin City University and Association for Computational Linguistics},
  pages     = {213--223},
  url       = {http://www.aclweb.org/anthology/C14-1022},
  abstract  = {This paper presents a machine learning approach to sarcasm detection on Twitter in two languages -- English and Czech. Although there has been some research in sarcasm detection in languages other than English (e.g., Dutch, Italian, and Brazilian Portuguese), our work is the first attempt at sarcasm detection in the Czech language. We created a large Czech Twitter corpus consisting of 7,000 manually-labeled tweets and provide it to the community. We evaluate two classifiers with various combinations of features on both the Czech and English datasets. Furthermore, we tackle the issues of rich Czech morphology by examining different preprocessing techniques. Experiments show that our language-independent approach significantly outperforms adapted state-of-the-art methods in English (F-measure 0.947) and also represents a strong baseline for further research in Czech (F-measure 0.582).}
}