Czech news dataset for semantic textual similarity


Jakub Sido, Michal Seják, Ondřej Pražák, Miloslav Konopík, and Václav Moravec
Language Resources and Evaluation (2024)

Research topics:

Semantic Analysis

Abstract

This paper describes a novel dataset of sentences with two different semantic similarity annotations: with and without surrounding context. The data originate from the journalistic domain in the Czech language. The final dataset contains 138,556 human annotations divided into train and test sets. In total, 485 journalism students participated in the creation process. To increase the reliability of the test set, we compute the final annotations as an average of 9 individual annotation scores. We evaluate the dataset quality by measuring inter- and intra-annotator agreement. Besides agreement numbers, we provide detailed statistics of the collected dataset. We conclude our paper with a baseline experiment of building a system for predicting the semantic similarity of sentences. Due to the large number of training annotations (116,956), the model significantly outperforms an average annotator (Pearson's correlation coefficient of 0.92 versus 0.86).
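The abstract mentions two computations: averaging 9 individual scores into a final test annotation, and comparing predictions to those annotations with Pearson's correlation coefficient. The sketch below illustrates both on invented toy data. The function names, the score values, and the score range are assumptions made for illustration; they are not taken from the paper or its released data.

```python
from math import sqrt
from statistics import mean


def gold_scores(annotations):
    """Average the individual scores collected for each sentence pair."""
    return [mean(scores) for scores in annotations]


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented toy data: 3 sentence pairs, 9 annotator scores each.
toy_annotations = [
    [0, 1, 0, 1, 0, 0, 1, 0, 0],
    [3, 4, 3, 3, 4, 3, 3, 4, 3],
    [6, 5, 6, 6, 5, 6, 6, 6, 5],
]
# Invented model predictions for the same 3 pairs.
toy_predictions = [0.5, 3.2, 5.9]

gold = gold_scores(toy_annotations)
r = pearson(toy_predictions, gold)
print(f"gold = {gold}, Pearson r = {r:.3f}")
```

The same `pearson` function could be applied to a single annotator's scores against the averaged annotations, which is one plausible way to obtain an "average annotator" figure; the paper's exact protocol may differ.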

BibTeX

@article{sido2024czech,
  title={Czech news dataset for semantic textual similarity},
  author={Sido, Jakub and Sej{\'a}k, Michal and Pra{\v{z}}{\'a}k, Ond{\v{r}}ej and Konop{\'\i}k, Miloslav and Moravec, V{\'a}clav},
  journal={Language Resources and Evaluation},
  pages={1--18},
  year={2024},
  publisher={Springer}
}