NLP group

Czech news dataset for semantic textual similarity

Jakub Sido, Ondřej Pražák, Miloslav Konopík, and Václav Moravec
Language Resources and Evaluation (2024)

Research topics:

Semantic Analysis

Abstract

This paper describes a novel dataset consisting of sentences with two different semantic similarity annotations: with and without surrounding context. The data originate from the journalistic domain in the Czech language. The final dataset contains 138,556 human annotations divided into train and test sets. In total, 485 journalism students participated in the creation process. To increase the reliability of the test set, we compute the final annotations as an average of 9 individual annotation scores. We evaluate the dataset quality by measuring inter- and intra-annotator agreement. Besides agreement numbers, we provide detailed statistics of the collected dataset. We conclude our paper with a baseline experiment of building a system for predicting the semantic similarity of sentences. Due to the massive number of training annotations (116,956), the model significantly outperforms an average annotator (Pearson’s correlation coefficient of 0.92 versus 0.86).
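The two quantitative steps mentioned in the abstract, averaging nine annotation scores per test item and scoring a system with Pearson's correlation coefficient, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `gold_scores` and `pearson`, the toy scores, and the score range used here are all assumptions made for illustration only.

```python
from math import sqrt
from statistics import mean


def gold_scores(annotations):
    """Average the individual annotation scores of each sentence pair.

    `annotations` is a list with one inner list of scores per pair
    (nine scores per pair in this toy example, mirroring the abstract).
    """
    return [mean(scores) for scores in annotations]


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: three sentence pairs, nine made-up scores each.
toy_annotations = [
    [0, 1, 0, 0, 1, 0, 0, 1, 0],
    [3, 4, 3, 3, 4, 3, 4, 3, 3],
    [6, 5, 6, 6, 5, 6, 6, 6, 5],
]
# Hypothetical system predictions for the same three pairs.
toy_predictions = [0.5, 3.1, 5.8]

gold = gold_scores(toy_annotations)
r = pearson(toy_predictions, gold)
print(f"gold = {gold}")
print(f"Pearson r = {r:.3f}")
```

Averaging several scores per item reduces the influence of any single annotator, which is why a system (or an individual annotator) is compared against the averaged score rather than against one person's judgment.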

Authors

Ing. Jakub Sido, Ph.D.

Researcher

Ing. Ondřej Pražák, Ph.D.

Researcher

Ing. Miloslav Konopík, Ph.D.

Researcher