NLP group

Czech Historical Named Entity Corpus v 1.0

Helena Hubková, Pavel Král, and Eva Pettersson
Proceedings of The 12th Language Resources and Evaluation Conference (2020)

Abstract

As the number of digitized archival documents grows rapidly, named entity recognition (NER) in historical documents has become important for information extraction and data mining. This task requires an annotated corpus, which has so far been missing for Czech. In this paper we present a new annotated data collection for historical NER, composed of Czech historical newspapers. The corpus is freely available for research purposes. We defined relevant domain-specific named entity types and created an annotation manual for corpus labelling. We also ran experiments on the corpus using recurrent neural networks, comparing randomly initialized embeddings with static and dynamic fastText word embeddings. We achieved an F1 score of 0.73 with a bidirectional LSTM model using static fastText embeddings.
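To make the reported metric concrete, the following is a minimal sketch of how an entity-level F1 score is commonly computed for NER from BIO-tagged sequences. It is an illustration only: the abstract does not state the exact evaluation scheme used in the paper, and the tag names (`PER`, `LOC`), the function names `extract_spans` and `entity_f1`, and the example sentence are hypothetical, not taken from the corpus.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive); an assumed representation
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence.

    Lenient assumption: a stray I- tag with no matching open entity
    starts a new entity instead of being discarded.
    """
    spans: Set[Span] = set()
    start, label = None, None
    # The trailing "O" sentinel closes an entity that ends the sequence.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> float:
    """Exact-match entity-level F1: a predicted span counts only if
    both its boundaries and its type match a gold span."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the model finds the person but misses the location.
gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]
score = entity_f1(gold, pred)  # precision 1.0, recall 0.5
```

Under this exact-match convention, a partially correct span earns no credit, which is one reason NER scores on noisy historical text tend to be lower than on modern text.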

Authors

prof. Ing. Pavel Král, Ph.D.

Team leader