NLP group

Evaluation of the Document Classification Approaches

Michal Hrala and Pavel Král
8th International Conference on Computer Recognition Systems (CORES 2013) (2013)

Research topics:

Document Classification

Abstract

This paper deals with one class automatic document classification. Five feature selection methods and three classifiers are evaluated on a Czech corpus in order to build an efficient Czech document classification system. Lemmatization and POS tagging are used for a precise representation of the Czech documents. We demonstrate that POS tag filtering is very important, while lemmatization plays only a marginal role in classification. We also show that the Maximum Entropy and Support Vector Machine classifiers are very robust to the feature vector size and significantly outperform the Naive Bayes classifier in terms of classification accuracy. The best classification accuracy is about 90%, which is sufficient for an application for the Czech News Agency, our commercial partner.
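To make the pipeline described in the abstract more concrete, here is a minimal sketch of POS tag filtering followed by a multinomial Naive Bayes classifier, written in plain Python. It is not the authors' implementation. The set of kept tags (`KEPT_POS`), the function names, the add-one smoothing, and the toy English documents are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter, defaultdict

# Assumption: only these POS tags are kept. The abstract does not say
# which tags the authors actually retained.
KEPT_POS = {"NOUN", "ADJ", "VERB"}


def pos_filter(tagged_doc, kept=KEPT_POS):
    """Return the lemmas of a (lemma, pos) sequence whose tag is in `kept`."""
    return [lemma for lemma, pos in tagged_doc if pos in kept]


def train_nb(docs, labels):
    """Collect the counts needed by a multinomial Naive Bayes classifier."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # words unseen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, already lemmatized and POS-tagged toy documents.
train = [
    ([("the", "DET"), ("team", "NOUN"), ("win", "VERB"), ("match", "NOUN")], "sport"),
    ([("a", "DET"), ("player", "NOUN"), ("score", "VERB"), ("goal", "NOUN")], "sport"),
    ([("the", "DET"), ("bank", "NOUN"), ("raise", "VERB"), ("rate", "NOUN")], "economy"),
    ([("the", "DET"), ("market", "NOUN"), ("fall", "VERB"), ("sharply", "ADV")], "economy"),
]
docs = [pos_filter(doc) for doc, _ in train]
labels = [label for _, label in train]
model = train_nb(docs, labels)

query = pos_filter([("the", "DET"), ("player", "NOUN"), ("win", "VERB")])
print(predict_nb(model, query))
```

Filtering happens before training, so discarded tags never enter the vocabulary. This is one simple way to shrink the feature vector, and the Naive Bayes step could be swapped for another classifier without changing the filtering code.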

Authors

prof. Ing. Pavel Král, Ph.D.

Team leader
