Evaluation of the Document Classification Approaches

Michal Hrala and Pavel Král
8th International Conference on Computer Recognition Systems (CORES 2013) (2013)

Research topics

Document classification

Abstract

This paper deals with one class automatic document classification. Five feature selection methods and three classifiers are evaluated on a Czech corpus in order to build an efficient Czech document classification system. Lemmatization and POS tagging are used for a precise representation of the Czech documents. We demonstrate that POS tag filtering is very important, while lemmatization plays a marginal role in classification. We also show that Maximum Entropy and Support Vector Machines are very robust to the feature vector size and significantly outperform the Naive Bayes classifier in terms of classification accuracy. The best classification accuracy is about 90%, which is sufficient for an application for the Czech News Agency, our commercial partner.
