Evaluation of the Document Classification Approaches
Michal Hrala and Pavel Král
8th International Conference on Computer Recognition Systems (CORES 2013) (2013)
Abstract
This paper deals with one-class automatic document classification. Five feature selection methods and three classifiers are evaluated on a Czech corpus in order to build an efficient Czech document classification system. Lemmatization and POS tagging are used for a precise representation of the Czech documents. We demonstrate that POS tag filtering is very important, while lemmatization plays only a marginal role in classification. We also show that Maximum Entropy and Support Vector Machines are very robust to the feature vector size and significantly outperform the Naive Bayes classifier in terms of classification accuracy. The best classification accuracy is about 90%, which is sufficient for an application for the Czech News Agency, our commercial partner.
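As a rough illustration of the document representation step mentioned in the abstract, the sketch below filters tokens by POS tag and optionally replaces word forms with lemmas before counting features. It is not the authors' implementation: the pre-tagged example sentence, the coarse tag names, the set of kept tags, and the function `build_features` are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical pre-tagged input: (word form, lemma, coarse POS tag).
# The sentence and the tag names are illustrative assumptions,
# not the tagset or tools used in the paper.
TAGGED_DOC = [
    ("Ministr", "ministr", "NOUN"),
    ("podepsal", "podepsat", "VERB"),
    ("smlouvu", "smlouva", "NOUN"),
    ("a", "a", "CONJ"),
    ("dodatky", "dodatek", "NOUN"),
    ("smlouvy", "smlouva", "NOUN"),
    (".", ".", "PUNCT"),
]

# Assumed set of POS tags considered informative for classification.
KEPT_TAGS = {"NOUN", "VERB", "ADJ"}


def build_features(tagged_tokens, kept_tags=KEPT_TAGS, use_lemmas=True):
    """Return bag-of-words counts after POS tag filtering.

    Tokens whose tag is not in kept_tags are dropped. The feature key is
    the lemma when use_lemmas is True, otherwise the lowercased word form.
    """
    features = Counter()
    for form, lemma, tag in tagged_tokens:
        if tag not in kept_tags:
            continue
        key = lemma if use_lemmas else form.lower()
        features[key] += 1
    return features


with_lemmas = build_features(TAGGED_DOC, use_lemmas=True)
without_lemmas = build_features(TAGGED_DOC, use_lemmas=False)
print(with_lemmas)
print(without_lemmas)
```

In this toy example, lemmatization merges the inflected forms "smlouvu" and "smlouvy" into a single feature, while POS filtering removes the conjunction and the punctuation regardless of whether lemmas are used. The resulting counts could then be passed to any feature selection method and classifier.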