Evaluation of the Document Classification Approaches
Michal Hrala and Pavel Král
8th International Conference on Computer Recognition Systems (CORES 2013) (2013)
Research topics
Document classification
Abstract
This paper deals with one class automatic document classification.
Five feature selection methods and three classifiers are evaluated on a
Czech corpus in order to build an efficient Czech document classification
system. Lemmatization and POS tagging are used for a precise representation
of the Czech documents. We demonstrated that POS tag filtering is very
important, while lemmatization plays only a marginal role in classification.
We also showed that Maximum Entropy and Support Vector Machines are
very robust to the feature vector size and significantly outperform the Naive
Bayes classifier in terms of classification accuracy. The best
classification accuracy is about 90%, which is sufficient for an application at
the Czech News Agency, our commercial partner.