Evaluation of the Document Classification Approaches

Michal Hrala and Pavel Král
8th International Conference on Computer Recognition Systems (CORES 2013) (2013)


Research topics:

Document Classification


This paper deals with one-class automatic document classification. Five feature selection methods and three classifiers are evaluated on a Czech corpus in order to build an efficient Czech document classification system. Lemmatization and POS tagging are used for a precise representation of the Czech documents. We demonstrated that POS tag filtering is very important, while lemmatization plays only a marginal role in classification. We also showed that Maximum Entropy and Support Vector Machines are very robust to the feature vector size and significantly outperform the Naive Bayes classifier in terms of classification accuracy. The best classification accuracy is about 90%, which is sufficient for an application for the Czech News Agency, our commercial partner.
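
The pipeline summarized in the abstract (POS tag filtering on lemmatized text, followed by a classifier) can be illustrated with a minimal sketch. It is not the authors' implementation: the set of kept POS tags, the toy English lemmas, the class labels, and the names `KEPT_POS`, `extract_features`, and `NaiveBayes` are all assumptions made for illustration, and a plain multinomial Naive Bayes with Laplace smoothing stands in for the classifiers evaluated in the paper.

```python
import math
from collections import Counter, defaultdict

# Assumption: keep nouns, adjectives and verbs. The abstract does not say
# which POS tags the paper actually retains.
KEPT_POS = {"N", "A", "V"}


def extract_features(tagged_doc, kept_pos=KEPT_POS):
    """Return the lemmas of a (lemma, pos) token list whose tag is kept."""
    return [lemma for lemma, pos in tagged_doc if pos in kept_pos]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in doc:
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy corpus: already lemmatized and POS-tagged documents.
train = [
    ([("government", "N"), ("approve", "V"), ("law", "N"), ("and", "J")], "politics"),
    ([("parliament", "N"), ("reject", "V"), ("law", "N"), ("in", "R")], "politics"),
    ([("player", "N"), ("score", "V"), ("goal", "N"), ("in", "R")], "sport"),
    ([("team", "N"), ("win", "V"), ("match", "N"), ("and", "J")], "sport"),
]

docs = [extract_features(tokens) for tokens, _ in train]
labels = [label for _, label in train]
model = NaiveBayes().fit(docs, labels)

query = extract_features([("player", "N"), ("in", "R"), ("goal", "N")])
prediction = model.predict(query)
```

In this sketch the POS filter removes function words such as conjunctions and prepositions before counting, so the feature vector contains only content-word lemmas.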



@InProceedings{Kral13CORES,
  author    = {Hrala, M. and Kr\'al, P.},
  title     = {Evaluation of the Document Classification Approaches},
  booktitle = {8th International Conference on Computer Recognition Systems (CORES 2013)},
  pages     = {877-885},
  year      = {2013},
  address   = {Milkow, Poland},
  month     = {27-29 May},
  publisher = {Springer},
  isbn      = {978-3-319-00968-1},
  doi       = {10.1007/978-3-319-00969-8\_86}
}