ČTK: Document classification

We have implemented a system for automatic categorization of press releases (sport, politics, etc.) for the Czech News Agency. It simplifies both adding new press releases and searching them. The challenge is the multi-label nature of the task: each press release can belong to several categories, and we do not know in advance how many apply. Our system works with the semantics of the press releases, because using words alone, without knowing the relations between them, is not sufficient. Our know-how was delivered in the form of a software license.

The task

There are approximately 12,000 press releases already stored in the database of the Czech News Agency. The goal is to make searching this database effective. Categorization is one of the key aspects of this goal, but editors make mistakes when categorizing, or they do not categorize at all. Our system improves this situation by categorizing press releases automatically and consistently.

The categorization of press releases is additional, unnecessary work for the editor. A press release can belong to more than one category, and choosing from a list of more than 40 categories is grueling. As a result, press releases are categorized incorrectly, and searching the archive becomes problematic and ineffective.

Our task is to improve the categorization of press releases. Using our algorithms for identifying word meaning, we should be better able to determine the correct categories. Keywords alone are not enough; we must also use the substantive meaning of the words. The editor receives preselected categories for a press release and then only needs to approve them or make minor changes. Automatic categorization improves the overall quality of categorization and therefore also the quality and efficiency of any subsequent work with the archive.

Our algorithms have been integrated into the system of the Czech News Agency. The editor works only with this system, in which the categories for press releases are automatically preselected.

The challenge and solution

The main challenge is to reliably determine all the categories for a press release without knowing in advance how many apply. The selection is made from more than 40 categories. The solution is based on the results of our research, packaged as software libraries that we prepared for use in implementing product innovations.

It is difficult to classify a press release into a previously unknown number of categories. Conventional systems are based simply on keywords and achieve only average results. We have achieved distinctly better results by incorporating our algorithms for understanding word meanings and semantic recognition. We have significantly improved the ability to correctly recognize sparsely represented categories, meaning that editors can rely more on the system.
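The multi-label decision described above, where the number of categories is not known in advance, can be illustrated with a minimal sketch. This is not the algorithm delivered to the Czech News Agency; it only assumes that some model has already produced a score between 0 and 1 for each category. The function name select_categories, the threshold of 0.5, the fallback to the single best category, and the example scores are all hypothetical.

```python
from typing import Dict, List


def select_categories(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every category whose score reaches the threshold.

    The number of returned categories is not fixed; it depends on the scores.
    Assumption: if no category reaches the threshold, fall back to the
    single best-scoring one so the editor always gets a suggestion.
    """
    if not scores:
        return []
    # Rank categories from the highest score to the lowest.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    selected = [name for name, score in ranked if score >= threshold]
    if not selected:
        selected = [ranked[0][0]]
    return selected


# Hypothetical scores for one press release (illustrative values only).
example_scores = {"sport": 0.91, "politics": 0.12, "economy": 0.64, "culture": 0.05}
preselected = select_categories(example_scores)
print(preselected)  # ['sport', 'economy']
```

In a setup like this, the preselected list is what the editor would then approve or adjust. Lowering the threshold suggests more categories per press release, and raising it suggests fewer.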

The results of our system are comparable to those achieved by humans. The differences were mostly caused by a human having categorized the article incorrectly.
