ČTK: Document classification

We have implemented a system for the automatic categorization of press releases (sport, politics, etc.) for the Czech News Agency. It simplifies both adding new press releases and searching through them. The challenge is the multi-label nature of the task: each press release can belong to several categories, and we do not know in advance how many categories apply to it. Our system works with the semantics of the press releases, because looking at the words alone, without knowing how they relate to each other, is not sufficient. Our know-how was delivered in the form of a software license.

About

Categorizing press releases is additional manual work for the editor. A press release can belong to more than one category, and choosing from a list of more than 40 categories is grueling. As a result, press releases are categorized incorrectly, and searching the archive becomes problematic and ineffective.

Our task is to improve the categorization of press releases. Our algorithms for identifying word meaning let us determine the correct categories more reliably. Keywords alone are not enough; we must also use the substantive meaning of the words.

Our algorithms have been integrated into the system of the Czech News Agency, so the editor works only with this one system, in which the categories for each press release are automatically preselected. The editor receives the preselected categories and only needs to approve them or make minor changes. Automatic categorization improves the overall quality of the categorization and therefore also the quality and effectiveness of any subsequent work with the archive.

It is difficult to classify a press release into a previously unknown number of categories. Conventional systems are based simply on keywords and achieve only average results. We have achieved distinctly better results by incorporating our algorithms for understanding word meanings and semantic recognition. We have significantly improved the ability to correctly recognize sparsely represented categories, which means that editors can rely more on the system. The results of our system are comparable to those achieved by humans. Where the system and the human labels differed, the cause was mostly an incorrect categorization of the article by the human.
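
To illustrate why classifying into a previously unknown number of categories is harder than picking a single best label, here is a minimal sketch. It is not the system delivered to the Czech News Agency: the function `select_categories`, the thresholds, and the example scores are all hypothetical, and the scores are assumed to come from some upstream classifier. The sketch only shows one common way to turn per-category scores into a variable-length list of preselected categories.

```python
# Hypothetical sketch: choose a variable number of categories from scores.
# Assumption: an upstream model has already produced one score in [0, 1]
# per category. Names and threshold values are illustrative only.

def select_categories(scores, rel_threshold=0.5, min_score=0.2):
    """Return every category scoring close enough to the best one.

    A category is kept when its score reaches both an absolute floor
    (min_score) and a fraction (rel_threshold) of the top score, so the
    number of returned categories is not fixed in advance.
    """
    if not scores:
        return []
    best = max(scores.values())
    if best < min_score:
        # Nothing is confident enough; leave the choice to the editor.
        return []
    cutoff = max(min_score, best * rel_threshold)
    chosen = [cat for cat, score in scores.items() if score >= cutoff]
    # Strongest suggestions first, so the editor sees them at the top.
    return sorted(chosen, key=lambda cat: scores[cat], reverse=True)


example_scores = {"sport": 0.91, "politics": 0.12, "economy": 0.58, "culture": 0.05}
preselected = select_categories(example_scores)
print(preselected)  # ['sport', 'economy']
```

With these example scores two categories are preselected; a different press release could yield one, three, or none, and the editor would then approve or adjust the suggestion.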
