SibScript

AUTOMATIC CLASSIFICATION OF SEMISTRUCTURED DOCUMENTS IN SCIENTIFIC AND EDUCATIONAL PROCESS

Abstract

Universities use numerous semi-structured documents every day in education and research. One way to process such documents uniformly is to work with their metadata rather than with the documents themselves. For semi-structured documents, however, this approach is efficient only if the metadata can be extracted from document content automatically. The extraction procedure has three stages: identifying the document class, clustering the documents whose class could not be identified, and extracting metadata from the documents of identified classes. This paper addresses possible solutions for the first stage, the automatic classification of semi-structured documents. It gives a definition of a semi-structured document, sets out criteria for evaluating the efficiency of classification methods, and presents a comparative analysis of different methods against the five top criteria. Two additionally developed criteria are evaluated for three methods: multilayer neural networks, the Rocchio algorithm, and the k-nearest neighbor method. According to the analysis, the neural network method offers the best trade-off between accuracy and speed. Even so, classification accuracy remains insufficient for semi-structured documents. The authors suggest that the accuracy of the methods can be improved by using the determined document structure, and not only keywords, during classification.
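The abstract gives no implementation details, so the following is only a minimal sketch of how the first stage could look with one of the named methods, the Rocchio algorithm, operating on keywords alone. All function names, the sample training texts, the class labels, and the similarity threshold are hypothetical. Returning `None` for a low-similarity document is an assumption made here to model documents whose class could not be identified and which would pass to the clustering stage.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words term counts over lowercase whitespace tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_rocchio(labeled_docs):
    """Build one centroid (mean term vector) per class."""
    sums = defaultdict(Counter)
    counts = Counter()
    for text, label in labeled_docs:
        sums[label].update(vectorize(text))
        counts[label] += 1
    return {
        label: {term: total / counts[label] for term, total in vec.items()}
        for label, vec in sums.items()
    }


def classify(text, centroids, threshold=0.2):
    """Return the class of the most similar centroid.

    Returns None when no centroid reaches the (assumed) threshold,
    i.e. the document class could not be identified.
    """
    vec = vectorize(text)
    best_label, best_score = None, 0.0
    for label, centroid in centroids.items():
        score = cosine(vec, centroid)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None


# Hypothetical training data: keyword-only descriptions of two document classes.
TRAINING = [
    ("curriculum course hours credits semester", "curriculum"),
    ("curriculum discipline hours lectures semester", "curriculum"),
    ("thesis abstract supervisor defense dissertation", "thesis"),
    ("thesis chapter supervisor references dissertation", "thesis"),
]

centroids = train_rocchio(TRAINING)
identified = classify("course hours per semester", centroids)
unidentified = classify("invoice payment total", centroids)
```

In this sketch the first query shares keywords with the curriculum centroid and is assigned to that class, while the second shares none and is left unclassified. The sketch uses keywords only; it does not model the document structure that the authors propose as an additional feature.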

About the Authors

A. M. Gudov
Kemerovo State University
Russian Federation
Alexander M. Gudov – Doctor of Technical Sciences, Associate Professor, Head of the UNESCO Department for New Information Technologies


S. Yu. Zavozkin
Kemerovo State University
Russian Federation
Sergey Yu. Zavozkin – Candidate of Technical Sciences, Assistant Professor at the UNESCO Department for New Information Technologies


V. A. Shevnin
Kemerovo State University
Russian Federation
Vasily A. Shevnin – post-graduate student at the UNESCO Department for New Information Technologies


For citations:


Gudov A.M., Zavozkin S.Yu., Shevnin V.A. AUTOMATIC CLASSIFICATION OF SEMISTRUCTURED DOCUMENTS IN SCIENTIFIC AND EDUCATIONAL PROCESS. SibScript. 2014;(4-3):43-47. (In Russ.)

This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2949-2122 (Print)
ISSN 2949-2092 (Online)