A study of vectorization methods for unstructured text documents in natural language according to their influence on the quality of work of various classifiers
https://doi.org/10.17586/2226-1494-2022-22-1-114-119
Abstract
The widespread growth in the volume of information processed at critical information infrastructure facilities, presented as natural-language text, raises the problem of classifying such texts by degree of confidentiality. Success in solving this problem depends both on the classifier model itself and on the chosen feature-extraction (vectorization) method: the classifier must receive the properties of the source text, including its full set of demarcation features, as completely as possible. The paper presents an empirical assessment of the effectiveness of linear classification algorithms depending on the chosen vectorization method and, in the case of the Hash Vectorizer, on the values of its configurable parameters. Declassified state documents, conditionally treated as confidential, are used as the dataset for training and testing the classification algorithms. This text corpus was chosen because of the specific terminology found throughout declassified documents. Terminological saturation, serving as a simple demarcation boundary and acting as a classification feature, eases the work of the classification algorithms, which in turn makes it possible to isolate the contribution of the chosen vectorization method. The quality metric for the algorithms is the classification error, defined as the complement of the proportion of correct answers, i.e. error = 1 − accuracy. The algorithms were also evaluated by training time. The resulting histograms show the error magnitude and training time of each algorithm, and the most and least effective algorithms for each vectorization method are identified. The results make it possible to improve the efficiency of solving practical classification problems for small text documents characterized by specific terminology.
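The hashing-based vectorization discussed in the abstract can be illustrated with a minimal pure-Python sketch of the hashing trick: each token is hashed into one of a fixed number of buckets, so the output vector length is a configurable parameter independent of vocabulary size. The function name and the choice of MD5 as the hash are illustrative assumptions, not the paper's implementation:

```python
import hashlib

def hash_vectorize(text, n_features=16):
    """Map a token stream to a fixed-length count vector via the hashing trick."""
    vec = [0] * n_features
    for token in text.lower().split():
        # A stable cryptographic hash keeps results reproducible across runs;
        # production vectorizers typically use a faster non-cryptographic hash.
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        vec[h % n_features] += 1
    return vec

# Three tokens are distributed over 8 buckets; repeated tokens
# accumulate in the same bucket.
vec = hash_vectorize("secret report secret", n_features=8)
```

Shrinking `n_features` saves memory at the cost of more hash collisions (distinct tokens sharing a bucket), which is the kind of quality trade-off a parameter study of the Hash Vectorizer explores.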
About the Authors
V. V. Shadsky
Russian Federation
Viktor V. Shadsky — PhD Student
Krasnodar, 350063
A. B. Sizonenko
Russian Federation
Alexander B. Sizonenko — D.Sc., Associate Professor
Krasnodar, 350063
M. A. Chekmarev
Russian Federation
Maxim A. Chekmarev — PhD Student
Krasnodar, 350063
A. V. Shishkov
Russian Federation
Andrey V. Shishkov — Student
Krasnodar, 350063
D. A. Isakin
Russian Federation
Daniil A. Isakin — Student
Novosibirsk, 630087
References
1. Batura T.V. Automatic text classification methods. Software & Systems, 2017, no. 1, pp. 85–99. (in Russian). https://doi.org/10.15827/0236-235X.030.1.085-099
2. Bortnikov V.I., Mikhailova Iu.N. Documentary Linguistics. Ekaterinburg, Izdatel’stvo Ural’skogo universiteta Publ., 2017, 132 p. (in Russian)
3. Rogotneva E.N. Documentary Linguistics. Teaching materials. Tomsk, Tomsk Polytechnic University Publ., 2011, 784 p. (in Russian)
4. Orlov A.I. Mathematical methods of classification theory. Polythematic online scientific journal of Kuban State Agrarian University, 2014, no. 95, pp. 23–45. (in Russian)
5. Kosova M.V., Sharipova R.R. Termination as the basis for classification of document texts. Science Journal of Volgograd State University. Linguistics, 2016, vol. 15, no. 4, pp. 245–252. (in Russian). https://doi.org/10.15688/jvolsu2.2016.4.26
6. Terskikh N.V. Term as a unit of specialized knowledge. Sistema cennostej sovremennogo obshhestva, 2008, no. 3, pp. 97–104. (in Russian)
7. Rozental D.E., Golub I.B., Telenkova M.A. Contemporary Russian Language. Moscow, AJRIS-press Publ., 2014, 448 p. (in Russian)
8. Krasheninnikov A.M., Gdanskiy N.I., Rysin M.L. Linear classification of objects using normal hyperplanes. Engineering journal of Don, 2012, no. 4-1 (22), pp. 94–99. (in Russian)
9. Nelson D. Overview of Classification Methods in Python with Scikit-Learn. Stack Abuse. Available at: https://stackabuse.com/overview-of-classification-methods-in-python-with-scikit-learn/ (accessed: 20.12.2021).
10. Woods W. Important issues in knowledge representation. Proceedings of the IEEE, 1986, vol. 74, no. 10, pp. 1322–1334. https://doi.org/10.1109/PROC.1986.13634
11. Raschka S., Mirjalili V. Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2. Packt Publishing Ltd, 2019, 770 p.
12. Qaiser S., Ali R. Text mining: Use of TF-IDF to examine the relevance of words to documents. International Journal of Computer Applications, 2018, vol. 181, no. 1, pp. 25–29. https://doi.org/10.5120/ijca2018917395
13. Ganesan K. HashingVectorizer vs. CountVectorizer. Available at: https://kavita-ganesan.com/hashingvectorizer-vs-countvectorizer/#.YcGOyavP2Ul (accessed: 20.12.2021).
14. Brownlee J. How to Encode Text Data for Machine Learning with scikit-learn. Machine Learning Mastery. Available at: https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/ (accessed: 20.12.2021).
15. Pagels M. Introducing One of the Best Hacks in Machine Learning: the Hashing Trick. Medium. Available at: https://medium.com/value-stream-design/introducing-one-of-the-best-hacks-in-machine-learning-the-hashing-trick-bf6a9c8af18f (accessed: 20.12.2021).
For citations:
Shadsky V.V., Sizonenko A.B., Chekmarev M.A., Shishkov A.V., Isakin D.A. A study of vectorization methods for unstructured text documents in natural language according to their influence on the quality of work of various classifiers. Scientific and Technical Journal of Information Technologies, Mechanics and Optics. 2022;22(1):114-119. (In Russ.) https://doi.org/10.17586/2226-1494-2022-22-1-114-119