MULTI-LABEL CLASSIFICATION OF LABELS OF SYSTEM LOGS OF COMPUTER NETWORKS. FORMALIZATION OF THE TASK
O. I. Sheluhin, D. I. Rakovskiy, Moscow Technical University of Communications and Informatics
Abstract: Purpose. An important problem in the intelligent processing of syslog data is that datasets contain records associated with several class labels at once. A dataset suitable for classification typically contains a set of features and an associated set of class labels. The goal of classification is a trained model capable of assigning an appropriate class to an unknown object (a record in the "historical data"). Solving this problem involves an exponential growth in the number of label combinations that must be taken into account, as well as the computational cost of training data-mining models. Multi-label classification as applied to computer networks (CN) has not yet been sufficiently studied. The aim of the study is to formalize the problem of multi-label classification of experimental data (binary or multiclass), using CN system log entries as an example, and to demonstrate its applicability to information security problems. Novelty. The novelty of the study lies in demonstrating that multiple class labels arise in the analysis of syslog entries generated by a CN. It is shown that this property is inherent in most CNs, which are subject to boundary requirements on several indicators (attributes) of predetermined Service Level Objectives. When anomalous states occur for several attributes at once, the increase in the number of labels is a prerequisite for a rare anomalous state (system anomaly) of the CN at the current time. Results. It is shown that the ambiguity (multi-valuedness) of system log class labels is relevant to the analysis of the availability and integrity of information circulating in the CN. It is shown that this ambiguity manifests itself not only in the occurrence of several CN states at the current time, but also in the implicit multi-valued mapping of known CN attributes to these states.
It is shown that single-label learning algorithms return a scalar label, and the resulting single-label classifiers label the data with a loss of information. The multi-label approach operates on label sets (or vectors), and the resulting multi-label classifier can assign several labels to a CN state at once, which increases classification accuracy. The significance of the secondary attributes of the "historical data", which determine the quality of multi-label classification, is shown. Practical relevance. Multiple class labels for system log entries are relevant to diagnosing malfunctions of CN hardware components, detecting attacks, detecting suspicious network activity, and other information security tasks.
Keywords: computer networks; multi-label classification; multiclass classification; information security.
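The contrast drawn in the abstract between a scalar label and a label vector can be illustrated with a minimal Python sketch. It is not the authors' formalization: the attribute names, the SLO limits, and the rule that collapses a vector into a single label are all assumptions chosen only for illustration.

```python
# Hypothetical SLO limits; attribute names and values are illustrative only.
SLO_LIMITS = {"latency_ms": 100.0, "packet_loss_pct": 1.0, "cpu_load_pct": 90.0}
LABELS = list(SLO_LIMITS)  # one binary label per monitored attribute


def multi_label(record):
    """Return a binary label vector: 1 where an attribute exceeds its SLO limit."""
    return [int(record[name] > SLO_LIMITS[name]) for name in LABELS]


def single_label(record):
    """Collapse the vector to one scalar: index of the first violated
    attribute, or -1 if no limit is violated (an assumed collapsing rule)."""
    for index, flag in enumerate(multi_label(record)):
        if flag:
            return index
    return -1


# A hypothetical log record in which two attributes are anomalous at once.
record = {"latency_ms": 250.0, "packet_loss_pct": 3.5, "cpu_load_pct": 40.0}
vector = multi_label(record)   # both violations are kept
scalar = single_label(record)  # only the first violation survives

print("multi-label vector:", dict(zip(LABELS, vector)))
print("single-label scalar:", scalar)
```

In this sketch the scalar label records only the latency violation, while the label vector also keeps the simultaneous packet-loss violation, which is the loss of information that the abstract attributes to single-label classifiers.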