Mid Sweden University

Advanced Algorithms for Classification and Anomaly Detection on Log File Data: Comparative study of different Machine Learning Approaches
Mid Sweden University, Faculty of Science, Technology and Media, Department of Information Systems and Technology.
2021 (English)
Independent thesis Advanced level (professional degree), 20 credits / 30 HE credits
Student thesis
Abstract [en]

Background: A problematic area in today’s large-scale distributed systems is the exponentially growing amount of log data. Finding anomalies by manually inspecting and monitoring this data becomes progressively more challenging, complex and time-consuming. Yet finding them is vital for keeping these systems available around the clock.

Aim: The main objective of this study is to determine which Machine Learning (ML) algorithms are the most suitable and whether they can live up to the needs and requirements regarding optimization and efficiency in the log data monitoring area. This includes identifying which specific steps of the overall problem can be improved by using these algorithms for anomaly detection and classification on different real, provided data logs.

Approach: An initial pre-study was conducted, after which logs were collected and preprocessed with the log parsing tool Drain and regular expressions. The approach consisted of two combinations: K-Means + XGBoost, and Principal Component Analysis (PCA) + K-Means + XGBoost. These were trained, tested and individually evaluated with different metrics against two datasets, one being a server data log and the other an HTTP access log.
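
The regular-expression part of the preprocessing step can be illustrated with a minimal sketch. It is not the thesis's code and it is not Drain: the Common Log Format layout, the sample lines, and the helper names `parse_access_line` and `mask_variables` are all assumptions made for illustration, since the abstract does not state the log formats or the expressions used.

```python
import re

# Assumed layout: Common Log Format. The abstract does not specify the format.
ACCESS_LOG = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_access_line(line):
    """Return the named fields of one access-log line, or None if it does not match."""
    match = ACCESS_LOG.match(line)
    return match.groupdict() if match else None


def mask_variables(message):
    """Replace IP addresses and bare numbers with placeholders.

    This yields a rough event template; a log parser such as Drain
    derives templates in a more elaborate way.
    """
    message = re.sub(r'\b\d{1,3}(?:\.\d{1,3}){3}\b', '<IP>', message)
    message = re.sub(r'\b\d+\b', '<NUM>', message)
    return message


# Hypothetical sample lines, not taken from the thesis datasets.
sample = '192.168.0.7 - - [27/Sep/2021:10:15:32 +0200] "GET /index.html HTTP/1.1" 200 5120'
fields = parse_access_line(sample)
template = mask_variables("Connection from 10.0.0.5 closed after 42 seconds")

print(fields["method"], fields["status"], template)
```

Structured fields and masked templates of this kind are what a downstream clustering or classification step would typically consume as features.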

Results: The results showed that both approaches performed very well on both datasets. They were able to classify, detect and make predictions on log data events with high accuracy and precision and low calculation time. It was further shown that when applied without dimensionality reduction (PCA), the results of the prediction model are slightly better, by a few percent. As for the prediction time, there was marginal to no difference between running with and without PCA.
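
As a hedged illustration of the kind of evaluation described here, the sketch below computes accuracy and precision and measures prediction time. The labels, the input scores, and the threshold rule standing in for a trained model are all made up for the example. They are not results or methods from the thesis.

```python
import time


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision(y_true, y_pred, positive=1):
    """Fraction of predicted positives that are truly positive."""
    true_of_predicted = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not true_of_predicted:
        return 0.0
    return sum(t == positive for t in true_of_predicted) / len(true_of_predicted)


def timed_predict(predict, samples):
    """Run predict on every sample and return (predictions, elapsed seconds)."""
    start = time.perf_counter()
    predictions = [predict(s) for s in samples]
    return predictions, time.perf_counter() - start


# Made-up data: 1 marks an anomalous event, 0 a normal one.
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
scores = [0.1, 0.2, 0.9, 0.4, 0.7, 0.8, 0.3, 0.1]


def toy_model(score):
    """Stand-in for a trained classifier: a fixed threshold on a score."""
    return 1 if score > 0.5 else 0


y_pred, elapsed = timed_predict(toy_model, scores)
print(f"accuracy={accuracy(y_true, y_pred):.2f}")
print(f"precision={precision(y_true, y_pred):.2f}")
print(f"prediction time={elapsed:.6f} s")
```

Comparing two pipelines, for example with and without a dimensionality reduction step, would amount to calling `timed_predict` once per pipeline on the same test samples and comparing the resulting metrics and timings.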

Conclusions: Overall, there are very small differences between the results with and without PCA. In essence, however, it is better not to use PCA and instead to apply the original data to the ML models. The models' performance is generally very dependent on the data being applied, its initial preprocessing steps, its size and its structure, which affect the calculation time the most.

Place, publisher, year, edition, pages
2021. p. 114
Keywords [en]
Machine Learning (ML), K-Means, Principal Component Analysis (PCA), XGBoost, Log data, Anomaly Detection, Outlier Detection, Clustering.
National Category
Computer Engineering
Identifiers
URN: urn:nbn:se:miun:diva-43175
Local ID: DT-V21-A2-010
OAI: oai:DiVA.org:miun-43175
DiVA, id: diva2:1597379
Subject / course
Computer Engineering DT1
Educational program
Master of Science in Engineering - Computer Engineering TDTEA 300 higher education credits
Supervisors
Examiners
Available from: 2021-09-27. Created: 2021-09-27. Last updated: 2021-09-27. Bibliographically approved.

Open Access in DiVA

fulltext (6789 kB), 1294 downloads
File information
File name: FULLTEXT01.pdf
File size: 6789 kB
Checksum SHA-512: e424176571bd4e13a14d42b6e407f05732ca79b2d3b5aa2bd51b4119e909055e4ae245f25a81abea0e2be413673fcfcca41bae89c97dc0e4e9d3048537884fdf
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Wessman, Filip
By organisation
Department of Information Systems and Technology
Computer Engineering

Total: 1294 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 1733 hits