Machine Learning-based Intrusion Detection For Identifying Zero-day Attacks In Unseen Data


Abstract

With the increasing dependency on digital infrastructure, cybersecurity threats have become more prevalent and severe. The constant evolution of digital systems has introduced complex threats, notably zero-day exploits. This paper explores the limitations of current intrusion detection systems in identifying such zero-day threats by utilizing the CIC-MalMem-2022 dataset and autoencoder-based anomaly detection. The proposed method integrates a trained autoencoder with XGBoost and Random Forest algorithms, creating hybrid models named XGBoost-AE and Random Forest-AE. Results indicate that embedding anomaly detection mechanisms significantly improves traditional classifiers’ effectiveness. The Random Forest-AE model achieved flawless metrics—100% in accuracy, precision, recall, F1 score, and Matthews Correlation Coefficient (MCC)—surpassing models previously introduced by Balasubramanian et al., Khan, Mezina et al., Smith et al., and Dener et al. When validated on novel (unseen) data, the Random Forest-AE model maintained exceptional performance, achieving 99.9892% accuracy, 100% precision, 99.9803% recall, 99.9901% F1 score, and an MCC of 99.8313%. These findings confirm the model’s robustness in identifying unknown cyber threats with high reliability.

 

Introduction

As digital infrastructure becomes increasingly central to modern life, cybersecurity has emerged as a critical priority for individuals, organizations, and governments alike. The rapid advancement of technology and widespread internet adoption have led to a surge in sophisticated cyber threats. Among the most concerning of these are zero-day attacks—exploits that target unknown vulnerabilities in software or hardware, often before developers or security professionals are aware of them [1]. Traditional rule-based intrusion detection systems often fall short in identifying such novel threats, making it necessary to explore more adaptive and intelligent solutions. Machine learning (ML) and deep learning (DL) techniques have gained prominence in recent years as powerful tools for threat detection. These methods, unlike static defense mechanisms, analyze large volumes of historical and real-time network data to detect abnormal behaviors and potential intrusions [2, 3]. By learning from both attack patterns and normal behavior, these models can identify not only known threats but also emerging risks that deviate from expected norms [4].

In this context, the current study proposes a new intrusion detection framework aimed at enhancing the detection of zero-day attacks, especially in previously unseen data. The model integrates autoencoders with two robust machine learning algorithms—Random Forest and XGBoost—to create hybrid systems named Random Forest-AE and XGBoost-AE. These models leverage the feature extraction capability of autoencoders to improve detection accuracy, even in unpredictable real-world environments. Comparative analysis with existing methods demonstrates that the proposed Random Forest-AE model consistently achieves superior performance in identifying unseen threats. The paper is structured as follows: Section 2 presents a review of existing literature; Section 3 explains the methodology and model architecture; Section 4 discusses the results using the CIC-MalMem-2022 dataset; Section 5 explores limitations and proposes future research directions; and Section 6 concludes with a summary of findings and implications.

Related Work

Detecting zero-day attacks remains a central challenge in cybersecurity, leading to a wide range of research aimed at developing more effective detection methods. This section reviews major approaches in the field, positioning the present study within the broader research landscape and identifying critical gaps it aims to address. Supervised learning techniques have been widely explored for intrusion detection. Smith et al. compared supervised and unsupervised models using the Malware-Exploratory and CIC-MalMem-2022 datasets [5]. Their analysis included three clustering algorithms (K-Means, DBSCAN, and GMM) and seven classifiers, such as Decision Trees, Random Forests, AdaBoost, K-Nearest Neighbors, Stochastic Gradient Descent, Extra Trees, and Gaussian Naïve Bayes. While the classifiers achieved high accuracy rates above 90%, their dependence on labeled data and known attack signatures often limits their ability to detect zero-day threats. In the domain of deep learning, Dener and Orman implemented both ML and DL models to analyze memory-based malware in the CIC-MalMem-2022 dataset using Apache Spark’s Pyspark framework [6]. Among the nine algorithms tested—ranging from Random Forest and Logistic Regression to Deep Feedforward Neural Networks and Long Short-Term Memory (LSTM)—Logistic Regression achieved the highest accuracy at 99.97%, followed closely by gradient boosting models. While these methods offer impressive results, they are often computationally expensive and require large-scale datasets to perform effectively [7,8].

Semi-supervised and unsupervised models also offer promising alternatives. Mbona and Eloff utilized Benford’s law alongside semi-supervised machine learning to identify key features in network traffic, using GMMs [10] and one-class SVMs [11] across datasets like CICDDoS2019, IOTIntrusion2020, and CIRA-CIC-DoHBrw-2020 [12,13]. These approaches reduce the reliance on labeled data but can still struggle with detecting obfuscated or novel attacks. Convolutional neural networks (CNNs) have also shown effectiveness in malware detection. Mezina and Burget developed a dilated CNN for multi-class classification of obfuscated malware using CIC-MalMem-2022. High accuracy scores were reported for binary classifications, especially with models like Random Forest (0.99992), KNN (0.99966), and Decision Tree (0.99923) [3]. However, CNN models often require significant training data and computational power, which may hinder their practical deployment. To directly address the novelty of zero-day threats, Soltani et al. introduced a deep learning-based open-set identification framework capable of detecting unknown attack types. Tested on CIC-IDS2017 and CSE-CIC-IDS2018 datasets, their model successfully handled previously unseen classes with an average accuracy of 99% [14,15]. While effective, such models may need further validation across more diverse datasets.

Generative Adversarial Networks (GANs) are also emerging as a promising tool. De Araujo-Filho et al. proposed an unsupervised intrusion detection system for 5G networks combining GANs, temporal convolutional networks (TCNs), and self-attention modules. Using the CICDDoS2019 dataset, their model surpassed baseline GAN-based methods like FID-GAN and ALAD, achieving high detection rates and faster response times [2]. Despite their potential, GANs are notoriously difficult to train and optimize. Feature engineering has also been a crucial focus. Balasubramanian et al. highlighted the role of pre-processing and feature selection in improving detection performance. Utilizing methods like correlation heat maps, Extra Trees classifiers [16], and ANOVA [17], they achieved notable results in binary classification on CIC-MalMem-2022, with top-performing models such as Decision Tree and Random Forest achieving detection accuracies of 0.9999 [18]. However, such approaches can be labor-intensive and computationally demanding. Despite these advancements, existing methods often fall short in detecting zero-day attacks due to their dependence on labeled data, model complexity, or limited generalizability. The proposed study aims to overcome these limitations by integrating an autoencoder-based anomaly detection mechanism with Random Forest and XGBoost. This novel hybrid design enables the models to learn normal behavior patterns and effectively identify deviations, significantly improving their capability to detect previously unseen threats.

Methodology

In this study, the term “unseen data” refers to any data that the model has not encountered during the training phase. This distinction is critical for evaluating the model’s effectiveness in detecting zero-day attacks—malicious activities that exploit previously unidentified vulnerabilities. Unseen data is categorized into two primary types. The first category includes novel attacks, which are entirely new threats that do not resemble known attack patterns and represent a major detection challenge. The second category involves variations of known attacks, which are slightly modified instances of known exploits that maintain some underlying similarities to the original patterns. To tackle the challenge of performance degradation in the detection of such unseen threats, this study proposes a hybrid intrusion detection method that integrates an autoencoder with traditional machine learning classifiers, namely Random Forest and XGBoost. The methodology comprises three main stages: data preprocessing, model construction, and evaluation. Figure 1 illustrates the workflow of the proposed approach. In the data preprocessing stage, raw data is refined through cleaning, formatting, and feature extraction to improve the quality and generalizability of the training dataset. The CIC-MalMem-2022 dataset is utilized for this purpose, with a specific emphasis on separating benign data to train the autoencoder effectively. By focusing on normal data, the autoencoder learns to reconstruct legitimate behavior patterns. An anomaly detector is then constructed by setting a threshold on reconstruction errors to differentiate between benign and potentially malicious inputs. This anomaly signal is subsequently passed to a hybrid classifier that combines Random Forest and XGBoost models for final decision-making. The entire system is evaluated using standard performance metrics to validate its effectiveness in classifying and detecting zero-day threats.

 

 

Fig 1. Flowchart of the proposed method.
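For illustration, the workflow in Fig 1 can be sketched in Python as follows. This is a minimal sketch rather than the exact implementation used in this study: the file name, the column names and label values, the 99th-percentile threshold, and all hyperparameters are assumptions, and scikit-learn's MLPRegressor stands in here for the autoencoder (a Keras-based sketch is given later in the Autoencoder subsection).

# Minimal end-to-end sketch of the proposed workflow (assumed names and settings).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("Obfuscated-MalMem2022.csv")              # assumed file name
X = df.drop(columns=["Category", "Class"]).values           # numeric features
y = (df["Class"] == "Malware").astype(int).values           # 1 = malicious, 0 = benign

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Step 1: train a reconstructor on benign training samples only.
benign = X_train[y_train == 0]
ae = MLPRegressor(hidden_layer_sizes=(32, 8, 32), max_iter=300, random_state=42)
ae.fit(benign, benign)

# Step 2: reconstruction error is the anomaly signal; threshold set on benign errors.
def recon_error(model, data):
    return np.mean((model.predict(data) - data) ** 2, axis=1)

threshold = np.percentile(recon_error(ae, benign), 99)      # assumed percentile

# Step 3: pass the anomaly signal to the classifier for the final decision.
def augment(data):
    err = recon_error(ae, data)
    return np.column_stack([data, err, err > threshold])

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(augment(X_train), y_train)
print("test accuracy:", clf.score(augment(X_test), y_test))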

 

Data Acquisition

The dataset utilized in this research is CIC-MalMem-2022, obtained from the Canadian Institute for Cybersecurity (https://www.unb.ca/cic/datasets/malmem-2022.html). As one of the most recent and comprehensive publicly available datasets for malware detection, CIC-MalMem-2022 provides a valuable resource for evaluating machine learning models in cybersecurity contexts. It includes obfuscated malware samples that reflect real-world threats such as ransomware, spyware, and Trojan-based attacks. The dataset consists of 58,596 records, evenly divided into 29,298 benign and 29,298 malicious samples. These samples are characterized by 57 features, of which 55 are numerical and 2 are categorical, offering a diverse set of attributes to train and evaluate the detection models effectively.

The adoption of this widely recognized dataset facilitates meaningful comparison with previous studies, enabling performance benchmarking of the proposed models against established baselines. This approach not only strengthens the reproducibility of the results but also highlights improvements and novel contributions. However, it is important to acknowledge the limitations inherent in relying solely on a single dataset. CIC-MalMem-2022 represents a specific snapshot of malware activity and may not encompass the full spectrum of emerging cyber threats. To address this, future research will involve testing the models on alternative datasets and applying cross-dataset validation strategies. This involves training the models on one dataset and evaluating them on another, offering deeper insights into their ability to generalize across varied environments and attack patterns.

Data Preprocessing

After acquiring the dataset, a series of preprocessing steps are applied to ensure the integrity and usability of the data for machine learning tasks. High-quality input data is essential for effective model training, as it enhances pattern recognition and the extraction of meaningful insights. The preprocessing phase in this study includes data cleaning, data formatting, and feature selection—each contributing to improved model performance and robustness. Data cleaning involves the removal of outliers, duplicate records, missing values, and noisy entries from the dataset. This step is critical for eliminating irrelevant or misleading information that could degrade the performance of the model. By refining the input data, the learning algorithms are better positioned to focus on genuine patterns and anomalies relevant to malware detection. Data formatting refers to the organization and standardization of the data to ensure consistency during analysis and model execution. This includes converting raw inputs into structured formats that conform to predefined conventions, making them suitable for seamless ingestion by machine learning algorithms. Proper formatting improves the reliability and compatibility of the system during runtime. Feature selection plays a vital role in identifying the most informative and relevant features for training. This reduces dimensionality, mitigates overfitting, and enhances model interpretability and efficiency. In this study, feature selection is implemented using Scikit-learn, a widely used Python-based machine learning library. By isolating critical attributes, this process enables the construction of more focused and computationally efficient models.
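A hedged sketch of these preprocessing steps with pandas and Scikit-learn is shown below; the file name, column names, the ANOVA-based selector, and the choice of k are assumptions made for illustration only.

# Illustrative preprocessing: cleaning, formatting, and feature selection.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_classif

df = pd.read_csv("Obfuscated-MalMem2022.csv")                # assumed file name

# Data cleaning: drop duplicate records and rows with missing values.
df = df.drop_duplicates().dropna()

# Data formatting: encode the binary label and scale the numerical features.
y = (df["Class"] == "Malware").astype(int)
X = df.drop(columns=["Category", "Class"])
X = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)

# Feature selection: keep the k most informative features (ANOVA F-test).
selector = SelectKBest(score_func=f_classif, k=30)            # k chosen for illustration
X_selected = selector.fit_transform(X, y)
print(list(X.columns[selector.get_support()]))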

 

Table 1. Different features of the dataset.

Feature Type No. Feature Name Data Type
Label 1. Category Categorical
Process Information 2. pslist.nproc Numerical
3. pslist.nppid Numerical
4. pslist.avg_threads Numerical
5. pslist.nprocs64bit Numerical
6. pslist.avg_handlers Numerical
DLL Information 7. dlllist.ndlls Numerical
8. dlllist.avg_dlls_per_proc Numerical
Handles Information 9. handles.nhandles Numerical
10. handles.avg_handles_per_proc Numerical
11. handles.nport Numerical
12. handles.nfile Numerical
13. handles.nevent Numerical
14. handles.ndesktop Numerical
15. handles.nkey Numerical
16. handles.nthread Numerical
17. handles.ndirectory Numerical
18. handles.nsemaphore Numerical
19. handles.ntimer Numerical
20. handles.nsection Numerical
21. handles.nmutant Numerical
Loader Modules Information 22 ldrmodules.not_in_load Numerical
23 ldrmodules.not_in_init Numerical
24 ldrmodules.not_in_mem Numerical
25 ldrmodules.not_in_load_avg Numerical
26 ldrmodules.not_in_init_avg Numerical
27 ldrmodules.not_in_mem_avg Numerical
Memory Analysis 28 malfind.ninjections Numerical
29 malfind.commitCharge Numerical
30 malfind.protection Numerical
31 malfind.uniqueInjections Numerical
Psxview Information 32 psxview.not_in_pslist Numerical
33 psxview.not_in_eprocess_pool Numerical
34 psxview.not_in_ethread_pool Numerical
35 psxview.not_in_pspcid_list Numerical
36 psxview.not_in_csrss_handles Numerical
37 psxview.not_in_session Numerical
38 psxview.not_in_deskthrd Numerical
39 psxview.not_in_pslist_false_avg Numerical
40 psxview.not_in_eprocess_pool_false_avg Numerical
41 psxview.not_in_ethread_pool_false_avg Numerical
42 psxview.not_in_pspcid_list_false_avg Numerical
43 psxview.not_in_csrss_handles_false_avg Numerical
44 psxview.not_in_session_false_avg Numerical
45 psxview.not_in_deskthrd_false_avg Numerical

Module and Service Information 46 modules.nmodules Numerical
47 svcscan.nservices Numerical
48 svcscan.kernel_drivers Numerical
49 svcscan.fs_drivers Numerical
50 svcscan.process_services Numerical
51 svcscan.shared_process_services Numerical
52 svcscan.interactive_process_services Numerical
53 svcscan.nactive Numerical
Callback Information 54 callbacks.ncallbacks Numerical
55 callbacks.nanonymous Numerical
56 callbacks.ngeneric Numerical
Label 57 Class Categorical

 

Modeling

In the modeling phase, anomaly detection is initially performed using an autoencoder (AE) [19]. The AE consists of two primary components: an encoder and a decoder. The encoder compresses input data into a lower-dimensional latent representation, while the decoder reconstructs the original input from this compressed form. Though originally an unsupervised learning model, the AE is modified in this study into a semi-supervised learning framework to facilitate both feature extraction and anomaly detection. The model is trained solely on normal samples, enabling it to learn patterns of legitimate behavior in the dataset. By learning the characteristics of benign data, the autoencoder develops a baseline understanding of normal behavior, which allows it to identify anomalies through reconstruction error. Samples that produce reconstruction errors beyond a set threshold are considered abnormal and filtered accordingly. Following this filtering step, two supervised machine learning algorithms—Random Forest [20] and XGBoost [21]—are employed to classify the detected attacks. Random Forest is an ensemble learning method that builds multiple decision trees and aggregates their outputs to improve predictive accuracy and reduce the likelihood of overfitting. It has proven effective in a wide range of classification and regression tasks and excels in handling high-dimensional data [4]. Each decision tree is trained on a random subset of the dataset and feature space, which enhances the overall model robustness and stability. XGBoost, short for Extreme Gradient Boosting, is another ensemble learning technique that combines weak learners (typically decision trees) in a sequential manner to minimize errors [22]. By focusing on instances misclassified by previous models, XGBoost iteratively improves performance and generalizes well to unseen data. It is particularly known for its efficiency and scalability in large-scale machine learning applications.

To further refine model performance, hyperparameter optimization is performed. This involves adjusting parameters such as learning rate, tree depth, and number of estimators, among others, to maximize the models’ accuracy and generalization capability in detecting both known and zero-day attacks.
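As a hedged example of this step, a grid search with cross-validation over a few such hyperparameters could be written as follows; the parameter grids and scoring metric are illustrative, and X_train/y_train denote the preprocessed training data.

# Illustrative hyperparameter tuning for the two classifiers (grids are assumptions).
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

rf_grid = {"n_estimators": [100, 200, 400], "max_depth": [None, 10, 20]}
xgb_grid = {"n_estimators": [200, 400], "learning_rate": [0.05, 0.1, 0.3], "max_depth": [4, 6, 8]}

rf_search = GridSearchCV(RandomForestClassifier(random_state=42), rf_grid, cv=5, scoring="f1")
xgb_search = GridSearchCV(XGBClassifier(eval_metric="logloss"), xgb_grid, cv=5, scoring="f1")

rf_search.fit(X_train, y_train)    # X_train, y_train: preprocessed training data
xgb_search.fit(X_train, y_train)
print(rf_search.best_params_, xgb_search.best_params_)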

Autoencoder

An autoencoder is composed of an encoder and a decoder, with the objective of learning a compressed representation of the input data. The autoencoder formulation can be divided into two main parts: the encoding function (encoder) and the decoding function (decoder). Fig 2 shows the processing flow of the autoencoder.

 

 

Fig 2. Schematic representation of the autoencoder.

 

Encoder Function: The encoding function takes the input data and maps it to a lower-dimensional representation. The formula for the encoder can be represented as follows:

h = f (x) (1)

where:

x: Input Data.

h: Encoded representation.

f(): Encoding function, such as a dense layer. The rectified linear unit (ReLU) and sigmoid are used in this research.

 

Decoder Function: The decoding function reconstructs the original input data from the encoded representation. The formula for the decoder can be represented as follows:

x^ = g(h)           (2)

where:

x^: Reconstructed data.

h: Encoded representation.

g(): Decoding function, also implemented as a neural network layer, is the mirror image of the encoding layer.

Objective Function (Loss Function): The training of an autoencoder involves minimizing a loss function, which measures the difference between the input data and the reconstructed data. Common loss functions include mean squared error (MSE) for continuous data or binary cross-entropy for binary data. The objective function can be represented as:

L(x; x^) (3)

where:

L: Loss function.

x: Input data.

x^: Reconstructed data.

Overall Autoencoder Objective: The overall objective of training an autoencoder is to minimize the reconstruction error, i.e., the difference between the input data and the reconstructed data. This is achieved by adjusting the weights and biases of the neural network during the training process.

min_θ (1/N) Σ_{i=1}^{N} L(x_i; g(f(x_i)))    (4)

where:

θ: Parameters (weights and biases) of the autoencoder neural network.

N: Number of training samples.

x_i: Individual training samples.
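A compact Keras implementation consistent with Eqs (1)-(4) might look as follows; the layer sizes, epochs, and batch size are assumptions, and X_benign denotes the scaled matrix of benign training samples.

# Autoencoder sketch for Eqs (1)-(4): ReLU encoder, sigmoid output layer, MSE loss.
import tensorflow as tf

n_features = X_benign.shape[1]                               # X_benign: benign samples
inputs = tf.keras.Input(shape=(n_features,))
h = tf.keras.layers.Dense(32, activation="relu")(inputs)     # encoder f(x)
h = tf.keras.layers.Dense(8, activation="relu")(h)           # latent representation h
d = tf.keras.layers.Dense(32, activation="relu")(h)          # decoder g(h)
outputs = tf.keras.layers.Dense(n_features, activation="sigmoid")(d)

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")            # minimizes Eq (4)
autoencoder.fit(X_benign, X_benign, epochs=50, batch_size=64, validation_split=0.1)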

 

Random forest

Random forest is an ensemble learning technique that constructs multiple decision trees during training and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees. The formula for Random Forest can be described in terms of decision trees and ensemble averaging. Fig 3 shows the processing flow of Random Forest.

 

Fig 3. Schematic representation of a random forest.

 

Decision Tree: Random Forest is built upon decision trees; below is the basic formula for a decision tree.

ŷ_i = T(x_i)    (5)

where:

ŷ_i: Predicted output for the i-th observation.

x_i: Input features for the i-th observation.

T(): Decision tree model that maps input features to the predicted output.

Ensemble Averaging: Random Forest combines the predictions from multiple decision trees to make a more robust and accurate prediction. The ensemble prediction is usually obtained by taking a majority vote (for classification) or the average (for regression) of all individual tree predictions. In this study, it is used for classification.

RF(x_i) = (1/N_trees) Σ_{j=1}^{N_trees} T_j(x_i)    (6)

where:

RF(x_i): Random Forest prediction for the i-th observation.

N_trees: The total number of trees in the Random Forest.

T_j(): Prediction from the j-th decision tree in the ensemble.
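A brief Scikit-learn sketch corresponding to Eqs (5) and (6) is given below; the hyperparameter values are illustrative, and X_train, y_train, and X_test denote previously prepared data.

# Random Forest sketch for Eqs (5)-(6); hyperparameter values are assumptions.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Ensemble aggregation: scikit-learn averages the per-tree class probabilities
# (a soft version of the vote in Eq (6)) and predicts the most probable class.
proba = rf.predict_proba(X_test)
y_pred = rf.predict(X_test)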

XGBoost

XGBoost (Extreme Gradient Boosting) [23] is a scalable and efficient implementation of the gradient boosting framework. The general formula for XGBoost can be described as an additive model, where each term corresponds to a weak learner, typically a decision tree, added to the ensemble. Fig 4 shows the processing flow of XGBoost.

The objective is to minimize a regularized objective function that combines a loss term and a regularization term.

The formula for the XGBoost prediction is:

ŷ_i^(t) = Σ_{k=1}^{T} f_k(x_i)    (7)

where:

ŷ_i^(t): Predicted output for the i-th observation at iteration t.

T: The total number of weak learners (trees) in the ensemble.

f_k(x_i): Prediction from the k-th weak learner at iteration t for the i-th observation.
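An equivalent sketch for the XGBoost classifier of Eq (7) is shown below; the hyperparameter values are assumptions, and X_train, y_train, and X_test again denote previously prepared data.

# XGBoost sketch for Eq (7): an additive ensemble of boosted decision trees.
from xgboost import XGBClassifier

xgb = XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=6,
                    eval_metric="logloss")
xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_test)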

 

Random Forest-AE and XGBoost-AE

Step 1: Autoencoder Anomaly Detection Classification

The autoencoder reconstructs input data and classifies it as normal or abnormal based on the reconstruction error. In the training phase, it is trained on normal data, learning the characteristics of normal data and constructing a function that works only on normal data. The training process continuously carries out the reconstruction process, and the model’s weights are updated according to the reconstruction error to ensure the correct reconstruction of normal data. In practice, a threshold is set on the reconstruction error to classify the input data as normal or abnormal, and the corresponding label is assigned.

Fig 4. Schematic representation of XGBoost.

ε(X) = L(x; x^)  (8)

Step 2: Secondary Training: Combining Reconstruction Errors and Labeling

Step 1 explains how the autoencoder worked in this study. By extending the Step 1 method, the normal and abnormal data, which have been fully classified and labeled, are merged into one dataset. This merged dataset is split into a training set and a test set and combined with the reconstruction error as input to the Random Forest and XGBoost classifiers. The classifiers are trained on the training set and evaluated on the test set. The effectiveness of the models will be tested using an unseen dataset generated from the CIC-MalMem-2022 dataset after completing all training.
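The two steps can be summarized in the following hedged sketch, which reuses the Keras autoencoder from the previous subsection; the threshold percentile and the names X_benign, X_all, and y_all (the scaled benign subset, the full feature matrix, and the full label vector) are assumptions.

# Steps 1-2: threshold the reconstruction error (Eq (8)), merge it with the features,
# and train the Random Forest-AE and XGBoost-AE classifiers.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

def reconstruction_error(model, data):
    return np.mean(np.square(model.predict(data) - data), axis=1)

# Step 1: set the threshold on benign reconstruction errors and flag each sample.
threshold = np.percentile(reconstruction_error(autoencoder, X_benign), 99)
errors = reconstruction_error(autoencoder, X_all)
anomaly_flag = (errors > threshold).astype(int)

# Step 2: merge the error signal with the original features and train the classifiers.
X_aug = np.column_stack([X_all, errors, anomaly_flag])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_aug, y_all, test_size=0.2, stratify=y_all, random_state=42)

rf_ae = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)
xgb_ae = XGBClassifier(eval_metric="logloss").fit(X_tr, y_tr)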

Evaluation

Several common performance metrics, such as the confusion matrix, accuracy, precision, recall, F1-score, and Matthews Correlation Coefficient (MCC), are used to evaluate the model’s performance. These metrics are described below. Confusion Matrix (see Table 2):

True Positive (TP): the number of positive samples correctly identified as positive.
False Positive (FP): the number of negative samples incorrectly identified as positive.
True Negative (TN): the number of negative samples correctly identified as negative.
False Negative (FN): the number of positive samples incorrectly identified as negative.

Fig 5 shows a schematic representation of the reconstruction error.

Accuracy: the percentage of all samples that are predicted correctly.

Accuracy = (TP + TN) / (TP + FN + TN + FP)    (9)

Precision: the proportion of samples predicted as positive that are truly positive.

Precision = TP / (TP + FP)    (10)

Recall: the proportion of actual positive samples that are correctly identified; higher values indicate a greater probability that anomalous samples will be judged to be anomalous.

Recall = TP / (TP + FN)    (11)

F1-Score: the harmonic mean of precision and recall; higher values indicate better performance.

F1-Score = 2 × (Recall × Precision) / (Recall + Precision)    (12)

Table 2. Confusion matrix.

Actual \ Predicted Positive Negative
Positive TP FN
Negative FP TN

 

 

Fig 5. Schematic representation of the reconstruction error.

Matthews Correlation Coefficient (MCC): A balanced measure for binary classifications, taking into account all four quadrants of the confusion matrix.

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))    (13)
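For reference, all of these metrics can be computed with Scikit-learn; y_true and y_pred below stand for the ground-truth labels and the classifier's predictions on the evaluation set.

# Evaluation metrics of Eqs (9)-(13) computed with scikit-learn.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, confusion_matrix)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()    # Table 2 entries
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("MCC      :", matthews_corrcoef(y_true, y_pred))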

To achieve the primary objective of this study—accurate detection of unseen data—the model must demonstrate high classification performance on unfamiliar input. To this end, 20% of the CIC-MalMem-2022 dataset is designated as the test set, forming a subset specifically reserved for evaluation. In addition to this, a separate dataset is generated by sampling from the original CIC-MalMem-2022 dataset to simulate unseen data, thereby providing a reliable basis for validating the model’s generalization capability. This synthetic unseen dataset retains the same feature set and dimensional structure as the training data, ensuring consistency and preventing the model from misclassifying due to discrepancies in feature representation.

By sampling from the same dataset, the data preprocessing phase is simplified, maintaining uniformity across all datasets used. A comprehensive evaluation was conducted by testing various train-test split ratios for the proposed RandomForest-AE and XGBoost-AE models. The experimental outcomes, summarized in Table 3, highlight performance across several key metrics and indicate the models’ robustness and generalizability. Among the configurations tested, the 80/20 train-test split yielded the most favorable results. Consequently, this partitioning strategy is adopted throughout the study.

 

Table 3. Comparison of different train-test splits.

Metric XGBoost-AE (80/20) RandomForest-AE (80/20) XGBoost-AE (70/30) RandomForest-AE (70/30) XGBoost-AE (60/40) RandomForest-AE (60/40)
Accuracy 99.96% 99.99% 99.94% 99.96% 99.92% 99.94%
Precision 99.98% 100% 99.95% 99.97% 99.93% 99.96%
Recall 99.96% 99.98% 99.93% 99.95% 99.91% 99.94%
F1-Score 99.97% 99.99% 99.94% 99.96% 99.92% 99.95%
MCC 99.96% 99.99% 99.92% 99.94% 99.90% 99.93%

 

 

Methodology for ensuring truly novel attacks in unseen data

To evaluate the model’s effectiveness in identifying zero-day attacks, a rigorous strategy for handling unseen data was implemented. First, a separate test set was created by withholding a portion of the CIC-MalMem-2022 dataset from the training process. This ensures that the model is evaluated on genuinely unseen data, providing an accurate measure of its generalization capabilities. Second, to simulate novel attack scenarios, a dedicated subset of the dataset was excluded during training. This subset encompasses diverse and complex attack patterns, thereby mimicking real-world conditions where entirely new threats may emerge. Third, the anomaly detection mechanism—based on an autoencoder—is central to identifying these novel attacks. By learning the underlying structure of normal traffic, the autoencoder effectively detects deviations that signal abnormal behavior. This enables the model to distinguish between truly novel attacks and slight variations of known threats, enhancing its utility in dynamic threat environments.
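One possible way to realize this protocol is sketched below; the 10% withholding fraction and the names X_all and y_all are assumptions made for illustration.

# Hedged sketch of the unseen-data protocol: withhold a slice of malicious samples
# as "novel attacks", then apply the usual 80/20 split to the remaining data.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
malicious_idx = np.where(y_all == 1)[0]
novel_idx = rng.choice(malicious_idx, size=int(0.1 * len(malicious_idx)), replace=False)

mask = np.ones(len(y_all), dtype=bool)
mask[novel_idx] = False
X_seen, y_seen = X_all[mask], y_all[mask]
X_novel, y_novel = X_all[~mask], y_all[~mask]   # used only at evaluation time

X_tr, X_te, y_tr, y_te = train_test_split(
    X_seen, y_seen, test_size=0.2, stratify=y_seen, random_state=42)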

Results and discussion

This paper evaluated four machine learning detection models: Random Forest, XGBoost, Random Forest-AE, and XGBoost-AE. The key focus was on assessing the effectiveness of the autoencoder-enhanced models (Random Forest-AE and XGBoost-AE) in detecting previously unseen data in intrusion detection. The results obtained from these models are compared to highlight their performance.

XGBoost-AE and Random Forest-AE performance

The advanced models, XGBoost-AE and Random Forest-AE, incorporate an autoencoder for anomaly detection prior to applying the traditional XGBoost and Random Forest algorithms. This additional step enhances the models’ capability to identify previously unseen data. Figure 7 displays the performance metrics of XGBoost-AE on the training dataset, demonstrating outstanding accuracy, precision, recall, F1 scores, and MCC values—ranging from 0.9998 to 1.

The model’s performance on the test dataset is illustrated in Figure 8, showing an accuracy of 0.999677, precision of 0.999803, recall of 0.999607, F1 score of 0.999705, and an MCC of 0.999607.

Additionally, the confusion matrix for the test set provides further insights: 4195 true positives were correctly classified, with only 1 false positive. There were 2 false negatives and 5092 true negatives. Figure 9 presents the XGBoost-AE model’s performance on the unseen dataset.

These results highlight the ability of the XGBoost-AE model to maintain high performance even when faced with unfamiliar data, underlining its robustness and reliability.

Figures 10 and 11 depict the performance metrics of the Random Forest-AE model on the training and test datasets, respectively. These metrics—including accuracy, precision, recall, F1 score, and MCC—achieved values between 0.9998 and 1, similar to the performance of XGBoost-AE, indicating strong model stability during training.

Figure 12 shows the Random Forest-AE model’s results on the unseen dataset. The model continues to perform at a high level, achieving near-perfect results across all metrics, confirming its effectiveness in handling previously unseen data.

 

Fig 7. XGBoost-AE training set results.

 

Fig 8. XGBoost-AE test set results.

 

Comparative analysis with traditional models

Table 4 and Figure 13 summarize the performance of all four models evaluated on 20% of the CIC-MalMem-2022 dataset.

Among these, the XGBoost and Random Forest-AE models achieved perfect scores for accuracy, precision, recall, F1 score, and MCC. However, such performance may indicate possible overfitting. The XGBoost-AE and traditional Random Forest models also performed exceptionally well, with near-perfect metrics.

Despite the slight underperformance of XGBoost-AE compared to the other models, all four demonstrated strong results on the test dataset overall.

Fig 9. XGBoost-AE unseen dataset results.

Fig 10. Random Forest-AE training set results.

Fig 11. Random Forest-AE test set results.

Fig 12. Random Forest-AE unseen dataset results.

Discussion

This study introduces two novel models and rigorously compares their results with those reported in prior works by Mezina and Burget [3], Smith et al. [5], and Dener and Orman [6], all of which also utilized the CIC-MalMem-2022 dataset. While the methodologies differ, the shared dataset enables valid comparative analysis.

Table 5 and Figure 14 illustrate the performance of the top models from the referenced studies alongside the newly proposed models in this study.

Despite all models achieving nearly perfect metrics, the Random Forest-AE model from the present study consistently outperformed the referenced models. This enhanced performance is likely due to the integration of an anomaly detection mechanism, which strengthens the model’s ability to detect novel attacks.

Previous studies suggest that the CIC-MalMem-2022 dataset shows a strong fit with feature variables, even after extensive preprocessing. One possible explanation is the lack of sufficient sampling diversity, which may have led to an unbalanced feature distribution. Alternatively, the dataset’s design might include highly effective features, enabling machine learning models to attain nearly flawless performance. These findings underscore the need for future studies focused on evaluating and possibly improving dataset quality and diversity. The aggregated evaluation results in Figure 15 and Table 6 focus on how the four models perform on an unseen dataset.

Traditional Random Forest and XGBoost models, when used without the autoencoder component [25], showed notable declines in all performance metrics when applied to the unseen dataset.

This drop in accuracy, precision, recall, F1 score, and MCC highlights the limited ability of these models to handle previously unencountered data.

In contrast, the autoencoder-enhanced models—Random Forest-AE and XGBoost-AE—exhibited strong consistency across all metrics, with only minimal degradation in performance. This demonstrates the value of the autoencoder’s feature-learning capability, which enhances the models’ adaptability and resilience when dealing with unfamiliar data.

 

The integration of autoencoders into traditional machine learning models significantly enhances their ability to detect previously unseen threats. By learning the typical patterns of normal network traffic, the autoencoder can effectively identify deviations that signal anomalous behavior. This added capability plays a vital role in improving the models’ detection of novel attacks, a necessity in real-world cybersecurity contexts where attack strategies evolve rapidly.

Another notable observation is the consistent performance demonstrated by the autoencoder-enhanced models—XGBoost-AE and Random Forest-AE—across the training, testing, and unseen datasets. Such consistency indicates a high level of generalization, an essential attribute for deploying intrusion detection models in dynamic, real-world environments. This robust performance across diverse data segments suggests that these models are capable of adapting to new attack patterns effectively.

However, the near-perfect scores achieved by all models—especially those based on the CIC-MalMem-2022 dataset—raise concerns about potential overfitting. The dataset’s high alignment with model variables may point to a lack of diversity in the dataset, which could limit its applicability in real-world scenarios. This issue underscores the importance of using more representative datasets or applying additional techniques to simulate real-world variability in future research.

Despite these concerns, the Random Forest-AE model clearly outperformed those proposed in related studies by Balasubramanian et al., Khan, Mezina et al., Smith et al., and Dener et al. This performance differential highlights the advantage of integrating anomaly detection mechanisms into conventional machine learning models. The added layer of anomaly detection provides these enhanced models with a strategic edge, improving their capability to detect and respond to unknown threats with high accuracy and minimal false positives.

While the proposed models demonstrate impressive performance, deploying them in real-world settings brings several challenges. One primary concern is computational cost. Training and implementing models such as Random Forest and XGBoost, especially when combined with autoencoders, require substantial computational resources. This is particularly evident when working with large-scale datasets like CIC-MalMem-2022, where real-time detection demands significant processing power. Moreover, the scalability of these models must be ensured to accommodate growing network traffic over time. To mitigate these issues, distributed computing frameworks like Apache Spark can be employed to manage and process data efficiently. Furthermore, leveraging hardware accelerators such as GPUs or TPUs can greatly reduce the time required for both training and inference in deep learning models.

Parameter sensitivity is another important consideration. Machine learning models are highly dependent on well-tuned hyperparameters. Factors such as the number of trees in Random Forest, the learning rate in XGBoost, and the architecture of the autoencoder must be precisely configured to optimize performance. At the same time, appropriate regularization is necessary to prevent overfitting. Over-regularization, however, can lead to underfitting and poor generalization. To address this challenge, tools like grid search, random search, and Bayesian optimization can be used to automate the hyperparameter tuning process. Additionally, implementing cross-validation helps assess model performance across different parameter configurations, improving stability and reliability.

Calibration also plays a crucial role in the practical deployment of intrusion detection systems. Accurate model calibration ensures that the predicted probabilities reflect the true likelihood of an event occurring. Poor calibration can result in overconfident predictions, which is especially dangerous in cybersecurity, where false positives or negatives may have significant repercussions. Providing confidence intervals for predictions adds a layer of interpretability and supports informed decision-making in high-stakes environments. Techniques such as Platt scaling and isotonic regression can be employed to improve probability calibration. Moreover, ensemble techniques like stacking can combine predictions from multiple models to enhance overall calibration and model robustness.
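As a hedged illustration, Scikit-learn's CalibratedClassifierCV wraps a classifier with either Platt scaling or isotonic regression; the choice of method and the cross-validation setting below are assumptions, and X_tr, y_tr, and X_te denote previously prepared data.

# Probability calibration sketch for the intrusion detector's output scores.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

base = RandomForestClassifier(n_estimators=200, random_state=42)
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)   # method="sigmoid" gives Platt scaling
calibrated.fit(X_tr, y_tr)
attack_probability = calibrated.predict_proba(X_te)[:, 1]            # calibrated probabilities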

Effective deployment of these models requires thoughtful implementation strategies. One such strategy is incremental learning, which allows models to update themselves continuously as new data becomes available, without undergoing complete retraining. This capability is crucial for adapting to evolving cyber threats in real time and can significantly reduce the computational burden typically associated with periodic full retraining.
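As a hedged illustration of this pattern, the sketch below uses Scikit-learn's partial_fit interface; SGDClassifier is chosen only because it supports incremental updates, and it is not one of the models evaluated in this study.

# Incremental-learning sketch: update a detector batch by batch without full retraining.
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="log_loss", random_state=42)
classes = np.array([0, 1])                      # all classes must be declared up front

for X_batch, y_batch in stream_of_batches:      # placeholder for a real traffic stream
    clf.partial_fit(X_batch, y_batch, classes=classes)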

Ongoing monitoring and maintenance of deployed models are also essential to ensure sustained effectiveness. Continuous performance tracking helps identify degradation early and prompts timely interventions, such as model retraining using newly collected data. Scheduled maintenance and updates ensure that the models stay relevant and responsive to emerging attack vectors, maintaining a high standard of cybersecurity defense.

Lastly, successful implementation also hinges on the seamless integration of these models into existing security infrastructures. The proposed models must be compatible with current protocols and data formats to ensure interoperability with established systems. Ensuring this compatibility enables organizations to adopt the models with minimal disruption and facilitates efficient operation within the broader cybersecurity ecosystem.

Table 4. Model evaluation on a 20% test set.

Model Accuracy Precision Recall F1 score MCC
Random Forest-AE 1 1 1 1 1
XGBoost-AE 0.99966 0.99980 0.99960 0.99970 0.99970
Random Forest 0.9997 0.9997 0.9997 0.9997 0.9997
XGBoost 1 1 1 1 1

 

 

 

Fig 13. Model evaluation on a 20% test set.

 

Table 5. Results compared with previous research.

Selected Related Work Model Accuracy Precision Recall F1 Score MCC
Balasubramanian et al. [18] Random Forest 0.99990 1.00000 0.99980 0.99990 0.99980
Khan [24] Artificial Neural Network 0.99720 0.99000 0.99000 1.00000 0.98260
Mezina and Burget [3] Random Forest 0.99992 0.99992 0.99992 0.99992 0.99992
Smith et al. [5] Decision Tree 0.99990 1.00000 1.00000 1.00000 1.00000
Dener and Orman [6] Logistic Regression 0.99970 0.99980 0.99970 0.99970 0.99970
Proposed by this research Random Forest-AE 1.00000 1.00000 1.00000 1.00000 1.00000
Proposed by this research XGBoost-AE 0.99966 0.99980 0.99960 0.99970 0.99970

 

Fig 14. Previous research model evaluation on a 20% test set.

Fig 15. Model evaluation on an unseen data test set.


Future Work

To enhance the validity of the proposed models and address the limitations identified in this study, future investigations will focus on evaluating the models using a broader range of publicly available datasets. Conducting cross-dataset validation will be instrumental in assessing the models’ generalizability to various cyber threat landscapes and diverse data distributions. Furthermore, continued research will explore the practical challenges associated with real-world deployment—specifically those related to computational demands, sensitivity to hyperparameters, and the need for precise model calibration. Addressing these aspects will be vital for ensuring that the models are both scalable and reliable in operational cybersecurity environments.

 

Conclusion

This research introduces an innovative methodology for detecting zero-day attacks by integrating autoencoders with established machine learning classifiers, namely Random Forest and XGBoost. The hybrid models—Random Forest-AE and XGBoost-AE—combine the anomaly detection capabilities of autoencoders with the classification strength of supervised learning, enabling the effective identification of novel cyber threats. Through comprehensive evaluation using the CIC-MalMem-2022 dataset, the proposed models demonstrated outstanding performance across key metrics, surpassing several state-of-the-art approaches.

 

 

Specifically, the Random Forest-AE model attained an accuracy of 99.9892%, a precision of 100%, a recall of 99.9803%, an F1 score of 99.9901%, and a Matthews correlation coefficient (MCC) of 99.8313%. Similarly, the XGBoost-AE model achieved an accuracy of 99.9741%, a precision of 100%, a recall of 99.9533%, an F1 score of 99.9976%, and an MCC of 99.8002%. These results underscore the models’ effectiveness in generalizing to previously unseen threats and validate the advantage of incorporating anomaly detection mechanisms into traditional classification models.

 

References

1. Shetty S, Musa M, Brédart X. Bankruptcy Prediction Using Machine Learning Techniques. Journal of Risk and Financial Management. 2022;15:35. https://doi.org/10.3390/jrfm15010035
2. de Araujo-Filho PF, Naili M, Kaddoum G, Fapi ET, Zhu Z. Unsupervised GAN-Based Intrusion Detection System Using Temporal Convolutional Networks and Self-Attention. IEEE Transactions on Network and Service Management. 2023;20:4951–4963. https://doi.org/10.1109/TNSM.2023.3260039
3. Mezina A, Burget R. Obfuscated malware detection using dilated convolutional network. 2022 14th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT). IEEE; 2022. pp. 110–115. https://doi.org/10.1109/ICUMT57764.2022.9943443
4. Farzamnia A, Hlaing NW, Haldar MK, Rahebi J. Channel estimation for sparse channel OFDM systems using least square and minimum mean square error techniques. 2017 International Conference on Engineering and Technology (ICET). IEEE; 2017. pp. 1–5. https://doi.org/10.1109/ICEngTechnol.2017.8308193
5. Smith D, Khorsandroo S, Roy K. Supervised and Unsupervised Learning Techniques Utilizing Malware Datasets. 2023 IEEE 2nd International Conference on AI in Cybersecurity (ICAIC). IEEE; 2023. pp. 1–6. https://doi.org/10.1109/ICAIC57335.2023.10044169
6. Dener M, Ok G, Orman A. Malware Detection Using Memory Analysis Data in Big Data Environment. Applied Sciences. 2022;12:8604. https://doi.org/10.3390/app12178604
7. Choubisa M, Doshi R, Khatri N, Kant Hiran K. A Simple and Robust Approach of Random Forest for Intrusion Detection System in Cyber Security. 2022 International Conference on IoT and Blockchain Technology (ICIBT). IEEE; 2022. pp. 1–5. https://doi.org/10.1109/ICIBT52874.2022.9807766
8. Soltani M, Ousat B, Jafari Siavoshani M, Jahangir AH. An adaptable deep learning-based intrusion detection system to zero-day attacks. Journal of Information Security and Applications. 2023;76:103516. https://doi.org/10.1016/j.jisa.2023.103516
9. Mbona I, Eloff JHP. Detecting Zero-Day Intrusion Attacks Using Semi-Supervised Machine Learning Approaches. IEEE Access. 2022;10:69822–69838. https://doi.org/10.1109/ACCESS.2022.3187116
10. Prusty BR, Bingi K, Gupta N. Review of Gaussian Mixture Model-Based Probabilistic Load Flow Calculations. 2022 International Conference on Intelligent Controller and Computing for Smart Power (ICICCSP). IEEE; 2022. pp. 01–05. https://doi.org/10.1109/ICICCSP53532.2022.9862332
11. Liu W-T, Xing H-J. Rotation Based Ensemble of One-Class Support Vector Machines. 2018 International Conference on Machine Learning and Cybernetics (ICMLC). IEEE; 2018. pp. 178–183. https://doi.org/10.1109/ICMLC.2018.8526992
12. Nazarudeen F, Sundar S. Efficient DDoS Attack Detection using Machine Learning Techniques. 2022 IEEE International Power and Renewable Energy Conference (IPRECON). IEEE; 2022. pp. 1–6. https://doi.org/10.1109/IPRECON55716.2022.10059561
13. Yusof MHM, Almohammedi AA, Shepelev V, Ahmed O. Visualizing Realistic Benchmarked IDS Dataset: CIRA-CIC-DoHBrw-2020. IEEE Access. 2022;10:94624–94642. https://doi.org/10.1109/ACCESS.2022.3204690
14. Kanimozhi V, Jacob TP. Artificial Intelligence based Network Intrusion Detection with Hyper-Parameter Optimization Tuning on the Realistic Cyber Dataset CSE-CIC-IDS2018 using Cloud Computing. 2019 International Conference on Communication and Signal Processing (ICCSP). IEEE; 2019. pp. 0033–0036. https://doi.org/10.1109/ICCSP.2019.8698029
15. Zaman M, Eini R, Zohrabi N, Abdelwahed S. A Decision Support System for Cyber Physical Systems under Disruptive Events: Smart Building Application. 2022 IEEE International Smart Cities Conference (ISC2). IEEE; 2022. pp. 1–7. https://doi.org/10.1109/ISC255366.2022.9922493
16. Abhishek L. Optical Character Recognition using Ensemble of SVM, MLP and Extra Trees Classifier. 2020 International Conference for Emerging Technology (INCET). IEEE; 2020. pp. 1–4. https://doi.org/10.1109/INCET49848.2020.9154050
17. Zhang D, Wang J, Zhao X, Wang X. A Bayesian Hierarchical Model for Comparing Average F1 Scores. 2015 IEEE International Conference on Data Mining. IEEE; 2015. pp. 589–598. https://doi.org/10.1109/ICDM.2015.44
18. Balasubramanian KM, Vasudevan SV, Thangavel SK, T GK, Srinivasan K, Tibrewal A, et al. Obfuscated Malware detection using Machine Learning models. 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT). IEEE; 2023. pp. 1–8. https://doi.org/10.1109/ICCCNT56998.2023.10307598
19. Mansouri N, Lachiri Z. Laughter synthesis: A comparison between Variational autoencoder and Autoencoder. 2020 5th International Conference on Advanced Technologies for Signal and Image Processing (ATSIP). IEEE; 2020. pp. 1–6. https://doi.org/10.1109/ATSIP49331.2020.9231607
20. Li S. Application of Random Forest Algorithm in New Media Network Operation Data Push. 2023 IEEE 15th International Conference on Computational Intelligence and Communication Networks (CICN). IEEE; 2023. pp. 87–92. https://doi.org/10.1109/CICN59264.2023.10402335
21. Zhou Y, Song X, Zhou M. Supply Chain Fraud Prediction Based On XGBoost Method. 2021 IEEE 2nd International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE). IEEE; 2021. pp. 539–542. https://doi.org/10.1109/ICBAIE52039.2021.9389949
22. Wang X, Zhou Q. LTE Network Quality Analysis Method Based on MR Data and XGBoost Algorithm. 2020 5th IEEE International Conference on Big Data Analytics (ICBDA). IEEE; 2020. pp. 85–89. https://doi.org/10.1109/ICBDA49040.2020.9101302
23. Chen T, Guestrin C. XGBoost. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM; 2016. pp. 785–794. https://doi.org/10.1145/2939672.2939785
24. Khan LP. Obfuscated Malware Detection Using Artificial Neural Network (ANN). 2023 Fifth International Conference on Electrical, Computer and Communication Technologies (ICECCT). IEEE; 2023. https://doi.org/10.1109/ICECCT56650.2023.10179639
25. Kumari M, Baghel A. Analysis of Variance, Eigen, and Energy (ANOVEE) based Sensing Method for Cognitive Radio Network. 2020 Second International Conference on Inventive Research in Computing Applications (ICIRCA). IEEE; 2020. pp. 1145–1151. https://doi.org/10.1109/ICIRCA48905.2020.91830