Detailed Analyses and Efficient Identification of Malware Evidence in CLaMP Dataset based on Machine Learning Approaches
DOI:
https://doi.org/10.24203/8ft1r172
Keywords:
Malware Identification, Machine Learning Algorithms, Feature Selection, Windows PE headers
Abstract
Malware is malicious software used to launch attacks of different types in computer networks and cyberspace. Several signature-based and machine learning (ML)-based approaches have been used to identify malware types. However, signature-based detection approaches have been reported to have serious limitations, which has made ML-based malware identification techniques more popular. Despite the promise of ML methods for identifying malware evidence, some ML approaches in the literature have poor detection rates, which may result from the size and nature of the patterns in the datasets used. This study used a dataset named CLaMP to train and test malware classification models. First, comprehensive exploratory analyses of the dataset were carried out in order to better understand its data distributions and to make informed decisions on how to pre-process and apply it for malware identification. For the experiments, two scenarios were established before feeding the data into the learning algorithms. Scenario 1 involves building the malware identification models without data cleaning or feature selection, while scenario 2 involves cleaning the data and selecting promising features for building the models. In scenario 2, the Recursive Feature Elimination (RFE) technique was used to select the promising attributes, which were then used to build the two malware classification models. Naive Bayes (NB) and Logistic Regression (LR) algorithms were used to build the models. The hyperparameters of the two selected algorithms were varied, and the models were tested and validated several times before optimal performance was reached. The results of the models were compared on the selected metrics, namely accuracy, precision, recall, F1-score and Area Under the Curve (AUC).
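As a reminder of how the five comparison metrics are defined, the following minimal sketch computes them in plain Python on a small made-up set of labels and scores. The data, the 0.5 decision threshold and the function names are illustrative assumptions; they are not taken from the study or from the CLaMP dataset.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = malware)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malware flagged as malware
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign flagged as malware
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign passed as benign
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malware missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """AUC as the fraction of (malware, benign) pairs ranked correctly.

    Ties count as half a correct ranking. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical ground truth and classifier scores (illustrative only).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1, 0.7, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics)  # accuracy, precision, recall and f1 are all 0.75 here
print(auc)      # 0.9375 (15 of 16 malware/benign pairs ranked correctly)
```

Recall depends only on how many malware samples are caught, so a model that labels almost everything as malware can score high on recall while doing poorly on the other metrics.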
Experimental results showed that in scenario 1, where the dataset was not pre-processed and all attributes were used for model building, both models obtained poor results on all metrics except recall. However, the NB-based malware identification model performed slightly better than the LR-based model on all metrics. Both the NB-based and LR-based malware identification models performed well in scenario 2, where the dataset was pre-processed and promising features were selected using RFE. This study concluded that the detailed exploratory analyses, data cleaning and feature subset selection helped to achieve promising results from the malware identification models.
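The two-scenario comparison described above could be set up along the following lines with scikit-learn. This is only a sketch under stated assumptions, not the authors' implementation: the data is a synthetic stand-in generated with `make_classification` rather than the CLaMP dataset, standardization stands in for the unspecified data cleaning, the Gaussian variant of Naive Bayes is assumed, and the RFE base estimator and the number of retained features (10) are arbitrary choices. Because the data is synthetic, the scores it prints will not mirror the results reported in the study.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for PE-header features (assumption: NOT the CLaMP data).
X, y = make_classification(n_samples=600, n_features=30, n_informative=8,
                           n_redundant=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)


def evaluate(model, X_tr, X_te):
    """Fit on the training split and score on the held-out split."""
    model.fit(X_tr, y_train)
    pred = model.predict(X_te)
    prob = model.predict_proba(X_te)[:, 1]  # probability of the positive class
    return {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "f1": f1_score(y_test, pred),
        "auc": roc_auc_score(y_test, prob),
    }


# Scenario 2 preparation: scale, then keep the 10 features ranked best by RFE.
# Both steps are fitted on the training split only to avoid leakage.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
selector.fit(X_train_s, y_train)
X_train_sel = selector.transform(X_train_s)
X_test_sel = selector.transform(X_test_s)

model_factories = {
    "NB": GaussianNB,  # assumed NB variant
    "LR": lambda: LogisticRegression(max_iter=1000),
}

results = {}
for name, make_model in model_factories.items():
    # Scenario 1: raw features, no cleaning or selection.
    results[(name, "scenario 1")] = evaluate(make_model(), X_train, X_test)
    # Scenario 2: scaled data restricted to the RFE-selected features.
    results[(name, "scenario 2")] = evaluate(make_model(), X_train_sel, X_test_sel)

for key, scores in results.items():
    print(key, {metric: round(value, 3) for metric, value in scores.items()})
```

Fitting the scaler and the selector on the training split alone keeps the held-out evaluation honest; selecting features on the full dataset would leak test information into the model.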
License
Copyright (c) 2025 M. O. Ayinla, A. M. Oyelakin, U. A. Adeniyi, K. O. Tajudeen, O. J. Olaleye

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
The articles published in International Journal of Computer and Information Technology (IJCIT) are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.