Spam Detection in Emails Using Machine Learning Techniques: A Review
DOI:
https://doi.org/10.24203/gsdamd68
Keywords:
Spam, Class Imbalance, SMOTE, Feature Selection, Ensemble Learning
Abstract
Despite the vast amounts of data available within email communication systems, spam remains a persistent problem for both users and organizations. Analyzing this data could yield more effective methods for detecting and mitigating spam, but extracting actionable insights from it and using them to build robust spam detection systems is a significant challenge. Traditional approaches such as rule-based filtering and heuristic methods have become increasingly inadequate as spammers' tactics evolve. Machine learning techniques offer a promising alternative by training predictive models on historical email data. However, the effectiveness of these models depends on factors such as class imbalance and the identification of the features most relevant to spam detection. This paper reviews the machine learning techniques employed for spam detection in email communication systems. By examining the strengths and weaknesses of the different approaches, we aim to identify strategies for improving the efficiency and accuracy of spam detection. We also propose a spam detection framework built around ensemble learning models that are trained on datasets balanced with techniques such as SMOTE and that use only the most relevant features. This approach is intended to improve detection performance while reducing false positives, thereby offering a more effective solution to the challenge of spam detection in email systems.
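To make the balancing step mentioned in the abstract concrete, the following is a minimal sketch of the core idea behind SMOTE: each synthetic minority (spam) sample is placed at a random point on the line segment between an existing spam sample and one of its nearest spam neighbours. The function name `smote`, its parameters, and the toy feature vectors are illustrative assumptions, not the authors' implementation; a production system would more likely rely on an established library.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Return n_new synthetic samples interpolated between minority samples.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base sample.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Assumed toy data: each row is [link_count, exclamation_marks].
ham = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 0], [0, 2]]
spam = [[8, 9], [9, 7], [7, 8]]

# Oversample spam until both classes have the same number of samples.
new_spam = smote(spam, n_new=len(ham) - len(spam))
balanced_spam = spam + new_spam
print(len(ham), len(balanced_spam))
```

Because every synthetic point is a convex combination of two real spam samples, it stays inside the region already occupied by the minority class. Oversampling of this kind is normally applied to the training split only, so that evaluation data keeps the original class distribution.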
License
Copyright (c) 2024 Stanley Munga Ngigi, Richard Mathenge, Josphat Karani, Nicholus Muriithi
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Articles published in the International Journal of Computer and Information Technology (IJCIT) are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.