Email spam detection: a comparison of SVM and Naive Bayes using Bayesian optimization and grid search parameters

Dzaky Budiman; Zayyan Zayyan; Ainun Mardiana; Alfira Aulia Mahrani

doi:10.52465/josre.v2i1.260


Published: Jan 31, 2024


Keywords:

Support vector machine, Naïve Bayes, Grid search, Bayesian optimization, Spam email

Dzaky Budiman

Department of Informatics Engineering, Universitas Negeri Semarang, Indonesia

Zayyan Zayyan

Department of Informatics Engineering, Universitas Negeri Semarang, Indonesia

Ainun Mardiana

Department of Informatics Engineering, Universitas Negeri Semarang, Indonesia

Alfira Aulia Mahrani

Department of Informatics Engineering, Universitas Ahmad Dahlan, Indonesia

Abstract

Spam email remains a major problem, crowding inboxes and annoying users everywhere. Support Vector Machine (SVM) and Naive Bayes are widely used algorithms that have performed well in text classification, including spam detection. This study evaluates the performance of SVM and Naive Bayes for spam email detection, first with default parameters. Bayesian Optimization and Grid Search are then applied to tune the parameters of both models in order to maximize their performance. The study uses a spam email dataset with two classes, spam and ham. Of the three parameter selection methods tested on the SVM algorithm, Bayesian Optimization gave the best results, with accuracy, precision, recall, and F1 score of 98.5642%, 99.4048%, 89.
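The evaluation loop the abstract describes (fit a classifier, try several parameter values, compare accuracy, precision, recall, and F1) can be illustrated with a minimal sketch. This is not the authors' implementation: it is a standard-library stand-in that uses a small multinomial Naive Bayes with additive smoothing, a hypothetical parameter grid for the smoothing value `alpha`, and an invented toy dataset. Bayesian Optimization and SVM are not shown here, and every function name below is an assumption made for illustration.

```python
import math
from collections import Counter


def classification_metrics(y_true, y_pred, positive="spam"):
    """Accuracy, precision, recall and F1, treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}


def train_nb(docs, labels, alpha):
    """Fit a multinomial Naive Bayes model with additive smoothing `alpha` (> 0)."""
    vocab = {w for d in docs for w in d.split()}
    word_counts = {c: Counter() for c in set(labels)}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    log_priors = {c: math.log(n / len(labels)) for c, n in Counter(labels).items()}
    return vocab, word_counts, log_priors, alpha


def predict_nb(model, doc):
    """Return the class with the highest log posterior; unseen words are ignored."""
    vocab, word_counts, log_priors, alpha = model
    scores = {}
    for c in sorted(log_priors):
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        scores[c] = log_priors[c] + sum(
            math.log((word_counts[c][w] + alpha) / total)
            for w in doc.split() if w in vocab)
    return max(scores, key=scores.get)


def grid_search(train, valid, alphas):
    """Try every alpha in the grid and keep the one with the best validation F1."""
    best_alpha, best_f1 = None, -1.0
    for alpha in alphas:
        model = train_nb(train[0], train[1], alpha)
        preds = [predict_nb(model, d) for d in valid[0]]
        f1 = classification_metrics(valid[1], preds)["f1"]
        if f1 > best_f1:
            best_alpha, best_f1 = alpha, f1
    return best_alpha, best_f1


# Invented toy data, for illustration only (not the dataset used in the study).
train = (["win free prize now", "free money offer",
          "meeting at noon", "lunch with team tomorrow"],
         ["spam", "spam", "ham", "ham"])
valid = (["free prize offer", "team meeting tomorrow"], ["spam", "ham"])
alphas = [0.1, 0.5, 1.0]  # hypothetical grid

best_alpha, best_f1 = grid_search(train, valid, alphas)
print(f"best alpha = {best_alpha}, validation F1 = {best_f1:.3f}")
```

Grid Search evaluates every point in the grid, so its cost grows with the number of candidate values per parameter. Bayesian Optimization instead uses the results of earlier trials to choose the next candidate, which is why it is often preferred when each model fit is expensive.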

Issue

Vol. 2 No. 1: January 2024

Section

Articles

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

