Email spam detection: a comparison of SVM and Naive Bayes using Bayesian optimization and grid search parameters

Dzaky Budiman
Zayyan Zayyan
Ainun Mardiana
Alfira Aulia Mahrani

Abstract

Spam email remains a persistent problem, cluttering inboxes and frustrating users. Support vector machines (SVM) and Naive Bayes are widely used algorithms that perform well in text classification, including spam detection. This study evaluates SVM and Naive Bayes for spam email detection under three parameter settings: the default parameters, parameters tuned with Bayesian optimization, and parameters tuned with grid search. The dataset contains two classes, spam and ham. Of the three parameter selection methods tested on SVM, Bayesian optimization gave the best results, with accuracy, precision, recall, and F1 score of 98.5642%, 99.4048%, 89.
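
To make the tuning and scoring steps described in the abstract concrete, the following is a minimal sketch of grid search over one hyperparameter, scored with accuracy, precision, recall, and F1. It is an illustration under stated assumptions, not the study's implementation: the toy corpus, the hand-written multinomial Naive Bayes, the smoothing grid, and all function names are hypothetical, and neither SVM nor Bayesian optimization is shown.

```python
import math
from collections import Counter

# Toy corpus (hypothetical; NOT the dataset used in the study).
TRAIN = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("cheap loans win cash", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch at noon tomorrow", "ham"),
    ("project report attached", "ham"),
]
VALID = [
    ("claim free cash", "spam"),
    ("win money prize", "spam"),
    ("monday project meeting", "ham"),
    ("report for lunch meeting", "ham"),
]
LABELS = ("spam", "ham")


def train_nb(docs, alpha):
    """Fit a multinomial Naive Bayes model with additive smoothing alpha > 0."""
    word_counts = {label: Counter() for label in LABELS}
    class_counts = Counter()
    for text, label in docs:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = set().union(*word_counts.values())
    return {"alpha": alpha, "words": word_counts,
            "classes": class_counts, "vocab": vocab}


def predict_nb(model, text):
    """Return the label with the highest log posterior for one message."""
    total_docs = sum(model["classes"].values())
    vocab_size = len(model["vocab"])
    scores = {}
    for label in LABELS:
        counts = model["words"][label]
        denom = sum(counts.values()) + model["alpha"] * vocab_size
        score = math.log(model["classes"][label] / total_docs)
        for word in text.split():
            if word in model["vocab"]:  # words unseen in training are skipped
                score += math.log((counts[word] + model["alpha"]) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


def evaluate(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


def grid_search(train, valid, alphas):
    """Try every alpha in the grid and keep the one with the best F1."""
    y_true = [label for _, label in valid]
    best = None
    for alpha in alphas:
        model = train_nb(train, alpha)
        y_pred = [predict_nb(model, text) for text, _ in valid]
        metrics = evaluate(y_true, y_pred)
        if best is None or metrics["f1"] > best[1]["f1"]:
            best = (alpha, metrics)
    return best


ALPHA_GRID = [0.1, 0.5, 1.0]  # assumed grid, chosen only for illustration
best_alpha, best_metrics = grid_search(TRAIN, VALID, ALPHA_GRID)
print(best_alpha, best_metrics)
```

Grid search evaluates every candidate value exhaustively, which is simple but scales poorly as the number of hyperparameters grows. Bayesian optimization instead uses the scores of earlier trials to choose the next candidate, so it typically needs fewer evaluations to reach a comparable result.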
