Improve the Accuracy of C4.5 Algorithm Using Particle Swarm Optimization (PSO) Feature Selection and Bagging Technique in Breast Cancer Diagnosis
Abstract
Breast cancer is currently the second leading cause of cancer death in women and has become the most common cancer in recent years. Data mining can support early detection by helping to diagnose breast cancer. Classification is one of several data mining tasks, and the decision tree is the most commonly used classification method. C4.5 is a decision tree algorithm that is often used for classification. This study used the Breast Cancer Wisconsin (Original) Data Set (1992) obtained from the UCI Machine Learning Repository. Its purpose was to select the features to be used and to overcome the class imbalance in the data, so that the C4.5 algorithm performs better in classification. Particle swarm optimization (PSO) was used for feature selection, and bagging was used to overcome class imbalance. Classification was evaluated with a confusion matrix to determine the resulting accuracy. Applying PSO for feature selection and bagging for class imbalance with the C4.5 algorithm increased accuracy by 5.11 percentage points, from 93.43% to 98.54%.
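As a minimal sketch of how accuracy is read off a binary confusion matrix, and of how the reported gain relates to the two accuracies in the abstract, the arithmetic can be written as follows. The function name `accuracy` and the example counts are assumptions made for illustration and are not taken from the study; only the values 93.43 and 98.54 come from the abstract.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions from binary confusion-matrix counts."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Hypothetical counts for illustration only; NOT the study's confusion matrix.
example = accuracy(tp=45, tn=85, fp=3, fn=5)

# Accuracies reported in the abstract, in percent.
baseline, improved = 93.43, 98.54

# Absolute gain in percentage points (the figure quoted in the abstract).
gain_points = round(improved - baseline, 2)

# Relative gain over the baseline, in percent, for comparison.
relative_gain = round((improved - baseline) / baseline * 100, 2)

print(f"example accuracy: {example:.4f}")
print(f"gain: {gain_points} percentage points ({relative_gain}% relative)")
```

The distinction matters when comparing against other studies: the quoted 5.11 is the absolute difference between the two accuracies, while the gain relative to the baseline is slightly larger.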
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.