Ensemble learning technique to improve breast cancer classification model

ABSTRACT


INTRODUCTION
Cancer is a non-communicable disease characterized by the persistent, malignant growth of abnormal cells or tissues, which can impair the function of the affected tissues. Cancer cells arise from various organ-forming elements and can form tumor masses through excessive cell division and spread through blood vessels or lymph nodes [1], [2]. One of the deadliest types of cancer is breast cancer. Breast cancer affects both men and women, but it is far more common in women and very rare in men [3], [4]. Breast cancer is highly heterogeneous and is currently the second leading cause of cancer deaths in women. Breast cancer can be divided into benign and malignant types [5]. Benign breast cancer is a non-invasive form that rarely endangers the patient's life. Benign breast cancers are found in the lining of the breast ducts and do not spread to the surrounding tissues [6].
According to an article published in CA: A Cancer Journal for Clinicians in 2020, there were 2.3 million women with breast cancer, or about 11.7% of all newly diagnosed cancer cases. Breast cancer therefore has a higher prevalence than lung cancer, whose prevalence is 11.5%. Hyuna Song, a senior scientist and epidemiologist at the American Cancer Society, said breast cancer is the most common of all cancers, with cases rising to 2.3 million from 2,088,849 in 2018 [7]. According to 2018 data from the Global Cancer Observatory (GLOBOCAN), breast cancer ranks second among the five cancers with the highest number of patient deaths in Indonesia. Of the 207,210 total deaths, 11% or 22,692 were due to breast cancer [8].
Previous studies have not yet achieved optimal classification accuracy. This study aims to improve the accuracy of breast cancer classification and to predict the level of breast cancer more reliably. We therefore propose a model based on the ensemble learning technique [9]. This technique makes it possible to combine several methods; here we combine the decision tree, random forest, and logistic regression methods [10], [11] to produce better results.
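As an illustration of how the outputs of three classifiers can be combined, the following is a minimal sketch that assumes hard majority voting; the text does not state which combination rule is used, so the rule, the function name `majority_vote`, and the example labels are all hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label per sample by majority vote.

    `predictions` is a list with one entry per model; each entry is a list of
    predicted labels, one per sample. Hard voting is an assumption made for
    this sketch, not a rule taken from the study.
    """
    if not predictions:
        raise ValueError("at least one model's predictions are required")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must predict the same number of samples")

    combined = []
    for sample_votes in zip(*predictions):
        # most_common(1) returns the label with the highest vote count.
        combined.append(Counter(sample_votes).most_common(1)[0][0])
    return combined


# Hypothetical outputs of the three base models (0 = benign, 1 = malignant).
tree_pred = [0, 1, 1, 0]
forest_pred = [0, 1, 0, 0]
logreg_pred = [1, 1, 0, 0]

ensemble_pred = majority_vote([tree_pred, forest_pred, logreg_pred])
```

With three binary classifiers a tie cannot occur, which is one practical reason an odd number of base models is convenient for this kind of voting.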

METHOD
The design flow of the proposed algorithm is shown in Figure 1. The algorithm used is ensemble learning, which combines three methods. Each process in Figure 1 is explained in detail in the following sections.

Dataset Collection
The dataset used in this study is public, namely the Crude Oil WTI standard (CL=F) dataset obtained from the finance.yahoo.com website. A total of 1,058 data points were used, covering January 3, 2017 to March 31, 2021, accessed on April 5, 2021, with West Texas Intermediate (WTI) standard prices in U.S. dollars. The price used in this study is the close price because it can serve as a reference for predicting the open price on the next day. The data is divided into 70% training data and 30% test data. This split follows the research conducted by [12], which achieved an accuracy rate of 99.25%. Table 1 shows the daily close price of world crude oil.
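A 70/30 split of a time series can be sketched as follows. The sketch assumes the split is chronological (the earliest 70% for training, the remainder for testing) and that the cut index is rounded down; neither detail is stated in the text, and `chronological_split` is a hypothetical helper name.

```python
def chronological_split(series, train_fraction=0.7):
    """Split an ordered series into a leading training part and a trailing test part."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    cut = int(len(series) * train_fraction)  # rounds down (assumption)
    return series[:cut], series[cut:]


# Placeholder values standing in for the 1,058 daily close prices.
prices = list(range(1058))
train, test = chronological_split(prices)
```

Under these assumptions, 1,058 observations yield 740 training points and 318 test points.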

Data Normalization
So that the method can accept the data as input, the data are normalized to the interval [0, 1] using Equation 1.
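A minimal sketch of scaling to the interval [0, 1] is given below. It assumes Equation 1 is the usual min-max formula, x' = (x - min) / (max - min), which is an assumption rather than something the text confirms; the function names and sample prices are hypothetical.

```python
def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant series")
    return [(v - lo) / (hi - lo) for v in values]


def denormalize(scaled, lo, hi):
    """Map values from [0, 1] back to the original price scale."""
    return [s * (hi - lo) + lo for s in scaled]


close_prices = [10.0, 20.0, 30.0]  # hypothetical close prices
scaled = min_max_normalize(close_prices)
restored = denormalize(scaled, min(close_prices), max(close_prices))
```

The inverse mapping matters in practice because error measures such as MSE and MAPE are usually reported on the original price scale rather than on the normalized one.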

Parameter Testing
At this stage, the ANN-BP and PSO parameters are tested. The ANN-BP parameters tested are the number of input neurons and hidden neurons, the number of iterations (epochs), and the learning rate. The PSO parameters tested are the number of epochs and the values of r1 and r2. Parameter testing was carried out using the ANN-BP training process with 70% of the dataset. After testing, the best parameter values are selected according to the lowest MSE and MAPE results, as listed in Table 3. The ANN-BP algorithm is shown in Figure 2 [13].

Model Determination
The model is determined with the ANN-BP-PSO method through a training process that uses 70% of the total data. The optimization carried out by PSO in ANN-BP aims to produce the lowest error rate. PSO optimizes the ANN-BP weight updates, which is expected to increase prediction accuracy. This process continues until the ANN-BP and PSO epochs have reached their limits. Execution with this combined method takes a long time, and the duration depends on the number of epochs used.
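The way PSO searches for a low-error weight vector can be sketched as below. This is the textbook PSO velocity and position update, assumed here for illustration and not confirmed as the authors' exact procedure. The defaults (15 particles, 16 epochs, inertia weight 0.5, c1 = 1, c2 = 1.5) follow the values reported in the results section, while the loss function is a toy stand-in (a single linear neuron on made-up data), not the ANN-BP network.

```python
import random


def pso_minimize(loss, dim, n_particles=15, epochs=16, w=0.5, c1=1.0, c2=1.5, seed=0):
    """Minimize `loss` over `dim`-dimensional vectors with basic PSO."""
    rng = random.Random(seed)
    positions = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]

    # Each particle remembers its own best position (Pbest).
    pbest = [p[:] for p in positions]
    pbest_loss = [loss(p) for p in positions]

    # The swarm shares the best position found so far (Gbest).
    best = min(range(n_particles), key=pbest_loss.__getitem__)
    gbest, gbest_loss = pbest[best][:], pbest_loss[best]
    history = [gbest_loss]

    for _ in range(epochs):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                velocities[i][d] = (
                    w * velocities[i][d]
                    + c1 * r1 * (pbest[i][d] - positions[i][d])
                    + c2 * r2 * (gbest[d] - positions[i][d])
                )
                positions[i][d] += velocities[i][d]
            current = loss(positions[i])
            if current < pbest_loss[i]:
                pbest[i], pbest_loss[i] = positions[i][:], current
                if current < gbest_loss:
                    gbest, gbest_loss = positions[i][:], current
        history.append(gbest_loss)
    return gbest, gbest_loss, history


# Toy data generated by y = 2x + 1 (hypothetical, for illustration only).
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [2.0 * x + 1.0 for x in xs]


def toy_mse(weights):
    """MSE of a single linear neuron: prediction = weights[0] * x + weights[1]."""
    errors = [(weights[0] * x + weights[1] - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


best_weights, best_loss, loss_history = pso_minimize(toy_mse, dim=2)
```

Because Gbest is replaced only when a particle finds a strictly lower loss, the recorded best loss never increases from one epoch to the next.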

Prediction
The prediction process uses the remaining 30% of the total data. It follows the ANN-BP testing stages, using the model obtained from the ANN-BP and PSO training processes.

Figure 2. ANN-BP algorithm [14]
The method's success in this study is determined using two indicators of predictive accuracy: Mean Square Error (MSE) and Mean Absolute Percentage Error (MAPE). MSE evaluates a forecasting model by squaring each error (residual), summing the squares, and dividing the sum by the number of observations [15]-[17]. The MSE formula is given in Equation 2.

\[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \tag{2} \]

where $n$ is the number of data points, $y_i$ is the observed value, and $\hat{y}_i$ is the predicted value.

Because it can be applied in various contexts, is easily understood, and is dependable, MAPE is regarded as the most widely used method for measuring accuracy [14], [18]. MAPE indicates how large the forecasting error is compared to the actual value [19], [20]. The MAPE formula is given in Equation 3.

\[ \mathrm{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right| \tag{3} \]

where $y_t$ is the time series value in period $t$, $\hat{y}_t$ is the forecast value in period $t$, and $n$ is the total number of observations.
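The two error measures described above can be computed with a few lines of code. The following is a minimal sketch using the standard definitions of MSE and MAPE; the function names and sample values are hypothetical.

```python
def mean_squared_error(observed, predicted):
    """Average of the squared differences between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - p) ** 2 for y, p in zip(observed, predicted)) / len(observed)


def mean_absolute_percentage_error(observed, predicted):
    """Average absolute error relative to the observed value, in percent."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(y == 0 for y in observed):
        raise ValueError("MAPE is undefined when an observed value is zero")
    ratios = [abs((y - p) / y) for y, p in zip(observed, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


observed = [100.0, 200.0]   # hypothetical actual prices
predicted = [110.0, 190.0]  # hypothetical forecasts

mse = mean_squared_error(observed, predicted)
mape = mean_absolute_percentage_error(observed, predicted)
```

In this example both forecasts miss by 10 units, so the MSE is 100, while the MAPE is 7.5% because the same absolute error weighs less against the larger observed value.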

RESULTS AND DISCUSSIONS
The MSE and MAPE values obtained when determining the model with the ANN-BP and ANN-BP-PSO methods on the 70% training data are given in Table 5. The PSO algorithm succeeded in optimizing the weight parameter (w) in the ANN-BP training process. ANN-BP and PSO training run in each iteration. The best position is obtained, followed by updating the weight, speed, position, Pbest, and fitness to determine Gbest until the iterations are complete. Selecting the parameter values through parameter testing improved accuracy. The parameters obtained from the testing process were the architecture of the ANN-BP model and the PSO parameter values. The PSO parameter values comprised 15 particles, 5 popsize, an epoch value of 16, a c1 value of 1, a c2 value of 1.5, and an inertia weight value of 0.5. Meanwhile, the ANN-BP model architecture comprised 5 input layers, 3 hidden layers, 1 output layer, an epoch value of 60, and a learning rate value of 0.2. The prediction process yields MSE and MAPE values. With the combined ANN-BP and PSO methods, the MSE and MAPE are 7.15827 and 5.02007%, respectively. With the ANN-BP method alone, they are 13.86345 and 6.28323%. The smaller the MSE value obtained, the better the forecasting performance [21].
Although the PSO algorithm can improve the accuracy and minimize the error value of the ANN-BP method, the training process is quite time-consuming [22]. This is because each epoch in ANN-BP performs weight update calculations in each PSO epoch. Therefore, as the numbers of ANN-BP and PSO epochs increase, the weight update process also takes longer.
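The reported prediction errors can be compared directly with simple arithmetic. The sketch below uses only the four values quoted in the text; the helper names are hypothetical, and the relative reduction is an additional way of reading the same numbers rather than a figure reported by the study.

```python
# Prediction-stage errors as reported in the text.
ann_bp = {"mse": 13.86345, "mape": 6.28323}
ann_bp_pso = {"mse": 7.15827, "mape": 5.02007}


def absolute_reduction(baseline, improved):
    """Difference between the baseline error and the improved error."""
    return baseline - improved


def relative_reduction(baseline, improved):
    """Reduction expressed as a percentage of the baseline error."""
    return 100.0 * (baseline - improved) / baseline


mape_drop = round(absolute_reduction(ann_bp["mape"], ann_bp_pso["mape"]), 5)
mse_drop = round(absolute_reduction(ann_bp["mse"], ann_bp_pso["mse"]), 5)
mape_drop_relative = relative_reduction(ann_bp["mape"], ann_bp_pso["mape"])
mse_drop_relative = relative_reduction(ann_bp["mse"], ann_bp_pso["mse"])
```

The MAPE difference is 1.26316 percentage points and the MSE difference is 6.70518, so the relative improvement in MSE is considerably larger than the relative improvement in MAPE.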
The study using the ANN-BP-PSO model obtained better forecasting results with a high level of accuracy in predicting crude oil prices based on daily time series compared to studies that used the ARIMA method [23], Edited Nearest Neighbor (ENN) [24], Local Mean Decomposition (LMD)-ARIMA [25], and Naive [24], [25].

CONCLUSION
Applying the PSO algorithm to optimize the weight parameter (w) of ANN-BP improves the prediction quality for crude oil prices, as evidenced by the MSE and MAPE of ANN-BP-PSO being lower than those of ANN-BP alone. In the testing process, the ANN-BP-PSO method produced MSE and MAPE values of 7.15827 and 5.02007%, which indicates that the method is classified as very good and has a smaller error rate than ANN-BP alone. The MAPE decreased by 1.26316 percentage points compared to using only the ANN-BP model, which had MSE and MAPE values of 13.86345 and 6.28323% on the WTI standard Crude Oil object (CL=F).

Figure 1. Flow of the proposed algorithm

Figure 1. ANN-BP and PSO model determination flowchart

Table 1. World crude oil close price

Table 2 shows the world crude oil price dataset before and after normalization in the interval [0, 1].

Table 2. Dataset before and after normalization

Table 3. Best parameter results PSO and ANN-BP

Table 4. Results of MSE and MAPE training process

The MSE and MAPE values generated from the training process in the search for the best parameter model using the ANN-BP and PSO methods are 1.96737 and 1.85356%, respectively, while the MSE and MAPE values using only the ANN-BP method in the training process are 2.25938 and 3.03976%, with a PSO fitness result of 0.9818. These results indicate that PSO has optimized ANN-BP to obtain a smaller error value, so the prediction model is tested in the ANN-BP training process. MSE and MAPE results from the prediction process are shown in Table 6.

Table 5. Prediction results