Comparative analysis of naïve bayes and decision tree C4.5 for caesarean section prediction

ABSTRACT


INTRODUCTION
The maternal mortality rate (MMR) in Indonesia is still high. MMR is the number of maternal deaths per 100,000 live births caused by pregnancy, childbirth, the puerperium, or their management, counted from pregnancy through the post-term period and excluding deaths from accidents or falls [1]. Caesarean section is the last resort in labor because of its high risk factors, for both mother and baby [2]. Despite the high risk, the number of caesarean births has increased significantly, particularly in Indonesia. The World Health Organization (WHO) recommends that caesarean deliveries account for roughly 5-15 percent of births in a country. Based on WHO data for 2004-2008 across three continents (Latin America, Africa, and Asia), the lowest caesarean birth rate was in Angola (2.3%) and the highest in China (46.2%). Caesarean birth rates in Indonesia have risen sharply, especially in big cities, with the lowest rate in Southeast Sulawesi (5.5%) and the highest in DKI Jakarta (27.2%) [3].
Information technology continues to develop and is needed to meet the demand for fast and accurate information [4]. Technology has been applied in various fields, including the health sector [5]. Today the health sector is supported by technology that can visualize and predict a patient's condition. Existing patient data can be used to classify a patient's condition. One area that requires such classification is childbirth [6]. Based on the explanation above, an algorithm is needed that can support the work of medical personnel in determining the type of labor [1]. Classification is one of the methods found in data mining [7]. Classification is needed to find patterns so that correct predictions can be produced even in critical conditions [8]. Several algorithms can be used to perform the classification process, including Support Vector Machine (SVM), Naïve Bayes, K-Nearest Neighbor (KNN), Decision Tree, and Artificial Neural Network (ANN). Each of these methods has its own level of accuracy for each object to be classified. This project compares the Decision Tree C4.5 and Naïve Bayes methods for classifying caesarean section deliveries.

METHOD

Application of Naïve Bayes Algorithm
This classifier is based on Bayes' theorem, which gives a way to estimate the posterior probability. The posterior probability of a class estimates how likely an item is to belong to that class given its attributes. Naïve Bayes is the simplest application of Bayes' theorem, because it reduces the computational complexity to a simple multiplication of probabilities [9]. In addition, the Naïve Bayes algorithm can handle data sets with many attributes.
The caesarean section data set is applied to the Naïve Bayes algorithm as follows:
1. Prepare the caesarean section data set.
2. Classify using the Naïve Bayes algorithm.
3. Count the number of classes or labels in the data set.
4. Count the number of cases in each class.
5. Multiply all the class variables.
6. Compare the results for each class.
The following is the Naïve Bayes equation:

P(H|X) = P(X|H) P(H) / P(X)

in which:
X : data or tuple object (class C)
H : hypothesis
P(H|X) : posterior probability of hypothesis H given data X
P(H) : prior probability that the hypothesis H is valid (true)
P(X) : prior probability of tuple X
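The equation above can be evaluated directly. The sketch below plugs in illustrative (hypothetical) probabilities, not values from the paper's dataset, to show how the posterior is computed:

```python
# Minimal sketch of the Naïve Bayes posterior with illustrative numbers.
# Suppose 40% of patients delivered by caesarean section (prior P(H)),
# 70% of caesarean patients had high blood pressure (likelihood P(X|H)),
# and 50% of all patients had high blood pressure (evidence P(X)).
p_h = 0.4          # prior probability of the caesarean class, P(H)
p_x_given_h = 0.7  # likelihood of the attribute given the class, P(X|H)
p_x = 0.5          # prior probability of the attribute, P(X)

# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(round(p_h_given_x, 2))  # 0.56
```

In the full classifier, this posterior is computed for each class and the class with the largest value is chosen.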

Application of Decision Tree C4.5 Algorithm
The C4.5 algorithm [10] is used in data mining as a decision tree classifier [11], which can be employed to generate a decision based on a given sample of data (univariate or multivariate predictors). The Decision Tree C4.5 algorithm is applied in this research as follows [12].

Cleaning Data
In data cleaning, the dataset is checked for missing values; if any are found, the affected data must be treated [13]. The dataset used in this study, shown in Figure 1, contains no missing values, so we can proceed to the next step.
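The missing-value check described above can be done with pandas. The rows and column names below are illustrative stand-ins modeled on the attributes described in the text, not the actual dataset:

```python
import pandas as pd

# Illustrative rows; column names follow the attributes described in the text.
df = pd.DataFrame({
    "age": [22, 26, 35],
    "delivery_number": [1, 2, 3],
    "blood_pressure": [1, 0, 2],
    "caesarian": [0, 1, 1],
})

# Count missing values per column, then in total.
missing_per_column = df.isnull().sum()
total_missing = missing_per_column.sum()
print(total_missing)  # 0 -> no missing values, so we can proceed
```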

Determining Independent Variables and Dependent Variables
The dependent variable used here is the caesarian variable, because we want to determine whether a patient is classified as caesarean or normal labor. The other variables, namely age, delivery number, delivery time, blood pressure, and heart problem, become the independent variables, as can be seen in Figure 2.
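Separating the variables amounts to splitting the data frame into features and target. A sketch, again with illustrative rows rather than the real dataset:

```python
import pandas as pd

# Illustrative rows; column names follow the attributes listed in the text.
df = pd.DataFrame({
    "age": [22, 26, 35, 28],
    "delivery_number": [1, 2, 3, 1],
    "delivery_time": [0, 1, 0, 2],
    "blood_pressure": [1, 0, 2, 1],
    "heart_problem": [0, 1, 0, 0],
    "caesarian": [0, 1, 1, 0],
})

X = df.drop(columns=["caesarian"])  # independent variables
y = df["caesarian"]                 # dependent variable
print(list(X.columns))
```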

Normalization
Normalization rescales real numeric attributes into the range 0 to 1. Because the dataset contains values outside this range, the normalization stage is carried out so that the data falls within the range 0 to 1.

Data Testing and Data Training
The Naïve Bayes classifier used here is contained in the sklearn package [6]. Classification requires testing data and training data; dividing the data set into the two fits it to the algorithm model. The data are divided in a ratio of 75% training data to 25% testing data, with a random state of 123. The random state value itself is arbitrary: it is a seed that controls how the data are shuffled, so fixing it at 123 makes the random split reproducible and every run yields the same result.
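The 75/25 split with a fixed seed can be sketched with sklearn's train_test_split; the arrays below are stand-ins for the caesarean dataset, which is assumed here to have 80 rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 80 rows, one feature column, binary labels.
X = np.arange(80).reshape(80, 1)
y = np.array([0, 1] * 40)

# test_size=0.25 gives the 75%/25% split described in the text;
# random_state=123 is a seed, so every run produces the same split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=123)
print(len(X_train), len(X_test))  # 60 20
```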

Confusion Matrix
From the confusion matrix in Figure 5, we can see that 5 pregnant women were predicted to deliver normally and actually did deliver normally. Meanwhile, 6 pregnant women were predicted to deliver normally but actually gave birth by caesarean section. Then, 5 pregnant women were predicted to give birth by caesarean section and actually did so. Finally, 4 pregnant women were predicted to give birth by caesarean section but actually delivered normally.
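The counts above can be reproduced with sklearn's confusion_matrix. The labels below are built by hand to match the reported counts (0 = normal, 1 = caesarean is an assumed encoding):

```python
from sklearn.metrics import confusion_matrix

# Hand-built labels matching the counts reported in the text:
# 5 true normals predicted normal, 4 normals predicted caesarean,
# 6 caesareans predicted normal, 5 caesareans predicted caesarean.
y_true = [0] * 5 + [0] * 4 + [1] * 6 + [1] * 5
y_pred = [0] * 5 + [1] * 4 + [0] * 6 + [1] * 5

cm = confusion_matrix(y_true, y_pred)
print(cm)  # rows: actual class, columns: predicted class
```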

Memory Usage
The memory used by the decision tree model is 111.57 MB, as shown in Figure 6.
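The paper does not say how memory was measured; one possible way is Python's built-in tracemalloc module. The list comprehension below is a stand-in workload for training the model:

```python
import tracemalloc

# Track allocations while running a stand-in workload
# (a real measurement would wrap the model-training call instead).
tracemalloc.start()
data = [list(range(100)) for _ in range(1000)]  # stand-in for model fitting
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

peak_mb = peak / (1024 * 1024)
print(f"peak memory: {peak_mb:.2f} MB")
```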

Measure the Program Execution Time
Measuring the program execution time of the decision tree algorithm gives a result of 0.013 seconds, as shown in Figure 8.
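Execution time can be measured with a high-resolution clock such as time.perf_counter; the loop below is a stand-in for the actual training and prediction run:

```python
import time

# Time a stand-in workload with a high-resolution monotonic clock
# (a real measurement would wrap the fit/predict calls instead).
start = time.perf_counter()
total = sum(i * i for i in range(100_000))  # stand-in workload
elapsed = time.perf_counter() - start

print(f"execution time: {elapsed:.3f} seconds")
```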

Level of Accuracy
Decision tree models are created in two steps: induction and pruning. Induction is where the tree is actually built, i.e., all of the hierarchical decision boundaries are set based on the data. Because of the nature of their training, decision trees can be prone to major overfitting. Pruning is the process of removing unnecessary structure from a decision tree, effectively reducing its complexity to combat overfitting, with the added bonus of making the tree easier to interpret. Using this method, the resulting accuracy is shown in Figure 10. After obtaining the Naïve Bayes classification model, its accuracy is calculated using a confusion matrix. The Naïve Bayes classification algorithm produces better results when more training data is used. The accuracy of the Naïve Bayes classification is shown in Figure 11.
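The induction-then-pruning workflow and the accuracy comparison can be sketched as follows. Note the assumptions: sklearn's CART-based DecisionTreeClassifier stands in for C4.5, cost-complexity pruning (ccp_alpha) stands in for C4.5's pruning step, and the synthetic data is a stand-in for the caesarean dataset, so the scores will differ from the paper's 45% and 50%:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 80-row, 5-attribute caesarean dataset.
X, y = make_classification(n_samples=80, n_features=5, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=123)

# Induction builds the full tree; ccp_alpha > 0 then prunes weak branches
# to reduce complexity and combat overfitting.
tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=123)
tree.fit(X_train, y_train)
tree_acc = accuracy_score(y_test, tree.predict(X_test))

# Naïve Bayes model for comparison.
nb = GaussianNB()
nb.fit(X_train, y_train)
nb_acc = accuracy_score(y_test, nb.predict(X_test))

print(f"tree accuracy: {tree_acc:.2f}, naive bayes accuracy: {nb_acc:.2f}")
```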

CONCLUSION
Using the C4.5 and Naïve Bayes classifier models, with accuracy calculated from the confusion matrix, the C4.5 model achieved 45% accuracy while the Naïve Bayes model achieved 50%. The Naïve Bayes classification algorithm produces better results when more training data is used. Thus, on this dataset the Naïve Bayes method is more accurate than the decision tree method.