Global recession sentiment analysis utilizing VADER and ensemble learning method with word embedding

ABSTRACT


INTRODUCTION
A global recession is a condition in which the economies of most countries deteriorate as activity in the economic sector declines [1]. Recently, the IMF published unexpectedly bad news for the world economy, namely a decline in world economic growth to 2.9% in 2023 [2]. The IMF also predicts that Indonesia's economy will still be hampered by the global deceleration, with inflation at 9.5% and economic growth at 3.9% [2], [3]. This issue has prompted Indonesians to share their opinions on social media, especially Twitter.
The unstructured sentiment of society is often overlooked, even though the viewpoints it expresses can be extracted [4], [5] and it is the first step in assessing how far the recession has had an effect. A sentiment analysis strategy is therefore needed that can generalize the condition of the community from the sentiment data people convey, so that the government can evaluate the sustainability, prevention, and handling of recessions.
Many studies have therefore developed sentiment analysis techniques on Twitter or other datasets in order to generalize, and even predict, from the processed sentiment data [6], [7]. Text mining is a machine learning method that aims to gather information and achieve high accuracy in classifying predictions on a class of data [8], [9]. Sentiment analysis is part of data mining, and various optimizations with various algorithms have been carried out to achieve high accuracy [10].
Various approaches have been used for Twitter sentiment analysis in contexts such as elections, government policies, and so on [11], [12]. In Twitter sentiment analysis, ensemble learning can help to overcome the diversity and variation of sentiments [13]. The ensemble learning method improves classification accuracy for multiclass segmentation [14]. Studies [15], [16] confirmed that the selection of algorithms for ensemble learning affects the accuracy results and that customized feature extraction is needed to improve accuracy.
Related research has applied various methods for sentiment analysis by combining different data labeling methods, feature extraction techniques, algorithms, and pre-processing steps [17], [18]. Ensemble learning is one of the most popular classification models today. The study in [19] developed an ensemble learning model with Term Frequency-Inverse Document Frequency (TF-IDF) and Bag of Words feature extraction on Twitter datasets retrieved through the API. This research was developed further by [20]. However, the accuracy obtained was 77.15%, which is still low, so the use of feature extraction from deep learning was recommended. Other studies on sentiment analysis [18], [21], [22] contribute various algorithms to the proposed ensemble learning methods, but their performance is not optimal. Sentiment analysis that detects and classifies sentiments properly is therefore needed. From these studies, several findings can be drawn: i) the majority of sentiment analysis work uses existing datasets or data extracted manually with supporting tools; ii) ensemble learning needs to be supported by relevant algorithms, together with feature extraction and methods that can balance the data and minimize overfitting.
The ensemble learning method combines multiple algorithms to find the best result. However, the study in [23] uses Ensemble Deep Learning based on the Majority Voting technique (MVEDL), which gives equal weight to each classifier. This treats every classifier as equally capable and reliable, so the accuracy results are not optimal. Furthermore, the ensemble learning model is improved with bagging techniques and more optimal sentiment weighting. We propose an ensemble learning method that uses the VADER algorithm together with BM25 feature extraction and word embedding, which improves accuracy. Our research contributions are i) improving the accuracy of 2023 global recession sentiment analysis using the ensemble learning method and the VADER algorithm [24], and ii) optimizing performance with BM25 feature extraction and word embedding.

METHOD
The proposed method uses an ensemble learning method that combines several models, namely Logistic Regression, Decision Tree, Random Forest, and SVM. Before the proposed model is applied, the data is preprocessed through labeling, data cleaning, and feature extraction with CountVectorizer using N-grams, BM25, and word embedding. The resulting data is then processed by the proposed model, which produces an evaluation matrix. The proposed method is explained in the following sections and shown in Figure 1.

Dataset
The dataset contains the sentiment of Indonesian people regarding the issue of global recession on Twitter. The data is sourced from Kaggle (https://www.kaggle.com/datasets/hafizhadinda/global-recession-in-2023-sentiment-indonesian), where it was published in January 2023. The dataset contains 3074 sentiments with four columns: id, created at, username, and tweet. The dataset must be labeled before processing, which is done with the VADER algorithm. Sentiment labeling uses positive (1) and negative (0) labels.

VADER
The VADER algorithm [25] is used as a sentiment analysis model because it can measure the emotion of the data. To determine the polarity of a sentence, the compound score computed from its words is classified as positive, negative, or neutral. In this study, the text label is based on the polarity score calculated by normalizing the sum of positive and negative scores. Equation (1) is the VADER calculation used in the labeling process.
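As a minimal sketch of the normalization step described above: the reference VADER implementation maps the summed valence score x to a compound score x / sqrt(x^2 + alpha), with alpha = 15 by default. The function names below and the threshold that turns the compound score into the binary labels are assumptions for illustration; the text does not state how neutral scores are mapped onto the two labels.

```python
import math

def normalize(score: float, alpha: float = 15.0) -> float:
    """Map an unbounded summed valence score into the range [-1, 1]."""
    norm = score / math.sqrt(score * score + alpha)
    return max(-1.0, min(1.0, norm))

def label_from_compound(compound: float, threshold: float = 0.0) -> int:
    """Return 1 (positive) or 0 (negative).

    The threshold is an assumption, not a value given in the paper.
    """
    return 1 if compound > threshold else 0

# Hypothetical summed valence scores for three tweets.
for raw in (3.2, -1.9, 0.0):
    c = normalize(raw)
    print(f"sum={raw:+.1f} compound={c:+.4f} label={label_from_compound(c)}")
```

With a threshold of 0, a tweet whose compound score is exactly neutral falls into the negative class; a different threshold would shift the class balance of the labeled dataset.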

Preprocessing
Data pre-processing begins with a data cleaning stage on the labeled data, because data taken from Twitter usually contains unnecessary words or characters, such as non-alphanumeric punctuation marks, symbols, numbers, usernames (@), retweets, and excess whitespace, that can interfere with the analysis. After data cleaning, the sentiment is processed through several stages, namely case folding, stopword removal, and stemming. A further step removes duplicate data and unnecessary links. The sentiments are then merged back for further processing. The dataset containing sentiment is converted into a dataframe with six columns: id, created at, username, tweet (before data pre-processing), label, and tweet (after data pre-processing). Examples of tweets before and after pre-processing are shown in Table 1.
Table 1. Pre-processing stages
Tweets before pre-processing | Tweets after pre-processing
After going through the plenary session process, continue... | After the plenary session process, continue with the points...
@papa_loren Really reckless even though it's 2023... | How reckless is the world's economic condition…
After cleaning the dataset, the columns for the x variable (tweet) and the y variable (label) are selected. The data is then split at a ratio of 80:20, so 80% (2426 samples) is used as training data and 20% (607 samples) as test data.
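The cleaning and splitting steps of this section can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the regular expressions and function names are hypothetical, stopword removal and stemming are omitted because they need Indonesian-specific resources, and the split is a plain ordered cut rather than a shuffled one.

```python
import re

def clean_tweet(text: str) -> str:
    """Apply the cleaning steps named in the text to one tweet."""
    text = re.sub(r"https?://\S+", " ", text)   # links
    text = re.sub(r"\bRT\b", " ", text)         # retweet marker
    text = re.sub(r"@\w+", " ", text)           # usernames
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # punctuation, symbols, numbers
    text = text.lower()                         # case folding
    return re.sub(r"\s+", " ", text).strip()    # excess whitespace

def drop_duplicates(items):
    """Remove duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))

def split_80_20(items):
    """Split a list into the first 80% (train) and the remaining 20% (test)."""
    cut = int(len(items) * 0.8)
    return items[:cut], items[cut:]

# Hypothetical raw tweets; the first two are duplicates.
tweets = [
    "RT @papa_loren: Really reckless, even in 2023!! https://t.co/x",
    "RT @papa_loren: Really reckless, even in 2023!! https://t.co/x",
    "Global recession?? Stay calm!",
]
cleaned = drop_duplicates([clean_tweet(t) for t in tweets])
print(cleaned)

# 2426 + 607 = 3033 samples remain after cleaning, per the split sizes above.
train, test = split_80_20(list(range(3033)))
print(len(train), len(test))
```

The split sizes reproduce the 2426 and 607 samples reported above, which implies that 3033 of the original 3074 tweets remain after duplicates are removed.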

Count Vectorizer Feature Extraction with N-Gram
In machine learning, unstructured text data must be converted into numerical form before it can be processed. One method for this is Count Vectorizer feature extraction with N-grams. In this method, the text is divided into sequences of n consecutive words, and the frequency with which each sequence appears is counted. The result is a feature vector representing the text, in which each element indicates the frequency of occurrence of a particular word sequence.
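The counting procedure can be illustrated with a small pure-Python sketch. It mirrors the idea behind scikit-learn's CountVectorizer with its ngram_range parameter, but the function names here are hypothetical and the n-gram range (1 to 2) is an assumption, since the text does not state which range was used.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_vectorize(docs, ngram_range=(1, 2)):
    """Build a vocabulary of n-grams and one count vector per document."""
    lo, hi = ngram_range
    per_doc = []
    for doc in docs:
        tokens = doc.split()
        counts = Counter()
        for n in range(lo, hi + 1):
            counts.update(ngrams(tokens, n))
        per_doc.append(counts)
    vocab = sorted(set().union(*per_doc))
    vectors = [[c[g] for g in vocab] for c in per_doc]
    return vocab, vectors

docs = ["global recession hits", "global recession global fear"]
vocab, vectors = count_vectorize(docs)
# One row per n-gram, one count per document.
for gram, col in zip(vocab, zip(*vectors)):
    print(" ".join(gram), col)
```

Each column of the resulting matrix corresponds to one n-gram, so the feature dimension grows quickly as the upper bound of the range increases.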

BM25 Vectorizer
The Best Matching 25 (BM25) vectorizer [26] is a major improvement over the classic TF-IDF-based algorithm. The BM25 ranking system sorts document matches by their relevance to the searched keywords. This feature is used to estimate the relevance between the query and the text during information indexing. Equations (2) and (3) are the BM25 calculations, and the BM25 vector visualization with Principal Component Analysis (PCA) is shown in Figure 2.
where Q is the query used as input, N is the total number of sentences in the document, n_t is the number of sentences containing the query term t, f_{d,t} is the term frequency in the document, f_{q,t} is the term frequency in the query, and k_1, k_3, and b are constants (parameters).
where k_1 and b are constants (parameters), dl_d is the length of document d, and avl is the average document length.
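A minimal sketch of a BM25 score that is consistent with the symbols defined above is given below. It assumes the common Okapi form, in which each query term t contributes w_t * ((k_1 + 1) f_{d,t} / (K + f_{d,t})) * ((k_3 + 1) f_{q,t} / (k_3 + f_{q,t})), with w_t = ln((N - n_t + 0.5) / (n_t + 0.5)) and K = k_1 ((1 - b) + b * dl_d / avl). The parameter defaults and the toy sentences are assumptions, not values taken from the paper.

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.2, b=0.75, k3=1000.0):
    """Okapi BM25 relevance of one tokenized document to a tokenized query."""
    N = len(docs)
    avl = sum(len(d) for d in docs) / N        # average document length
    K = k1 * ((1 - b) + b * len(doc) / avl)    # length normalization
    f_d = Counter(doc)                         # term frequencies in the document
    f_q = Counter(query)                       # term frequencies in the query
    score = 0.0
    for t, fqt in f_q.items():
        n_t = sum(1 for d in docs if t in d)   # documents containing t
        if n_t == 0:
            continue
        w_t = math.log((N - n_t + 0.5) / (n_t + 0.5))
        score += (w_t
                  * ((k1 + 1) * f_d[t] / (K + f_d[t]))
                  * ((k3 + 1) * fqt / (k3 + fqt)))
    return score

# Hypothetical tokenized sentences standing in for cleaned tweets.
docs = [s.split() for s in (
    "global recession worries indonesia",
    "inflation rises again",
    "economic growth slows",
    "fuel price increases",
    "recession fear spreads on twitter",
)]
query = "recession indonesia".split()
for d in docs:
    print(f"{bm25_score(query, d, docs):.3f}", " ".join(d))
```

In this form the weight w_t becomes negative for terms that occur in more than half of the documents, which is why some implementations clip or smooth it.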

Word Embedding
The word embedding [27] used in this research is the word2vec model with a dimension of 300. Each word is represented as a vector, that is, a point in a space of that dimension. Words with shared characteristics, such as appearing in the same context or having similar semantic meaning, lie close to each other in that space. In the one-hot encoding of the input, a value of 1 marks the position of the word in the vocabulary and a value of 0 marks the positions of all other words.
where w_{t+j} represents a context word of the current word w_t, c is the window size of the context, and the conditional probability Pr(w_{t+j} | w_t) is calculated by Equation (6).
where v_w and v'_w are the vector representations of word w, and W is the word vocabulary. Word2vec assumes that a word has semantic similarity with the surrounding words. However, the surrounding words may not have the same semantics as the current word [29].
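To make the context window and the conditional probability concrete, the sketch below assumes the standard skip-gram formulation of word2vec: every word within c positions of the current word is a context word, and the probability of a context word given the current word is a softmax over dot products of word vectors. The symbol names, the function names, and the 2-dimensional toy vectors are assumptions for illustration (the model described above uses 300 dimensions).

```python
import math

def context_pairs(tokens, c):
    """Yield (current word, context word) pairs within a window of size c."""
    for t, center in enumerate(tokens):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                yield center, tokens[t + j]

def softmax_prob(context, center, v_in, v_out):
    """Pr(context | center) as a softmax over dot products of word vectors."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = {w: dot(v_out[w], v_in[center]) for w in v_out}
    z = sum(math.exp(s) for s in scores.values())
    return math.exp(scores[context]) / z

tokens = "global recession hits indonesia".split()
print(list(context_pairs(tokens, c=1)))

# Toy vectors with made-up values; input and output vectors share values here.
v_in = {"global": [0.9, 0.1], "recession": [0.8, 0.3],
        "hits": [0.1, 0.7], "indonesia": [0.2, 0.9]}
v_out = {w: list(vec) for w, vec in v_in.items()}
probs = {w: softmax_prob(w, "recession", v_in, v_out) for w in v_in}
print(probs)
```

Training adjusts the vectors so that the probabilities of observed pairs increase, which is what pulls words that share contexts toward each other in the embedding space.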

Ensemble Learning
The proposed model in this research uses an ensemble learning algorithm [30]-[32] that relies on a voting system to obtain the best results. Several methods are combined in this model, namely Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), and Support Vector Machine (SVM). The models vote on each prediction, and the outcome of the vote is taken as the final result in the final evaluation. More details can be seen in Figure 3. Equations (7), (8), (9), and (10) show the calculations for Logistic Regression, Decision Tree, Random Forest, and SVM, respectively [33]-[38].
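The voting step can be sketched as a hard majority vote over the labels predicted by the four base models. This is an illustration only: the text does not specify whether votes are weighted or how ties are resolved, so the tie-break rule and the example predictions below are assumptions. In scikit-learn, a comparable setup would be a VotingClassifier with voting="hard".

```python
from collections import Counter

def majority_vote(predictions, tie_label=0):
    """Combine per-model label lists into one label per sample.

    `predictions` maps a model name to its list of predicted labels.
    With an even number of voters a tie can occur; `tie_label` is an
    assumed tie-break rule, not something the paper specifies.
    """
    combined = []
    for votes in zip(*predictions.values()):
        ranked = Counter(votes).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            combined.append(tie_label)
        else:
            combined.append(ranked[0][0])
    return combined

# Hypothetical predictions from the four base models on five tweets.
predictions = {
    "LR":  [1, 0, 1, 1, 0],
    "DT":  [1, 0, 0, 1, 1],
    "RF":  [1, 1, 0, 1, 0],
    "SVM": [0, 0, 1, 1, 0],
}
print(majority_vote(predictions))
```

Because four voters can split two against two on a binary task, the tie-break rule affects the final label for some samples; weighting the voters is one way to avoid such ties.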

Evaluation
After the experiments, the ensemble learning model is evaluated with a confusion matrix to assess its performance. The confusion matrix displays the actual data against the predictions made by the machine learning algorithms. True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) are the numbers of true positive, true negative, false positive, and false negative samples in the dataset, respectively. Equations (13), (14), and (15) are the calculations for the accuracy, recall, and F1 score values, respectively.
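As an illustration of how these metrics follow from the confusion matrix, the sketch below uses the standard definitions: accuracy = (TP + TN) / (TP + TN + FP + FN), precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 * precision * recall / (precision + recall). The function names and the example labels are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical true and predicted labels for eight tweets.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts, metrics(*counts))
```

Reporting precision, recall, and F1 alongside accuracy matters when the two classes are unbalanced, since accuracy alone can look high for a model that favors the majority class.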

RESULTS AND DISCUSSIONS
In the experimentation and testing process, the Ensemble Learning method gave the best-performing result. Before that, the team had conducted other experiments with several parameters.

Results of SVM and TF-IDF algorithm approach with Manual Labeling
Initial experiments were conducted using a dataset that the team labeled manually, with label 0 as negative and label 1 as positive. TF-IDF feature extraction was used to assess word frequency, and word embedding was then used to represent the words. The results of this approach are shown in Table 2.

Results of SVM and BM25 algorithm approach with manual labeling
The experiments then continued using the manually labeled dataset. Table 3 shows the results of using pre-processing, BM25, and word embedding with SVM.

Ensemble Learning Approach Results
Building on the previous experiments, this approach uses multiple algorithms with the Ensemble Learning method. In addition, manual labeling based on differing preferences was judged to have made the data unbalanced and to have resulted in poor performance, so the dataset was relabeled automatically using VADER. This model is used for sentiment analysis and can capture the diversity of the data through the intensity of emotional strength.
The algorithms used in Ensemble Learning are Logistic Regression, Decision Tree, Random Forest, and SVM. The method decides by voting on the results produced by the individual models. To generate the model's performance, we chose BM25 to represent words because, in our experiments, the BM25 method was more sophisticated at calculating relevance scores than simpler methods such as TF-IDF. Across the experimental results, the model is superior in Accuracy, Precision, Recall, and F1 Score. The best Ensemble Learning results for these evaluation metrics are shown in Table 4. The use of BM25 and word embedding greatly supports the model in producing good performance.
This can be seen from the visualization of the word embedding, which compares PCA with t-Distributed Stochastic Neighbor Embedding (t-SNE). PCA is used to identify and reduce word dimensionality. The PCA visualization shows the patterns and relationships between words, where each point represents a word vector. t-SNE is used for comparison because this visualization method pays more attention to the similarity of the distribution between points, which helps to reveal hidden patterns in word relationships. The visualization is shown in Figure 5. The t-SNE visualization (left) explores complex semantic relationships between words, which can help in finding groups of words that are related in ways that are not always visible linearly. The PCA visualization (right) provides a more linear and orderly general picture of the relationships between words.
After considering the results from testing the SVM algorithm, the research team decided to use the Ensemble Learning method as the main approach in experimentation and testing. As a result, the ensemble model was able to overcome the uncertainty problem and obtain an accuracy of 95.02%. From the experiments and tests conducted, it can therefore be concluded that the Ensemble Learning approach provides the best performance in data classification. The performance of the proposed model is compared with other models in Table 5. Table 5 shows that the proposed model outperforms the models from previous studies, with an accuracy of 95.02%.

CONCLUSION
Based on the experiments and tests of the proposed model, which analyzes Indonesian sentiment about the 2023 global recession on Twitter using an Ensemble Learning model that combines several machine learning methods, it can be concluded that the model performs well, with an accuracy of 95.02%. The model can therefore be an effective solution for distinguishing positive from negative Indonesian sentiment. However, the proposed model has language limitations in performing sentiment analysis. Future research is expected to develop models that can perform sentiment analysis in other languages.

Figure 1. Flowchart of the proposed method

Table 5. Comparison model