APPLYING ENSEMBLE LEARNING FOR IMBALANCED DATA

Ngo Thi Lan Anh; Nguyen Anh Hai; Bui Thi Hong Hanh

Ngo Thi Lan Anh, Faculty of Information Technology, University of Transport and Communications
Nguyen Anh Hai, Faculty of Information Technology, University of Transport and Communications
Bui Thi Hong Hanh, Faculty of Information Technology, University of Transport and Communications

Keywords: Ensemble Learning, Bagging, Boosting, Stacking, imbalanced data

Abstract

Machine learning is now widely used for data analysis and has greatly expanded what can be extracted from data. In practice, however, many datasets are class-imbalanced, for example when detecting erroneous banking transactions or classifying cancer patients, and this makes accurate predictive models harder to build. Imbalance often degrades model performance, particularly when predicting events that belong to the minority class. In this study, we apply Ensemble Learning, which combines multiple predictive models to improve performance, to imbalanced datasets.
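To make the effect of imbalance on the minority class concrete, the short sketch below uses hypothetical labels (a 990:10 class split chosen purely for illustration, not taken from the study) and a trivial predictor that always outputs the majority class. It suggests why overall accuracy alone can be misleading on imbalanced data.

```python
# Hypothetical labels: 990 majority-class (0) and 10 minority-class (1) samples.
y_true = [0] * 990 + [1] * 10

# A trivial "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Overall accuracy: fraction of all samples predicted correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Minority recall: fraction of minority samples that were detected.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy = {accuracy:.2f}")                # accuracy = 0.99
print(f"minority recall = {minority_recall:.2f}")  # minority recall = 0.00
```

The predictor scores 99% accuracy while detecting none of the minority samples, which is why class-sensitive measures such as recall are usually reported alongside accuracy.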

We use three common Ensemble Learning techniques, Bagging, Boosting, and Stacking, to harness the strengths of different models. The aim is both to improve model accuracy and to reduce the effect of class imbalance on prediction. Experiments on real-world datasets demonstrate the effectiveness of the proposed method in addressing data imbalance. The approach supports the deployment of machine learning models in fields where imbalanced data is common and contributes to the ongoing development of machine learning and data mining.
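As a rough illustration of the three techniques named above, the following sketch fits a Bagging, a Boosting, and a Stacking classifier with scikit-learn. It is not the authors' experimental setup: the synthetic dataset, the roughly 95:5 class ratio, the base learners, and every hyperparameter are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly a 95:5 class ratio (an assumption).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
# Stratify so that train and test splits keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    # Bagging: many trees, each trained on a bootstrap sample, then voted.
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, random_state=0
    ),
    # Boosting: trees added sequentially, each correcting earlier errors.
    "boosting": GradientBoostingClassifier(random_state=0),
    # Stacking: a meta-learner combines the base learners' predictions.
    "stacking": StackingClassifier(
        estimators=[
            ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
            ("logreg", LogisticRegression(max_iter=1000)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Report imbalance-aware metrics rather than plain accuracy.
    scores[name] = (
        balanced_accuracy_score(y_test, y_pred),
        f1_score(y_test, y_pred),  # F1 of the minority (positive) class
    )
    print(f"{name}: balanced accuracy = {scores[name][0]:.3f}, "
          f"minority F1 = {scores[name][1]:.3f}")
```

Balanced accuracy and minority-class F1 are used here because they respond to minority-class errors; the numbers this sketch prints depend on the synthetic data and should not be read as results of the study.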
