Principal component analysis-enhanced ensemble learning models for proactive failure prediction in cloud-based systems
Cloud computing environments require high availability and scalability, making proactive failure management essential for system reliability, security, and consistent performance. Effective failure prediction reduces downtime, improves disaster recovery, and helps maintain uninterrupted service delivery. This paper presents an optimized machine learning framework for predicting failures in cloud infrastructures that integrates principal component analysis (PCA) with ensemble learning models. The study employs three models, random forest (RF), categorical boosting (CatBoost), and light gradient boosting machine (LightGBM), each combined with PCA to improve feature representation and predictive accuracy. Key operational metrics, including scheduling class, memory usage, central processing unit (CPU) utilization, event instances, and task priority, are used as features. The Google 2019 cluster dataset is used, and preprocessing involves handling missing data, scaling numerical attributes, and encoding categorical variables. Experimental results show that PCA-enhanced RF, CatBoost, and LightGBM achieve accuracies of 94.31%, 97.17%, and 98.36%, respectively, outperforming their standard counterparts. These results indicate the effectiveness of PCA-integrated ensemble learning and its potential for real-time cloud failure prediction and automated fault monitoring in large-scale distributed environments.
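The pipeline described in the abstract (imputation, scaling, encoding, PCA, then an ensemble classifier) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes scikit-learn and pandas, uses a small synthetic table in place of the Google 2019 cluster dataset, and the column names (`cpu_usage`, `memory_usage`, `priority`, `scheduling_class`, `failed`), the median and most-frequent imputation strategies, and the 95% explained-variance threshold for PCA are all hypothetical choices made for illustration. Only the random forest variant is shown.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC_COLS = ["cpu_usage", "memory_usage", "priority"]  # hypothetical names
CATEGORICAL_COLS = ["scheduling_class"]                   # hypothetical name


def make_synthetic_trace(n_rows=2000, seed=0):
    """Build a toy stand-in for a cluster trace (NOT the Google 2019 data)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "cpu_usage": rng.random(n_rows),
        "memory_usage": rng.random(n_rows),
        "priority": rng.integers(0, 10, n_rows).astype(float),
        "scheduling_class": rng.integers(0, 4, n_rows),
    })
    # Assumed labeling rule: failures become likelier under resource pressure.
    pressure = df["cpu_usage"] + df["memory_usage"] + rng.normal(0, 0.2, n_rows)
    df["failed"] = (pressure > 1.2).astype(int)
    # Blank out about 5% of memory readings so the imputation step has work to do.
    missing = rng.random(n_rows) < 0.05
    df.loc[missing, "memory_usage"] = np.nan
    return df


def build_pipeline(n_components=0.95, seed=0):
    """Preprocess, project with PCA, then classify with a random forest."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer(
        [("num", numeric, NUMERIC_COLS), ("cat", categorical, CATEGORICAL_COLS)],
        sparse_threshold=0.0,  # force dense output before PCA
    )
    return Pipeline([
        ("preprocess", preprocess),
        # A float in (0, 1) keeps enough components to explain that variance share.
        ("pca", PCA(n_components=n_components, svd_solver="full")),
        ("model", RandomForestClassifier(n_estimators=100, random_state=seed)),
    ])


def evaluate(seed=0):
    """Fit on a train split and return held-out accuracy on the synthetic data."""
    df = make_synthetic_trace(seed=seed)
    X, y = df[NUMERIC_COLS + CATEGORICAL_COLS], df["failed"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    pipeline = build_pipeline(seed=seed)
    pipeline.fit(X_train, y_train)
    return pipeline.score(X_test, y_test)


if __name__ == "__main__":
    print(f"held-out accuracy on synthetic data: {evaluate():.3f}")
```

Placing PCA inside the pipeline means the projection is fitted on the training split only, which avoids leaking test-set statistics into the components. Assuming the `lightgbm` or `catboost` packages are installed, their scikit-learn-compatible classifiers could replace the `"model"` step to sketch the other two variants. Accuracy on this synthetic table says nothing about the figures reported for the real dataset.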
