Comparative Study of Feature Selection Methods for CatBoost-Based Heart Disease Prediction
Keywords:
Heart Disease Prediction, Feature Selection, CatBoost, Machine Learning, Clinical Decision Support, Cardiovascular Informatics
Abstract
Cardiovascular disease remains one of the world's leading causes of mortality, so accurate diagnostic tools are vital. Machine learning models such as CatBoost are still developing and hold promise for cardiac prediction, but the optimal feature selection strategy for them remains underexplored. To determine the best strategy for enhancing CatBoost-based heart disease prediction, this work performs a thorough comparative analysis of several feature selection techniques. We evaluated six feature selection methods: filter methods (information gain, chi-square), a wrapper method (redundant feature removal), and embedded methods (LASSO, Random Forest Feature Importance, CatBoost Feature Importance), using the publicly available Cleveland heart disease dataset. The dataset was preprocessed, and the CatBoost classifier was evaluated on each feature subset using standard metrics: accuracy, precision, recall, and F1 score. Our results demonstrate that feature selection significantly improves model performance over the baseline (all 13 features). With just seven features selected, the embedded approach using CatBoost feature importance (CB-FI) performed best, reaching a maximum accuracy of 88.8% and an F1 score of 89.8%. It outperformed the filter-based methods and LASSO (accuracy of 87.6%). The best-performing methods agreed on a core set of clinically relevant features: chest pain type (cp), thallium scan (thal), number of major vessels (ca), ST-segment depression (oldpeak), maximum heart rate (thalach), and exercise-induced angina (exang). The study demonstrates that feature selection, particularly using the classifier's intrinsic importance measures (CB-FI), is critical for developing accurate and efficient heart disease prediction models.
Built on a reduced, clinically interpretable feature set, the resulting economical model offers a strong basis for developing dependable and affordable clinical decision support systems to help with the early diagnosis of heart disease.
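The workflow summarized in the abstract (rank features by an importance score, keep the top few, then score predictions with accuracy, precision, recall, and F1) can be illustrated with a minimal, dependency-free sketch. This is not the authors' code. The helper names `select_top_k` and `binary_metrics` are hypothetical, and the importance scores and labels are made-up placeholders, chosen only so that the toy ranking mirrors the six features named in the abstract. In a real pipeline the scores would presumably come from a fitted model, for example CatBoost's `get_feature_importance()`.

```python
from typing import Dict, List, Sequence


def select_top_k(importances: Dict[str, float], k: int) -> List[str]:
    """Return the k feature names with the highest importance scores."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Placeholder importance scores for the 13 Cleveland attributes.
# These numbers are ILLUSTRATIVE ONLY and are NOT results from the study.
toy_importances = {
    "cp": 18.0, "thal": 16.0, "ca": 15.0, "oldpeak": 12.0,
    "thalach": 10.0, "exang": 8.0, "age": 6.0, "sex": 5.0,
    "chol": 4.0, "trestbps": 3.0, "slope": 2.0, "restecg": 0.6, "fbs": 0.4,
}

# Toy ground-truth labels and predictions (also placeholders).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

selected = select_top_k(toy_importances, k=6)
metrics = binary_metrics(y_true, y_pred)

print("Selected features:", selected)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In practice the classifier would be retrained on the selected columns only and the metrics computed on held-out data, so that the comparison between feature subsets is not inflated by evaluating on the training set.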
License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.








