Addressing Class Imbalance for Breast Cancer Prediction in Southern Libya: A Comparative Study of Sampling Techniques

Asma Agaal
Mansour Essgaer
Amal Amarrf

Abstract

Class imbalance arises when the minority class contains far fewer samples than the majority class, which makes classification more difficult. This study addresses class imbalance in breast cancer prediction using a dataset from the Sabha Center for Oncology Treatment in southern Libya. It investigates the impact of eight sampling techniques, including SMOTE, ADASYN, and NearMiss, when combined with Random Forest classification. The findings show that combining SMOTE with Random Forest significantly outperforms the other model configurations, yielding a 21% increase in accuracy for predicting malignant samples and a peak recall of 96%. The study demonstrates the importance of addressing class imbalance in medical datasets to improve the effectiveness of breast cancer prediction models.
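
As a rough illustration of the oversampling idea behind SMOTE and of the recall metric reported above, the sketch below generates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours. It is a minimal pure-Python sketch under stated assumptions, not the authors' pipeline: the function names `smote` and `recall`, the parameter values, and the toy data are all hypothetical, and the Random Forest classifier used in the study is omitted. In practice a library implementation, such as the `SMOTE` class in imbalanced-learn, would normally be used instead.

```python
import random
from math import dist


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Hypothetical helper for illustration only; not the study's code.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


def recall(y_true, y_pred, positive=1):
    """Recall = true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if tp + fn else 0.0


# Toy, made-up data: 20 "benign" points and 4 "malignant" points.
benign = [(float(i), float(i % 3)) for i in range(20)]
malignant = [(10.0, 10.0), (11.0, 10.5), (10.5, 11.5), (12.0, 11.0)]

# Oversample the minority class until both classes have the same size.
new_points = smote(malignant, n_new=len(benign) - len(malignant), k=3)
balanced_malignant = malignant + new_points
```

Because each synthetic point lies on a line segment between two real minority samples, the oversampled class stays within the region the minority data already occupies. Resampling of this kind is normally applied to the training split only, so that evaluation metrics such as recall are computed on untouched data.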

How to Cite
Agaal, A., Essgaer, M., & Amarrf, A. (2024). Addressing Class Imbalance for Breast Cancer Prediction in Southern Libya: A Comparative Study of Sampling Techniques. Sebha University Conference Proceedings, 3(2), 416–422. https://doi.org/10.51984/sucp.v3i2.3357
Section
Conference Proceeding
