
Explainable Supervised and Semi-Supervised Learning for Breast Cancer Risk Prediction from Questionnaires: A Study on BCSC and UAE Datasets

Alsarookh, Omar Ahmad
Date
2025-06
Type
Thesis
Description
A Master of Science thesis in Computer Engineering by Omar Ahmad Alsarookh entitled “Explainable Supervised and Semi-Supervised Learning for Breast Cancer Risk Prediction from Questionnaires: A Study on BCSC and UAE Datasets”, submitted in June 2025. The thesis advisor is Dr. Salam Dhou. A soft copy is available (Thesis, Completion Certificate, Approval Signatures, and AUS Archives Consent Form).
Abstract
Breast cancer is one of the most prevalent cancers globally and remains a leading cause of death among women. While mammography helps detect existing abnormalities, it offers limited insight into the future risk of developing cancer. This highlights the need for proactive risk assessment, which enables early intervention before symptoms appear. This thesis explores supervised and semi-supervised machine learning methods to classify women into low- and high-risk groups using environmental and lifestyle features. Three datasets were used: the labeled Breast Cancer Surveillance Consortium (BCSC) Risk Estimation dataset, the unlabeled BCSC Risk Factors dataset, and a labeled, private, region-specific dataset from University Hospital Sharjah (UHS) in the UAE. In the supervised approach, multiple models were evaluated on the BCSC Risk Estimation dataset, including XGBoost, TabNet, Random Forest, and Logistic Regression. XGBoost achieved the best performance with an F1-score of 0.93. A separate region-specific supervised model was also developed using the UHS dataset, with XGBoost again performing best (F1-score: 0.85). Since labeled medical data is often scarce and expensive to obtain, semi-supervised learning makes it possible to leverage large volumes of unlabeled data to improve model performance and generalization. In the semi-supervised approach, two techniques were applied to the unlabeled BCSC Risk Factors dataset: Label Spreading, a graph-based method, and Self-Training, in which the classifier iteratively labels data based on confidence thresholds. Pseudo-labeled samples from both methods were combined with labeled data to retrain classifiers. The best semi-supervised model (Self-Training with XGBoost) achieved an F1-score of 0.91. Generalizability was evaluated by testing the BCSC-trained models on the UHS dataset. The supervised BCSC model achieved an F1-score of 0.92, while the semi-supervised model’s F1-score dropped to 0.81 when tested on UHS data, highlighting domain shift and pseudo-label noise. SHAP and LIME were used for explainability, confirming the influence of key risk factors such as breast density, family history, and BMI. These results validate the feasibility of integrating supervised and semi-supervised approaches for proactive, population-specific breast cancer risk assessment.
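The Self-Training procedure described in the abstract (a classifier iteratively labels unlabeled samples whose predicted confidence passes a threshold, then retrains on the enlarged set) can be illustrated with a minimal sketch. This is not the thesis implementation: the base classifier is a simple nearest-centroid model standing in for XGBoost, the threshold of 0.9 and the round limit are assumed values, and the two-feature toy data and all function names are hypothetical.

```python
import math

def fit_centroids(X, y):
    """Nearest-centroid stand-in classifier: one mean vector per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def predict_proba(centroids, row):
    """Softmax over negative squared distances to each class centroid."""
    scores = {c: -sum((a - b) ** 2 for a, b in zip(row, mu))
              for c, mu in centroids.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, max_rounds=10):
    """Pseudo-label confident samples, retrain, and repeat.

    threshold and max_rounds are assumed values, not taken from the thesis.
    Returns the final centroids and the samples that were never labeled.
    """
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        centroids = fit_centroids(X_lab, y_lab)
        remaining, added = [], 0
        for row in pool:
            proba = predict_proba(centroids, row)
            label = max(proba, key=proba.get)
            if proba[label] >= threshold:
                # Confident prediction: accept it as a pseudo-label.
                X_lab.append(row)
                y_lab.append(label)
                added += 1
            else:
                remaining.append(row)
        pool = remaining
        if added == 0 or not pool:
            break  # nothing new was labeled, or nothing is left to label
    return fit_centroids(X_lab, y_lab), pool

# Hypothetical toy data: two labeled samples (0 = low risk, 1 = high risk)
# and three unlabeled samples, one of which sits between the classes.
X_lab = [[0.0, 0.0], [4.0, 4.0]]
y_lab = [0, 1]
X_unlab = [[0.5, 0.2], [3.8, 4.1], [2.0, 2.0]]

model, leftover = self_train(X_lab, y_lab, X_unlab)
print(sorted(model))  # class labels known to the final model
print(leftover)       # samples that never passed the confidence threshold
```

In this toy run the two unlabeled samples near a labeled example are pseudo-labeled in the first round, while the ambiguous midpoint never reaches the threshold and stays unlabeled. A lower threshold would admit more pseudo-labels at the cost of more label noise, which is the trade-off the abstract points to when discussing the drop on UHS data.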