Introduction
Breast cancer (BC) is the most frequent and most fatal malignancy among women, accounting for an estimated 11.7% of all cancer cases and about 20% of all cancer-related deaths. Globally, it is the second leading cause of cancer death among people (men and women) after lung malignancies in both developing and developed countries [1]. Based on the global cancer report, BC was the most commonly diagnosed cancer in 2020, with 2.3 million new cases [2]. Early detection and screening can significantly decrease patient costs and improve the overall likelihood of successful treatment and survival [3].
Today, evidence suggests that BC is a global challenge due to its heterogeneous, multifactorial, and aggressive nature and its destructive health effects [4, 5]. It is now well established that malignant BC is often aggressive, forming in the glands and mammary ducts in the early stages [6] and then metastasizing to the surrounding tissues, adjacent lymph nodes, and, in the advanced stages, to the bones, liver, brain, or lungs [7]. Regrettably, many malignancies are detected late, in the advanced stages of the disease, when the tumor has already metastasized to the tissues around the breast, the axillary lymph nodes, and even other organs [8, 9]. Consequently, a growing body of literature recognizes the benefits of systematic and up-to-date screening policies [10].
The most well-known screening methods include mammography, thermography, and tissue sampling techniques, which are implemented most rigorously in many developed countries [11]. However, these methods are time-consuming, expensive, and highly complicated. Recently, there has been renewed interest in simpler techniques, including breast self-examination (BSE) and clinical breast examination (CBE). Despite their low cost and availability, studies have reported inconsistent results on their effectiveness [12, 13].
Several clinical and non-clinical factors influence the incidence of BC [14]. The different stages and severity of BC, together with the ambiguities and unpredictable situations surrounding its outcomes, necessitate adopting innovative technologies for screening [15].
Recently, researchers have shown increased interest in deploying newly developed, non-invasive digital technologies such as artificial intelligence (AI) systems, which can be effective in the rapid, accurate, and timely diagnosis of malignancies [16]. Specifically, rapid diagnosis of cancers in the early stages is considered the most significant factor for definitive treatment of the disease, prevention of unpleasant complications, and increasing patients’ survival chances [17]. Machine learning (ML), a subset of AI, has many applications in many industries, including healthcare [18]. ML plays a crucial role in managing malignancies, supporting prognosis, diagnosis, and the prediction of treatment outcomes from the big data available in the medical field [19].
In the last few decades, several ML-based methods have been developed for the effective and timely prognosis and screening of BC [20, 21]. These methods support decisions by extracting hidden patterns and applicable knowledge from raw datasets [22].
Clinical decision support systems (CDSSs) based on rule-based logic [23] and decision tree algorithms [24] are considered useful, practical, and flexible tools for modeling medical diagnoses and supporting complex decisions [23]. Rule-based machine learning (RBML) is increasingly adopted because of the different stages and degrees of severity of the disease, the ambiguities and unpredictable situations in its behavior and outcome, and the various clinical and non-clinical factors involved in BC emergence and progression [23, 25]. So far, several studies have evaluated the application of ML algorithms in BC risk classification and prediction based on clinical variables.
Momenyan et al. developed an optimum ML-based intelligent model for classifying BC risk [26]. Other researchers compared three different ML algorithms for BC risk classification [27, 28]. Solanki and colleagues investigated the prediction of benign or malignant BC using selected ML techniques [29]. Finally, Salod et al. compared the performance of eight ML algorithms in BC screening and detection [2].
In recent years, many RBML techniques have been applied to predicting BC and classifying disease outcomes. Therefore, this study aimed to develop an appropriate and scientific screening model based on selected RBML algorithms in order to detect the disease earlier, improve diagnostic efficiency, and decrease the risk of mortality caused by BC.
Instrument and Methods
This retrospective single-center study aimed to develop a BC risk prediction model by training seven popular RBML algorithms and selecting the best-performing one. Models were trained and evaluated on data from suspected BC cases recorded from December 2017 to January 2021. The cases were extracted from the BC Registry database of the Ayatollah Taleghani Hospital, Abadan, Iran. The Registry database contains 2854 patient records with 30 features. The independent features (inputs) fall into six main classes: patient characteristics, nutritional factors, medical history, history of BC and related interventions, clinical manifestations, and epidemiological factors. The dependent variable (output) is the diagnosis of BC, coded as 0 for non-BC and 1 for BC cases. The primary variables of the registry database associated with BC prognosis are listed below:
- Demographic: Age, job, education, nationality, the ratio of waist to breast, and Body Mass Index;
- Nutritional factors: Salt, vegetable, dairy, fruit (average in days from 5 years ago), fast food, and oil consumption;
- History of diseases: Diabetes, common cold, hyperlipidemia, hyperglyceridaemia, hypercholesterolemia, hypertension, and fatness;
- History of breast cancer and interventions: A personal history of breast cancer, history of breast sampling, history of chest radiotherapy, and family history of breast cancer;
- Clinical manifestations: Presence of a mass in the upper quadrant of the breast or in an unspecified region of the breast;
- Epidemiological factors: Walking, heavy job, physical, optimal physical activities, and alcohol consumption;
- Outcome: BC and non-BC.
After applying the exclusion criteria, 1668 case records were ultimately chosen for the study (Diagram 1).
![src=./files/hehp/images/HTML_Publish/55598/55598-_D_1.PNG]()
Diagram 1) Flow chart describing patient selection
The ethics board of the Abadan University of Medical Science approved the study design. Before implementing the ML algorithms, the raw dataset was preprocessed, a common requirement for many ML predictions. We removed samples with more than 70% missing data from the analysis. For the remaining missing fields, we imputed quantitative variables with the mean of the available values and qualitative variables using K-nearest neighbors (KNN) with Euclidean distance. The models were implemented in the Rapid Miner Studio 7.1.1 environment.
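The preprocessing steps described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the Rapid Miner workflow used in the study: the function names, the field names (`age`, `bmi`, `alcohol`), the value of `k`, and the toy records are all hypothetical.

```python
import math
from collections import Counter


def drop_sparse(rows, max_missing=0.70):
    """Keep rows whose share of missing (None) fields is at most max_missing."""
    return [
        row for row in rows
        if sum(value is None for value in row.values()) / len(row) <= max_missing
    ]


def impute_mean(rows, key):
    """Fill missing values of a quantitative field with the mean of the observed values."""
    observed = [row[key] for row in rows if row[key] is not None]
    mean = sum(observed) / len(observed)
    return [{**row, key: mean if row[key] is None else row[key]} for row in rows]


def impute_knn(rows, key, distance_keys, k=3):
    """Fill a missing qualitative field with the majority value among the
    k nearest rows (Euclidean distance over distance_keys) that have it."""
    donors = [row for row in rows if row[key] is not None]

    def distance(a, b):
        return math.dist([a[d] for d in distance_keys], [b[d] for d in distance_keys])

    filled = []
    for row in rows:
        if row[key] is None:
            nearest = sorted(donors, key=lambda donor: distance(row, donor))[:k]
            majority = Counter(d[key] for d in nearest).most_common(1)[0][0]
            row = {**row, key: majority}
        filled.append(row)
    return filled


# Toy records with hypothetical values, not taken from the registry.
records = [
    {"age": 30.0, "bmi": 22.0, "alcohol": "No"},
    {"age": 32.0, "bmi": None, "alcohol": "No"},
    {"age": 60.0, "bmi": 31.0, "alcohol": "Yes"},
    {"age": 31.0, "bmi": 22.5, "alcohol": None},
    {"age": None, "bmi": None, "alcohol": None},  # 100% missing, so it is dropped
]
records = drop_sparse(records)                 # remove samples with >70% missing data
records = impute_mean(records, "bmi")          # quantitative field: mean imputation
records = impute_knn(records, "alcohol", ["age", "bmi"], k=2)  # qualitative field: KNN vote
```

Imputing the quantitative fields first means the Euclidean distance in the KNN step is computed on complete numeric values.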
To select the best predictors and reduce the dataset dimension, the Chi-square test of independence was used to determine the relationship between each of the 30 independent variables and the dependent variable (BC diagnosis: Yes or No) as the output class. A p-value below 0.01 was considered statistically significant. After determining the most important factors affecting BC, a set of RBML algorithms, namely J-48, random forest (RF), random tree (RT), REP-Tree, decision table (DT), J-RIP, and Part, was trained to classify the diagnosis value in the dataset and to elicit knowledge about BC classification in IF-THEN form. These techniques are used to discover the knowledge and hidden patterns for diagnosing BC that exist in the dataset. The Weka software 3.9 was used for this purpose.
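As a rough illustration of the feature-selection step, the sketch below computes a Pearson Chi-square test of independence for a single binary predictor against the binary diagnosis. It assumes a 2x2 contingency table and no continuity correction; the function name and the counts are hypothetical and are not study data.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test of independence for the 2x2 table [[a, b], [c, d]].

    Rows are the predictor levels (e.g. Yes/No), columns are the outcome
    (BC/non-BC). No continuity correction is applied.
    Returns (statistic, p_value).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n, row2 * col1 / n, row2 * col2 / n]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: 30 of 40 exposed and 10 of 40 unexposed patients are BC cases.
statistic, p_value = chi_square_2x2(30, 10, 10, 30)
keep_feature = p_value < 0.01  # the significance level stated in the text
```

Predictors with more than two categories need a table with more cells and more degrees of freedom, which this shortcut for the p-value does not cover.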
In the last phase, the performance of all algorithms was assessed using the positive predictive value (PPV), negative predictive value (NPV), sensitivity, specificity, accuracy, F-score, and the area under the receiver operating characteristic curve (AUC-ROC). The confusion matrix was used to measure the classification capability of each data mining algorithm, with its entries defined as follows. True Positive (TP) and True Negative (TN) are the numbers of BC and non-BC cases, respectively, that the algorithm correctly classifies as positive and negative. False Positive (FP) and False Negative (FN) are the numbers of non-BC and BC cases that the algorithm incorrectly classifies as positive and negative, respectively. 10-fold cross-validation was used to measure and compare the performance of all data mining algorithms, accounting for the error in estimating each algorithm’s performance. After determining the best algorithm using the different performance criteria, the best knowledge for diagnosing BC was extracted in IF-THEN form, and the rules covering the most classified samples were considered the main knowledge for diagnosing BC.
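The text does not spell out the formulas for these criteria. Assuming the standard definitions in terms of the confusion-matrix entries, and assuming the F-score is the balanced F1 measure, they can be written as:

```latex
\begin{align*}
\text{PPV} &= \frac{TP}{TP + FP}, &
\text{NPV} &= \frac{TN}{TN + FN}, \\
\text{Sensitivity} &= \frac{TP}{TP + FN}, &
\text{Specificity} &= \frac{TN}{TN + FP}, \\
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}, &
F &= \frac{2 \cdot \text{PPV} \cdot \text{Sensitivity}}{\text{PPV} + \text{Sensitivity}}.
\end{align*}
```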
Findings
A total of 554 positive and 949 negative BC cases remained and were included in the statistical analysis. The mean age was 48.146±13.074 years in afflicted women and 43.212±9.70 years in non-afflicted cases. Table 1 shows the basic data of the two groups.
Based on the Chi-square test, 18 variables had a significant relationship with the BC diagnosis at p<0.01. A mass in the upper quadrant of the breast, a history of chest radiotherapy, and fatness were the three most important factors for diagnosing BC at p<0.001 (Table 2).
Table 1) The frequency results of demographic variables
![src=./files/hehp/images/HTML_Publish/55598/55598-_T_1.PNG]()
Table 2) The most important BC prediction factors at p<0.01
![src=./files/hehp/images/HTML_Publish/55598/55598-_T_2.PNG]()
The combined association between the BC diagnostic factors and the dependent variable was examined using binary logistic regression (BLR) with the forward selection method; the results are given in the IF-Term removed table (Table 3). As shown in Table 3, in the 9th step of the BLR, after entering nine variables (history of breast sampling, history of chest radiotherapy, family history of BC, alcohol consumption, vegetable consumption, diabetes, physical activity, age, and a mass in the upper quadrant of the breast), the average log-likelihood of the model was -61.91 at p<0.01. In conclusion, selecting these nine variables reduced the log-likelihood and improved the performance of the BLR model; these variables therefore had a significant hybrid correlation coefficient with the output class at p<0.01.
Comparing the selected RBML algorithms using the confusion matrix (Table 4) showed that the DT was the only algorithm that classified all non-BC samples correctly (FP=0 and TN=949), making it the best algorithm in this regard. The J-48 decision tree algorithm, with FP=1 and TN=948, was also highly capable of classifying the non-BC cases. In addition, with FN=12 and TP=542, J-48 classified the positive cases better than the other algorithms.
The PPV, NPV, sensitivity, specificity, accuracy, and F-score of these algorithms are shown in Diagram 2. Although the DT rule-based algorithm, with specificity=1 (FP=0), was the best at classifying the negative BC cases alone, the J-48 decision tree algorithm, with accuracy=0.991 and F-measure=0.987, achieved the best overall performance in classifying all research samples. The ROC curves of all RBML algorithms are shown in Diagram 3.
Overall, comparing the classification performance of all algorithms across the different evaluation criteria showed that the J-48 decision tree algorithm, with a PPV of 0.998, NPV of 0.987, sensitivity of 0.978, specificity of 0.998, accuracy of 0.991, F-measure of 0.987, and AUC of 0.9997, outperformed the other algorithms in predicting BC risk. Diagram 4 depicts the J-48 decision tree together with the technical characteristics used in this study. Finally, the best knowledge about diagnosing BC, that is, the rules covering the most classified samples, was extracted from this algorithm in IF-THEN form and then interpreted. The most important technical settings used for building the best-performing J-48 model were batch size=100, binary split=False, collapse tree=True, confidence factor=0.25, minimum number of objects=2, number of decimal places=2, number of folds=3, reduced error pruning=True, and seed=1.
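The J-48 criteria can be recomputed from the confusion-matrix counts reported for that algorithm (TP=542, FP=1, FN=12, TN=948). The sketch below assumes the standard definitions of each criterion and F1 as the F-measure; the function name is hypothetical. The recomputed values agree with the reported figures to within about 0.001, with small differences in the third decimal that may come from rounding or truncation.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard binary classification criteria from confusion-matrix counts."""
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f_measure = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "PPV": ppv,
        "NPV": npv,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "F-measure": f_measure,
    }


# Counts reported in the text for the J-48 decision tree.
j48 = confusion_metrics(tp=542, fp=1, fn=12, tn=948)
for name, value in j48.items():
    print(f"{name}: {value:.4f}")
```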
Some knowledge extracted from the J-48 decision tree algorithm with highly classified samples:
- IF Radio therapy=Yes THEN Diagnosis=breast cancer;
- IF Radio therapy=No & Alcohol=Yes THEN Diagnosis=breast cancer;
- IF Radio therapy=No & Alcohol=No & Age <=38 THEN Diagnosis=Non-breast cancer.
Based on the diagram of the J-48 decision tree, the history of chest radiotherapy was the most important factor for diagnosing BC. Three rules were obtained as the most important patterns:
1- The first rule uses only the history of chest radiotherapy as its condition: a history of chest radiotherapy was seen in 455 of the positive cases, and if a person has this risk factor, the probability of having BC can be 82.1%;
2- In the second rule, if a person has no history of chest radiotherapy but consumes alcohol, the probability of having BC can be 11.3% (63 positive samples were correctly classified);
3- The third rule is very important for diagnosing the non-BC cases: if a person has no history of chest radiotherapy, does not consume alcohol, and is aged 38 years or younger, the probability of not having BC can be 89.5% (850 correctly classified samples/949 total negative samples).
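The three rules can be read as an ordered IF-THEN cascade. The sketch below is a minimal illustration, not the full J-48 tree: the function and argument names are hypothetical, and cases not covered by the three listed rules return `None` because the remaining branches are not given in the text. The second half recomputes the quoted percentages from the quoted counts, assuming that the denominator for rules 1 and 2 is the 554 positive cases (the text states the denominator explicitly only for rule 3).

```python
def classify(radiotherapy, alcohol, age):
    """Apply the three reported rules in order; return None if none applies."""
    if radiotherapy:
        return "breast cancer"          # rule 1
    if alcohol:
        return "breast cancer"          # rule 2
    if age <= 38:
        return "non-breast cancer"      # rule 3
    return None                         # other branches are not listed in the text


# Share of each class covered by a rule, from the counts quoted in the text.
# Assumption: rules 1 and 2 are measured against the 554 positive cases.
coverage = {
    "rule 1": 455 / 554,
    "rule 2": 63 / 554,
    "rule 3": 850 / 949,
}
for rule, share in coverage.items():
    print(f"{rule}: {share:.1%}")
```

Read this way, the percentages describe how many cases of a class each rule covers rather than the predictive value of the rule for an individual patient.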
Table 3) IF-Term removed table for BC diagnostic factors (p<0.001)
![src=./files/hehp/images/HTML_Publish/55598/55598-_T_3.PNG]()
Table 4) The selected RBML confusion matrix
![src=./files/hehp/images/HTML_Publish/55598/55598-_T_4.PNG]()
![src=./files/hehp/images/HTML_Publish/55598/55598-_D_2.PNG]()
Diagram 2) Various performance evaluation criteria of different RBML algorithms (The vertical and horizontal axes of the diagram show the True Positive Rate (TPR) and False Positive Rate (FPR), respectively)
![src=./files/hehp/images/HTML_Publish/55598/55598-_D_3.PNG]()
Diagram 3) The ROC of different RBML algorithms
![src=./files/hehp/images/HTML_Publish/55598/55598-_D_4.PNG]()
Diagram 4) The J-48 pruned decision tree
Discussion
The purpose of the current study was to effectively identify BC cases through intelligent RBML techniques. Multiple RBML-based predictive models were developed for early risk prediction of BC based on the clinical data of 1668 suspected BC cases. We trained seven RBML algorithms, namely J-48, RF, RT, REP-Tree, DT, J-RIP, and Part, on the parameters most strongly related to the risk of BC, as derived from a correlation coefficient analysis. The selected algorithms were applied to the pre-processed dataset. This study first selected the most reliable and clinically relevant predictors of BC by using the Chi-square test of independence. We then identified nine highly correlated variables that had a meaningful hybrid correlation coefficient with the output class at P<0.05. It is proven that ML can be a