Biostatistics SPSS Analysis Project
Aidian Sasenarine
The City College of New York
MED 22311: Introduction to Biostatistics
Sophia Barrett
May 11th, 2025
1. Introduction:
A) Background Research & Research Rationale:
There are numerous key variables that affect the prognosis of cardiovascular diseases. These variables encompass symptoms, adverse health effects, and protective factors. Furthermore, these factors are affected by many demographic characteristics, including age, gender, and other diagnoses. It is essential to measure these variables accurately in order to paint a complete picture of these factors and their collective influence on cardiovascular diseases. Detailing the relationships among the variables that affect cardiovascular diseases is important because it reveals key risk factors and supports better prevention and treatment. Within this project, the key variables highlighted from the dataset are Chest Pain, Resting Blood Pressure, Serum Cholesterol, Fasting Blood Sugar, Exercise Angina, Max Heart Rate, and Heart Disease Diagnosis.
Cardiovascular disease remains a leading global health concern, with numerous biological and lifestyle-related risk factors. Even though known contributors in this dataset, such as serum cholesterol levels and chest pain, are associated with these diseases, their combined predictive power and the relationships between the variables are less explored in population-level and individual-level data. This report uses clinical data from the UCI Heart Disease dataset to investigate how these risk factors relate to one another and, ultimately, how they are linked to heart disease outcomes. Understanding these relationships can contribute to more informed screening, better interpretation of screening results, and better treatment and prevention plans.
B) Dataset Background:
This report pulls data directly from the UCI Machine Learning Repository’s Heart Disease dataset. This dataset is a well-known sample of individuals who were screened for cardiovascular conditions. It merges patient records from the Cleveland Clinic Foundation (U.S.), the Hungarian Institute of Cardiology (Hungary), University Hospital in Zurich (Switzerland), and Long Beach VA Medical Center (U.S.). Each patient record includes important demographic information and risk factors that are used every day in clinics. These measurements include chest pain type, gender, age, resting blood pressure, cholesterol levels, blood sugar, heart disease diagnosis, and more. The dataset’s wide variety of patient data makes it an ideal choice for examining the potential relationships between cardiovascular risk factors and disease outcomes (Janosi et al., 1989).
C) Null and Alternative Hypothesis:
Chi-Square Test: Is there an association between fasting blood sugar levels and types of chest pain?
Null Hypothesis (H₀): There is no association between fasting blood sugar and the different types of chest pain {Typical, Atypical, Non-Anginal}.
Alternative Hypothesis (H₁): There is a significant association between fasting blood sugar and the different types of chest pain {Typical, Atypical, Non-Anginal}.
Independent Samples t-Test: Does a person’s maximum heart rate differ significantly between those with and without exercise-induced angina?
Null Hypothesis (H₀): There is no significant difference in maximum heart rate between individuals with exercise angina and those without exercise angina.
Alternative Hypothesis (H₁): There is a significant difference in max heart rate between individuals with exercise angina and those without exercise angina.
One Way ANOVA: Does mean serum cholesterol differ across types of chest pain (Typical, Atypical, Non-Anginal)?
Null Hypothesis (H₀): Mean serum cholesterol level is the same across all chest pain types.
Alternative Hypothesis (H₁): At least one chest pain type group has a different mean serum cholesterol.
Two Way ANOVA: Do exercise-induced angina and fasting blood sugar interact to affect serum cholesterol levels?
Null Hypothesis (H₀): There is no significant interaction between Exercise Angina and Fasting Blood Sugar in their effect on Serum Cholesterol.
Alternative Hypothesis (H₁): There is a significant interaction between Exercise Angina and Fasting Blood Sugar in their effect on Serum Cholesterol.
Multiple Linear Regression: Do chest pain type, serum cholesterol, and resting blood pressure accurately and significantly predict maximum heart rate?
Null Hypothesis (H₀): Chest pain type, serum cholesterol, and resting blood pressure do not significantly predict maximum heart rate.
Alternative Hypothesis (H₁): At least one of the predictors – Chest Pain Type, Serum Cholesterol, or Resting Blood Pressure – significantly predicts maximum heart rate.
Multiple Logistic Regression: Do chest pain, resting blood pressure, serum cholesterol, fasting blood sugar, and maximum heart rate significantly predict the chances that an individual has heart disease?
Null Hypothesis (H₀): Chest pain, resting blood pressure, serum cholesterol, fasting blood sugar, and maximum heart rate do not significantly predict whether or not a person has heart disease.
Alternative Hypothesis (H₁): At least one of those predictors (chest pain, resting blood pressure, serum cholesterol, fasting blood sugar, or maximum heart rate) significantly predicts whether or not a person has heart disease.
2. Methods:
A) Sample:
UCI Heart Disease Dataset
This dataset includes exactly 1000 patients who underwent screening for heart disease. The standard UCI Heart Disease Dataset has 303 patients, but this was an expanded version, either using data from other hospitals or synthetically expanding the actual dataset (Aishah, 2020).
The UCI Heart Disease Dataset compiles patients from multiple sources: Cleveland Clinic Foundation, Hungarian Institute of Cardiology, University Hospital in Zurich, VA Medical Center, and University Hospital in Basel (Osei-Nkwantabisa & Ntumy, 2024).
The sample includes both females and males ranging in age from 29 to 77. The variables in the dataset are binary categorical, multi-level categorical, and continuous. The variables used in this analysis are described below. The clinical focus was on how these variables act as risk factors for the development of heart disease, and the outcome measured was the presence or absence of heart disease (Osei-Nkwantabisa & Ntumy, 2024).
B) Variables:
Chest Pain Types (chestpain)
- Categorical Variable
- 0 = No Chest Pain, 1 = Typical Chest Pain, 2 = Atypical Chest Pain, 3 = Non-Anginal Chest Pain
Chest pain is often listed as a symptom of many cardiovascular diseases rather than as a risk factor itself (Cayler Jr, 2005). Obtaining a complete picture of the severity, location, and duration of pain is difficult because these data rely on the patient’s self-report (Cayler Jr, 2005). What can be measured are tests that help determine the underlying cause of the pain rather than the pain itself. Tests such as the electrocardiogram (ECG), chest radiograph, and blood tests help construct clinical predictions and aid in diagnosing the underlying cause.
Chest pain is relevant to the Heart Disease Dataset because it is a major clinical symptom often associated with heart disease. Furthermore, it is prominent in the diagnosis of angina pectoris, the chest pain caused by ischemic (shortage of oxygen) cardiac disease. The dataset uses chest pain type as a predictor to indicate the underlying cardiovascular disease. Accurate classification of chest pain is important and influences decisions about further testing and treatment (Lenfant, 2010). The three main classifications of chest pain are typical chest pain, which is caused by heart issues; atypical chest pain, which does not follow the usual patterns of heart-related pain; and non-anginal chest pain, which is not related to the heart and comes from other causes such as muscles or digestion.
Chest pain is more commonly reported by men, especially in middle age, often as a typical heart-related symptom. Women and older adults report less chest pain. People with risk factors such as smoking or high blood pressure, regardless of gender or race, are more likely to experience chest pain (Lenfant, 2010).
Resting Blood Pressure (restingBP)
- Continuous Variable
- Typical Range Below 120 / 80 mmHg
Blood pressure shifts the risk for cardiovascular disease, as there is an association between resting blood pressure and heart disease. Blood pressure measures the force with which blood pushes against the walls of the arteries. Systolic blood pressure (SBP) is the top number in a standard blood pressure reading and signifies the pressure when the heart beats. The bottom number, diastolic blood pressure, is the pressure between heartbeats (Gillum, 1998). The patient’s blood pressure is measured every 2 minutes during two different resting periods: one in which the patient is seated for 16 minutes and another in which the patient is lying down (Gillum, 1998).
Resting blood pressure is closely tied to the relationship between resting heart rate (RHR) and blood pressure (BP). This relationship is key not only to the risk of mortality in multiple cardiovascular diseases but also to the risk of developing the diseases themselves. Resting blood pressure directly relates to the UCI Heart Disease Dataset because both RHR and blood pressure are associated with an increased risk of cardiovascular disease mortality, with a particularly significant risk when a fast RHR is combined with high systolic blood pressure (Sala et al., 2006).
Both RHR and BP contribute to the development of atherosclerosis and to an increased risk of heart attacks and other cardiovascular complications. RHR tends to be higher in women and younger individuals, while physically active people tend to have lower heart rates. BP generally increases as a person gets older and is higher in men and Black individuals. Both RHR and BP are influenced by factors such as fitness, stress, and interactions with other health conditions.
Serum Cholesterol Levels (serumcholesterol)
- Continuous Variable
- Typical Range Below 200 mg/dL
Serum cholesterol also plays a major role in the development of cardiovascular diseases. Patients with low High-Density Lipoprotein (HDL) levels have difficulty removing plaque buildup, leading to heart complications, while those with high levels of Low-Density Lipoprotein (LDL) have an above-average buildup of plaque in the arteries (Chen et al., 1991). In addition, serum cholesterol, specifically Low-Density Lipoprotein Cholesterol (LDL-C), is a key factor in the diagnosis of hypercholesterolemia. Serum cholesterol levels are measured with density-gradient ultracentrifugation, chromatographic, and electrophoretic techniques. LDL-C levels may also be estimated with the Friedewald equation (Chen et al., 1991).
Serum cholesterol concentration has a direct positive relationship with mortality from coronary heart diseases (CHD) (Rifai et al., 1992). The findings emphasize that cholesterol levels are directly linked to CHD risk. Furthermore, cholesterol levels help assess individual risk for coronary events (especially for individuals in the dataset).
Primarily, serum cholesterol levels are affected most by age, sex, ethnicity and socioeconomic status. Cholesterol levels generally increase with age while higher levels are found in men as opposed to women (Rifai et al., 1992).
Fasting Blood Sugar Levels (fastingbloodsugar)
- Categorical Variable
- 0 = Normal Level ( < 120 mg/dL ), 1 = High Level ( > 120 mg/dL)
Fasting blood glucose is simply the level of glucose in the blood after an overnight fast. Elevated fasting blood glucose is typically a sign of untreated Type 2 diabetes but also has clinical associations with heart complications (Emerging Risk Factors Collaboration, 2010). Over the long term, elevated fasting blood glucose can affect triglyceride levels, ultimately leading to heart problems. Fasting blood glucose is measured through a blood test after an 8-hour fast. Levels are meant to hover around 4-6 mmol/L, and measurements are taken every three months (Emerging Risk Factors Collaboration, 2010).
Unlike serum cholesterol levels, fasting blood glucose concentrations show a modest non-linear association with coronary heart disease risk. However, high levels of fasting blood glucose (> 5.59 mmol/L) have a significant impact on vascular disease risk (Holman & Turner, 1988). In addition, a high blood glucose level means a higher risk of type 2 diabetes, which further complicates the underlying causes of other diseases and makes treatment plans more complex and difficult.
In the case of fasting blood glucose levels, older adults tend to have higher amounts of sugar in the blood, with men reaching their peak before women. This peak starts around the mid-50s. In addition, individuals of South Asian descent are at a higher risk for elevated fasting blood sugar because of diets high in carbohydrates. Moreover, obesity and increased BMI are strongly associated with higher blood glucose levels (Chen et al., 1991).
Exercise-Induced Angina (exerciseangina)
- Categorical Variable
- 0 = Absence of Angina, 1 = Presence of Angina
Exercise Angina presents similarly to chest pain, but the chest discomfort surfaces during and after physical activities. This is due to the reduced blood flow to the heart muscle, commonly associated with coronary artery disease (Long et al., 2019). Also, similarly to chest pain, angina is a symptom, and doctors attempt to find the underlying cause of the pain and not measure the pain itself. Doctors will take measures of a patient’s heart rate, ECG, and perfusion scintigraphy while the patient performs physical activities to help diagnose issues related to exercise-induced angina (Long et al., 2019).
Exercise angina is highly relevant to cardiovascular diseases, particularly CHD. Statistics highlight that about 30-60% of people undergoing exercise testing may have CHD when exercise angina was one of the markers of the condition. It is relevant to the heart disease dataset as it is an established method of evaluating prognosis of cardiovascular diseases (Hlatky, 1999).
Exercise angina can be significantly affected by numerous demographic factors. Factors such as older age (men > 45, women > 55), obesity, diabetes, hypertension, and a positive smoking history all increase the risk of exercise angina. Some ethnic groups, particularly African Americans and Hispanics, tend to have higher rates of CHD because of a stronger association with these risk factors (Hlatky, 1999).
Individual’s Maximum Heart Rate (maxheartrate)
- Continuous Variable
Maximum heart rate is the highest number of beats per minute the heart can reach during strenuous physical activity. There is controversy surrounding estimates of maximum heart rate, as studies have found the formula max heart rate = 220 - age to be inaccurate, especially in older adults (Tanaka & Seals, 2001). A more accurate model that has been tested on more than 18,000 people is max heart rate = 208 - 0.7 × age. This model better reflects actual heart rate limits regardless of gender or physical activity (Tanaka & Seals, 2001).
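The two age-based formulas above can be compared with a short sketch. The function names are hypothetical and chosen only for illustration. The ages are the youngest and oldest in the sample (29 and 77), plus age 40, where the two formulas happen to agree.

```python
def max_hr_traditional(age: float) -> float:
    """Traditional estimate of maximum heart rate in bpm: 220 - age."""
    return 220 - age


def max_hr_tanaka(age: float) -> float:
    """Tanaka & Seals (2001) estimate in bpm: 208 - 0.7 * age."""
    return 208 - 0.7 * age


# Compare the two estimates at three illustrative ages.
for age in (29, 40, 77):
    traditional = max_hr_traditional(age)
    tanaka = max_hr_tanaka(age)
    print(f"age {age}: 220 - age = {traditional:.1f} bpm, "
          f"208 - 0.7 x age = {tanaka:.1f} bpm")
```

For older patients the traditional formula gives the lower estimate (143 bpm versus about 154 bpm at age 77), which illustrates the gap in older adults that the paragraph describes.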
Maximum heart rate is more informative about cardiovascular disease when it is considered alongside resting heart rate than when it is considered alone. However, that does not mean it is unimportant in the diagnosis of a patient with cardiovascular problems. Monitoring both resting and maximum heart rates is key for evaluating cardiovascular risk and guiding treatment plans.
Maximum heart rate is primarily influenced by age, decreasing steadily as people get older. Factors such as ethnicity, sex, and physical activity level play only a minimal role in maximum heart rate.
Diagnosis of Heart Disease (target)
- Categorical Variable
- 0 = No Presence of Heart Disease, 1 = Presence of Heart Disease
All of these factors, along with other variables, play a role in determining and diagnosing cardiovascular disease. Cardiovascular diseases, put simply, are conditions that affect the heart and blood vessels. Heart disease is typically diagnosed from a combination of medical history, diagnostic tests, and physical exams. Some common tests include ECGs and stress tests (Videbæk et al., 2016).
The heart disease diagnosis is essential to this dataset because the other variables are used to predict it. Without this variable, no conclusions could be drawn about the predictors of heart disease from a statistical lens.
C) Analysis Plan:
This study encompasses both inferential statistics (association tests) and predictive modeling (regression models) to assess the relationship between clinically relevant variables in the UCI Heart Disease Dataset and the presence of heart disease. The report details 6 statistical tests conducted in SPSS on the UCI Heart Disease Dataset. The first analysis, a Chi-Square Test for independence, will assess the association between fasting blood sugar levels and chest pain types. Afterwards, an independent samples t-test will analyze the difference in maximum heart rate between those with exercise angina and those without. A one-way ANOVA will then evaluate whether mean serum cholesterol differs across chest pain types, and a two-way ANOVA will evaluate whether exercise angina and fasting blood sugar levels, alone and in combination, affect serum cholesterol levels. These tests cover the inferential analyses. Moving on to the predictive models, a multiple linear regression will estimate how well chest pain type, serum cholesterol levels, and resting blood pressure predict maximum heart rate. Finally, a multiple logistic regression will detail whether or not all these factors combined significantly predict the likelihood of heart disease. All statistical output is detailed in the appendix.
The performance of each model will be displayed visually in graphs and tested for normality, collinearity, and linearity when applicable. Metrics of the strength of relationship, such as R Squared, will indicate how much of the variation in the outcome is explained by the predictors. Taken together, these analyses aim to determine the important patterns and predictive relationships among key clinical variables. These include chest pain, blood sugar, serum cholesterol, and heart rate. Moreover, these variables may have a significant influence on the presence of heart disease, and mapping the relationship between these key risk factors can help develop a better diagnosis and guide treatment options.
3. Results:
A) Data Preparation (See Appendix G)
B) Interpretive Statistical Analysis
1) Chi-Square Test: Is there an association between fasting blood sugar levels and types of chest pain?
A chi-square test for independence was conducted to determine the association between fasting blood sugar (< 120 mg/dL, > 120 mg/dL) and the different types of chest pain (no chest pain, typical chest pain, atypical chest pain, non-anginal chest pain). The results indicate a significant association: χ² (3, N = 1000) = 54.58, p < .001, Cramér’s V = .23. Furthermore, Cramér’s V shows a moderate association between the variables. Individuals with normal fasting blood sugar (< 120 mg/dL) were most likely to have no chest pain, while those with high fasting blood sugar (> 120 mg/dL) were most likely to have atypical chest pain. See Appendix A3 and A4.
Figure 8:
Note. Count measures the number of individuals in each category. The bar graph highlights the distribution of fasting blood sugar across the different types of chest pain. The results indicated a significant association, χ² (3, N = 1000) = 54.58, p < .001, Cramér’s V = .23.
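As a check on the reported effect size, Cramér’s V can be recomputed from the chi-square statistic alone. The sketch below assumes a 2 × 4 contingency table (two fasting blood sugar levels by four chest pain categories), which is consistent with the reported 3 degrees of freedom. The function name is hypothetical.

```python
import math


def cramers_v(chi2: float, n: int, n_rows: int, n_cols: int) -> float:
    """Cramér's V = sqrt(chi2 / (N * min(rows - 1, cols - 1)))."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


# Values reported in the chi-square test above; the 2 x 4 table shape is an
# assumption inferred from df = (2 - 1) * (4 - 1) = 3.
v = cramers_v(chi2=54.58, n=1000, n_rows=2, n_cols=4)
print(f"Cramér's V = {v:.2f}")
```

The result rounds to .23, matching the value reported from SPSS.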
2) Independent Samples t-Test: Does a person’s maximum heart rate differ significantly between those with and without exercise-induced angina?
An independent samples t-test was conducted to compare maximum heart rate between individuals with exercise-induced angina and those without exercise-induced angina. The test was not statistically significant: t(998) = -0.49, p = 0.624, d = 0.03, 95% CI [-5.31, 3.18]. This suggests that exercise angina was not meaningfully related to an individual’s maximum heart rate. Those without angina (M = 151.1, SD = 22.3) had a marginally greater maximum heart rate than those with angina (M = 150.0, SD = 21.9). See Appendix B6.
Figure 9:
Note. The dependent axis measures the maximum heart rate of individuals in bpm. Boxplot of maximum heart rate by presence of exercise angina. The results indicated a non-significant difference, t(998) = -0.49, p = 0.624, d = 0.03, 95% CI [-5.31, 3.18]. The confidence intervals, standard deviations, and mean maximum heart rates of the two groups were nearly identical.
3) One Way ANOVA: Does mean serum cholesterol differ across types of chest pain (Typical, Atypical, Non-Anginal)?
A one-way ANOVA was conducted to determine whether or not the mean serum cholesterol level differed across the different types of chest pain (no chest pain, typical chest pain, atypical chest pain, non-anginal chest pain). The ANOVA revealed a significant effect, F(3, 996) = 10.91, p < 0.01, partial η² = 0.032. Post hoc comparisons using Tukey’s HSD showed that participants with no chest pain had a markedly lower mean serum cholesterol level (M = 286.28, SD = 44.37) than those with typical chest pain (M = 318.97, SD = 48.45), atypical chest pain (M = 331.64, n = 312), and non-anginal chest pain (M = 370.14, SD = 56.42). See Appendix C1 and C10.
Figure 10:
Note. Boxplot of serum cholesterol levels by chest pain type. Shapiro-Wilk tests indicated non-normality within all four groups (p < 0.001), and Levene’s test revealed unequal variances (p < 0.001), so caution is needed when generalizing the results to populations. Despite overlapping error bars, the results indicated significant differences across groups, F(3, 996) = 10.91, p < 0.01.
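The effect size reported for the one-way ANOVA can be recovered from the F statistic and its degrees of freedom, using the identity partial η² = (F × df_effect) / (F × df_effect + df_error). The sketch below applies it to the values reported above. The function name is hypothetical, and this is only a consistency check, not a replacement for the SPSS output.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# One-way ANOVA of serum cholesterol by chest pain type: F(3, 996) = 10.91.
eta_sq = partial_eta_squared(f_value=10.91, df_effect=3, df_error=996)
print(f"partial eta squared = {eta_sq:.3f}")
```

The computed value rounds to .032, in line with the reported effect size. By Cohen’s common benchmarks (.01 small, .06 medium, .14 large), this falls between a small and a medium effect.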
4) Two Way ANOVA: Do exercise-induced angina and fasting blood sugar interact to affect serum cholesterol levels?
A two-way ANOVA was conducted to determine the effects of exercise-induced angina and fasting blood sugar on serum cholesterol levels. Levene’s test indicated a violation of the homogeneity of variance assumption, p < 0.001, but the analysis proceeded due to the large sample size. There was no significant main effect of exercise-induced angina on cholesterol levels, F(1, 996) = 0.21, p = .645, partial η² < .001. However, there was a significant main effect of fasting blood sugar on serum cholesterol levels, F(1, 996) = 83.83, p < .001, partial η² = .077. That is, individuals with high fasting blood sugar had higher serum cholesterol, whereas exercise angina status made no difference to cholesterol levels. The interaction between angina and fasting blood sugar was not significant, F(1, 996) = 0.70, p = .405, partial η² = .001. Nevertheless, the overall model was statistically significant, F(3, 996) = 6.24, p < .001, although the effect size was relatively small (R² = .018, adjusted R² = .015). To conclude, the findings indicate that fasting blood sugar independently influences cholesterol, while angina and its interaction with fasting blood sugar played a negligible role in cholesterol levels. See Appendix D1.
Figure 11:
Note. Estimated Marginal Means represent mean serum cholesterol levels. Shapiro-Wilk tests indicated non-normality within all four groups (p < 0.001), and Levene’s test revealed unequal variances (p < 0.001), so caution is needed when generalizing the results to populations. The interaction plot shows serum cholesterol levels (Y-axis) for the high and normal fasting blood glucose groups against the presence or absence of exercise angina (X-axis). Despite these violations, the results indicated non-significant effects for angina and for its interaction with fasting blood sugar on serum cholesterol, while fasting blood sugar had a significant main effect on cholesterol (p < 0.001).
C) Predictive Model Statistical Analysis
Multiple Linear Regression: Do chest pain type, serum cholesterol, and resting blood pressure accurately and significantly predict maximum heart rate?
A multiple linear regression model was tested to examine whether chest pain type, serum cholesterol, and resting blood pressure significantly predicted an individual’s maximum heart rate. The regression model as a whole was statistically significant, F(3, 996) = 6.239, p < .001. The adjusted R Squared value indicated that the predictors explained around 1.5% of the variance in maximum heart rate. Among the predictors, chest pain type was a significant predictor of maximum heart rate, B = 3.45, β = .096, t(996) = 2.95, p = .003. In addition, resting blood pressure also significantly predicted maximum heart rate, B = 0.083, β = .073, t = 2.24, p = .025. On the other hand, serum cholesterol was not a significant predictor, B = 0.003, β = .013, t = 0.41, p = .685. Collinearity diagnostics were all within range: tolerance values were above 0.94 and all VIFs were below 1.1, so no severe multicollinearity was noted. See Appendix E3 and E4.
Figure 12:
Note. Resting blood pressure is measured in millimeters of mercury (mmHg) and maximum heart rate in beats per minute (bpm). This scatterplot shows the relationship between resting blood pressure and maximum heart rate, with each point representing a patient in the Heart Disease dataset. The model identifies resting blood pressure and chest pain type as small but statistically significant predictors of maximum heart rate, while serum cholesterol is not a significant predictor.
Max Heart Rate = 21.947 + (3.447 × Chest Pain) + (0.003 × Serum Cholesterol) + (0.083 × Resting BP)
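The small share of variance explained by the regression model can be checked against its F statistic. With k predictors and N cases, R² = (F × k) / (F × k + df_error), and adjusted R² = 1 - (1 - R²)(N - 1) / (N - k - 1). The sketch below applies these identities to the reported F(3, 996) = 6.239. The function names are hypothetical, and N = 1000 is taken from the sample description.

```python
def r_squared_from_f(f_value: float, k: int, df_error: int) -> float:
    """R squared recovered from the overall model F test."""
    return (f_value * k) / (f_value * k + df_error)


def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adjusted R squared for n cases and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Overall regression test reported above: F(3, 996) = 6.239, N = 1000.
r2 = r_squared_from_f(f_value=6.239, k=3, df_error=996)
adj_r2 = adjusted_r_squared(r2, n=1000, k=3)
print(f"R squared = {r2:.3f}, adjusted R squared = {adj_r2:.3f}")
```

Both values round to the figures in the output (about .018 and .015), which confirms that the model is statistically significant yet explains very little of the variation in maximum heart rate.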
Multiple Logistic Regression: Do chest pain, resting blood pressure, serum cholesterol, fasting blood sugar, and maximum heart rate significantly predict the chances that an individual has heart disease?
A binary logistic regression was conducted to determine whether or not chest pain type, serum cholesterol level, resting blood pressure, maximum heart rate, and fasting blood sugar significantly predicted the likelihood of an individual having heart disease. The overall model was statistically significant, χ²(5) = 477.73, p < .001. This suggests that these predictors reliably distinguished those who had heart disease from those who did not. The model had a Nagelkerke R Squared value of .607, meaning that it explained 60.7% of the variance in heart disease diagnosis, and it correctly classified 83.9% of the cases. Classification accuracy was 79.8% for individuals without heart disease and 86.9% for those with it. There were no signs of multicollinearity between predictors. Most of the predictors were individually statistically significant (chest pain, fasting blood sugar, maximum heart rate, resting blood pressure). Only serum cholesterol was not significant.
See Appendix F7 & F9.
Figure 13:
Note. Predicted probabilities of having heart disease based on the logistic regression model. Each data point in the chart represents a patient. Higher predicted values indicate a greater likelihood of heart disease. The model was statistically significant, χ²(5) = 477.73, p < .001.
4. Discussion / Interpretations:
The first analysis highlighted the relationship between normal and high fasting blood sugar levels and the different types of chest pain in the UCI heart disease dataset. Given that elevated blood sugar is linked to cardiovascular disease, it is imperative that the relationship between cardiac symptoms and fasting blood sugar is detailed. The chi-square test revealed a significant association between fasting blood sugar and chest pain type. Those with high fasting blood sugar most often had atypical chest pain, and those with normal fasting blood sugar most often had no chest pain. The alternative hypothesis was supported, and the data suggest that fasting blood sugar may be related to cardiac symptoms, with higher blood sugar associated with chest pain that is not typical of heart conditions. A major limitation is that the underlying background for such atypical chest pain was not detailed. Furthermore, the atypical chest pain may have occurred in patients with diabetes, a condition that does in fact correlate with typical chest pain. Additionally, chest pain is very subjective, meaning that different medical professionals may classify chest pain differently without benchmark guidelines. Due to the nature of the test, only an association can be inferred, not causation. The main implication of this analysis is to be cautious when making a diagnosis with metabolic markers. Patients who have chest pain not commonly associated with heart conditions should not automatically be ruled out for heart disease, as there was a significant relationship within this study.
The second analysis examined whether individuals with exercise-induced angina differed statistically in maximum heart rate from those without exercise-induced angina. Exercise angina commonly reflects reduced blood flow to the heart during physical exercise. As a result, there may be an underlying relationship between exercise angina and lower maximum heart rates during physical exercise, primarily because of the discomfort and weakened heart. However, the alternative hypothesis was not supported, as there was no statistically significant difference in maximum heart rate between the two groups. In addition, the confidence interval includes 0, which is consistent with the conclusion of no statistical significance. Although those without angina had a marginally greater mean maximum heart rate, the conclusion is that an individual’s ability to reach maximum heart rate was not measurably hindered by exercise angina. One important limitation to keep in mind is that the method used to obtain maximum heart rates was not noted. Some patients may have received medications while others performed physical exercise, and the near-identical results may reflect that angina alone is too narrow a variable to reveal any differences. These results imply that exercise-induced angina is not a significant predictor of maximum heart rate and that other variables should be examined instead.
The one-way ANOVA explored whether serum cholesterol levels differed significantly by chest pain type. High cholesterol, a well-known risk factor for cardiovascular disease, could interact with chest pain type (a common symptom of cardiovascular disease) to reveal a unique relationship and biological mechanism within patient profiles. The one-way ANOVA found statistically significant differences in serum cholesterol across chest pain types, but with only a small effect size (partial η² = 0.032). Individuals with no chest pain had lower serum cholesterol levels, whereas patients with atypical pain and non-anginal pain had the highest serum cholesterol levels. These results align with the notion that patients with higher serum cholesterol are more likely to experience some form of chest pain, which may in turn be linked to cardiovascular issues. However, there was one major flaw within this analysis: the normality assumption was violated according to the Shapiro-Wilk test. This means that the robustness of the results is diminished and their generalizability is limited. In addition, there were many outliers, suggesting that although the null hypothesis was rejected, these results cannot be confidently extended to a wider population. Despite this, medical professionals should consider lipid screening in patients who report chest pain, as it may be useful in the diagnosis of cardiovascular issues.
Similarly, the two-way ANOVA investigated how serum cholesterol levels were affected by the combination of fasting blood sugar and exercise-induced angina. Because both fasting blood sugar and exercise angina are risk factors correlated with heart disease, detailing their relationship to cholesterol levels can help clarify risk potential within patients. Exercise angina, however, had no statistically significant effect on serum cholesterol levels, meaning angina status was not related to serum cholesterol. On the other hand, fasting blood sugar did have a statistically significant effect on cholesterol levels, as patients with higher blood glucose had higher serum cholesterol. The interaction between angina and blood sugar was not significant, indicating that their combined influence was not greater than their individual effects. This analysis was limited in several ways. Primarily, the effect size was small, which suggests that the predictors explained only a marginal amount of the variance in cholesterol levels. In addition, normality and homogeneity of variances were violated, reducing the robustness of the results. Furthermore, only the null hypothesis for the main effect of fasting blood sugar on serum cholesterol should be rejected; the other two effects were not statistically significant. This highlights that fasting blood sugar plays a more meaningful role in serum cholesterol levels than exercise-induced angina does. Medical professionals should consider monitoring and regulating fasting blood sugar in patients with exercise angina. The other factors could still be considered, but they did not have as significant an effect as fasting blood sugar.
Turning to the predictive models, the multiple linear regression examined whether chest pain type, serum cholesterol, and resting blood pressure could significantly predict an individual’s maximum heart rate. Maximum heart rate is an important cardiovascular indicator of stress and exertion, and detecting abnormalities early could be paramount in controlling and improving the prognosis of cardiovascular diseases. Overall, the regression model was statistically significant, indicating that the combination of all three predictors contributes to explaining the variance in maximum heart rate. The problem, however, was that the adjusted R squared value was only 1.5%. This means that the predictors explained only approximately 1.5% of the variation. Individually, chest pain type and resting blood pressure were significant predictors, but serum cholesterol was not. The drastically low R squared limits the explanatory power of the model and suggests that more important predictors are missing. Collinearity (predictors being too closely related to one another) was not an issue. Even though these predictors contribute to the variance, they are not the main predictors; more important factors should be analyzed to explain maximum heart rate. Despite the weak predictors, the null hypothesis was rejected.
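For readers unfamiliar with the adjusted R squared, the short sketch below shows the standard adjustment for the number of predictors. The inputs are hypothetical values chosen for illustration, not the figures from this project's SPSS output, and the function name adjusted_r_squared is an assumption introduced here.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R^2 for a model with p predictors fit to n cases."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)


# Hypothetical example: a weak model with 3 predictors and 101 cases
adj_weak = adjusted_r_squared(r_squared=0.04, n=101, p=3)

# Hypothetical example with round numbers that are easy to verify by hand:
# 1 - 0.5 * 10 / 8 = 0.375
adj_simple = adjusted_r_squared(r_squared=0.5, n=11, p=2)

print(f"{adj_weak:.4f}", f"{adj_simple:.3f}")
```

Because the adjustment penalizes each added predictor, the adjusted value is never larger than the unadjusted R squared for a model with at least one predictor.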
Finally, the binary logistic regression was used to evaluate how well a set of predictors (chest pain, serum cholesterol, resting blood pressure, fasting blood sugar, and maximum heart rate) predicted heart disease diagnosis. This is the ultimate goal, as medical professionals want to detect cardiovascular disease and these predictors may aid in the diagnosis. As a whole, the model was statistically significant: the predictors reliably distinguished those with heart disease from those without. Moreover, the Nagelkerke R squared (a pseudo R squared) indicated a large effect size, with the model accounting for approximately 60.7% of the variance in heart disease diagnosis. The model correctly classified 83.9% of cases overall: 80% of those without heart disease and 87% of those with heart disease. All predictors were statistically significant except for serum cholesterol. One limitation of this model is that serum cholesterol may overlap with other predictors, which may explain why it was not significant. Furthermore, the model does not take into account lifestyle factors that may play a role. The findings imply that this model is useful as a cardiovascular predictor. Obtaining these variables from a patient can strongly inform the diagnosis of cardiovascular diseases. As a result, the null hypothesis was rejected.
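The classification percentages in a logistic regression come from a 2 x 2 table of predicted versus observed outcomes. The sketch below shows how overall accuracy, sensitivity, and specificity follow from such a table. The counts are hypothetical and are not the counts from this project's classification table; the function name classification_metrics is an assumption introduced here.

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts.

    tn: no disease, predicted no disease    fp: no disease, predicted disease
    fn: disease, predicted no disease       tp: disease, predicted disease
    """
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,    # share of all cases classified correctly
        "sensitivity": tp / (tp + fn),    # share of disease cases detected
        "specificity": tn / (tn + fp),    # share of non-disease cases detected
    }


# Hypothetical counts for illustration only
metrics = classification_metrics(tn=40, fp=10, fn=5, tp=45)
print(metrics)
```

Specificity corresponds to the percentage correct among patients without heart disease, and sensitivity to the percentage correct among patients with heart disease.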
References:
Aishah, I. (2020). Exploratory Data Analysis on Heart Disease UCI data set. Towards Data Science. https://towardsdatascience.com/exploratory-data-analysis-on-heart-disease-uci-data-set-ae129e47b323/
Cayley, W. E., Jr. (2005). Diagnosing the cause of chest pain. American Family Physician, 72(10), 2012-2021. https://doi.org/10.3122/jabfm.18.6.2012
Chen, Z., Peto, R., Collins, R., MacMahon, S., Lu, J., & Li, W. (1991). Serum cholesterol concentration and coronary heart disease in populations with low cholesterol concentrations. British Medical Journal, 303(6797), 276-282. https://doi.org/10.1136/bmj.303.6797.276
Emerging Risk Factors Collaboration. (2010). Diabetes mellitus, fasting blood glucose concentration, and risk of vascular disease: a collaborative meta-analysis of 102 prospective studies. The Lancet, 375(9733), 2215-2222. https://doi.org/10.1016/S0140-6736(10)60484-9
Ferrari, R., & Fox, K. (2016). Heart rate reduction in coronary artery disease and heart failure. Nature Reviews Cardiology, 13(8), 493-501. https://doi.org/10.1038/nrcardio.2016.84
Gillum, R. F. (1988). The epidemiology of resting heart rate in a national sample of men and women: associations with hypertension, coronary heart disease, blood pressure, and other cardiovascular risk factors. American Heart Journal, 116(1), 163-174. https://doi.org/10.1016/0002-8703(88)90262-1
Hlatky, M. A. (1999). Exercise testing to predict outcome in patients with angina. Journal of General Internal Medicine, 14(1), 63. https://doi.org/10.1046/j.1525-1497.1999.00283.x
Holman, R. R., & Turner, R. C. (1988). Optimizing blood glucose control in type 2 diabetes: an approach based on fasting blood glucose measurements. Diabetic Medicine, 5(6), 582-588. https://doi.org/10.1111/j.1464-5491.1988.tb01056.x
Janosi, A., Steinbrunn, W., Pfisterer, M., & Detrano, R. (1989). Heart Disease [Data set]. UCI Machine Learning Repository. https://doi.org/10.24432/C52P4X
Lenfant, C. (2010). Chest pain of cardiac and noncardiac origin. Metabolism, 59, S41-S46. https://doi.org/10.1016/j.metabol.2010.07.014
Long, L., Anderson, L., Gandhi, M., Dewhirst, A., Bridges, C., & Taylor, R. (2019). Exercise-based cardiac rehabilitation for stable angina: systematic review and meta-analysis. Open Heart, 6(1). https://doi.org/10.1136/openhrt-2018-000989
Osei-Nkwantabisa, A. S., & Ntumy, R. (2024). Classification and prediction of heart diseases using machine learning algorithms. arXiv. https://doi.org/10.48550/arXiv.2409.03697
Rifai, N., Warnick, G. R., McNamara, J. R., Belcher, J. D., Grinstead, G. F., & Frantz, I. D., Jr. (1992). Measurement of low-density-lipoprotein cholesterol in serum: a status report. Clinical Chemistry, 38(1), 150-160. https://doi.org/10.1093/clinchem/38.1.150
Sala, C., Santin, E., Rescaldani, M., & Magrini, F. (2006). How long shall the patient rest before clinic blood pressure measurement? American Journal of Hypertension, 19(7), 713-717. https://doi.org/10.1016/j.amjhyper.2005.08.021
Tanaka, H., Monahan, K. D., & Seals, D. R. (2001). Age-predicted maximal heart rate revisited. Journal of the American College of Cardiology, 37(1), 153-156. https://doi.org/10.1016/S0735-1097(00)01054-8
Videbæk, J., Laursen, H. B., Olsen, M., Høfsten, D. E., & Johnsen, S. P. (2016). Long-term nationwide follow-up study of simple congenital heart disease diagnosed in otherwise healthy children. Circulation, 133(5), 474-483. https://doi.org/10.1161/CIRCULATIONAHA.115.017226
Appendix:
Appendix A: Chi-Square Test
A1 – Case Processing Summary: Chest Pain x Fasting Blood Sugar
This table details the number of valid and missing cases used in the chi-square analysis, which tests the association between chest pain type and fasting blood sugar level.
A2 – Cross Tabulation: Chest Pain × Fasting Blood Sugar
This table highlights the frequency distribution of chest pain type across the levels of fasting blood sugar. The pain types include typical, atypical, and non-anginal chest pain.
A3 – Chi-Square Test: SPSS Output
This output details the raw results from the chi-square analysis between chest pain and fasting blood sugar. It includes the chi-square value, the degrees of freedom, and the Pearson chi-square p-value.
A4 – Symmetric Measures: Chest Pain x Fasting Blood Sugar
This table presents the strength of association between chest pain and fasting blood sugar from the chi-square analysis. Cramér’s V is used to determine the strength of association between the two variables.
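For reference, the chi-square statistic and Cramér’s V described in A3 and A4 can be computed by hand from a crosstabulation. The sketch below uses a hypothetical 4 x 2 table of counts (four chest pain categories by two fasting blood sugar levels), not the counts from this project; the function name chi_square_cramers_v is an assumption introduced here.

```python
import math


def chi_square_cramers_v(table):
    """Return (chi_square, cramers_v) for a contingency table given as rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # Cramér's V scales chi-square by sample size and the smaller table dimension
    k = min(len(table), len(table[0]))
    v = math.sqrt(chi2 / (n * (k - 1)))
    return chi2, v


# Hypothetical counts: rows = chest pain categories, columns = fasting blood sugar levels
table = [[10, 20], [20, 10], [15, 15], [15, 15]]
chi2, v = chi_square_cramers_v(table)
print(f"chi-square = {chi2:.3f}, Cramér's V = {v:.3f}")
```

Cramér’s V ranges from 0 (no association) to 1 (perfect association), which makes it easier to interpret than the raw chi-square value.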
Appendix B: Independent Samples t-Test
B1 – Test of Normality: Max Heart Rate x Exercise Angina
The table displays the Kolmogorov-Smirnov and Shapiro-Wilk tests as an assessment of normality for this independent samples t-test. Most importantly, the Shapiro-Wilk test helps determine whether max heart rate is normally distributed within each level of the independent variable (exercise angina).
B2 – Histogram for Normality: Max Heart Rate x No Exercise Angina
This histogram visually represents the distribution of maximum heart rate among patients with no exercise angina. It is not a visual representation of the independent samples t-test but rather of the distribution of the data.
B3 – Histogram for Normality: Max Heart Rate x Yes Exercise Angina
This histogram visually represents the distribution of maximum heart rate among patients with exercise angina. It is not a visual representation of the independent samples t-test.
B4 – Q-Q Plots for Normality: Max Heart Rate x No Exercise Angina
The observed Q-Q plot details the normality of maximum heart rate with no exercise angina. The assessment depends on whether the plotted data follow an approximately straight line, which helps determine whether the data are normally distributed.
B5 – Q-Q Plots for Normality: Max Heart Rate x Yes Exercise Angina
The observed Q-Q plot details the normality of maximum heart rate with the presence of exercise angina. The assessment depends on whether the plotted data follow an approximately straight line, which helps determine whether the data are normally distributed.
B6 – Independent Samples t-Test Output: Max Heart Rate x Exercise Angina
This SPSS output details the results of the t-test comparing maximum heart rate by exercise angina status. Along with the test statistic, the p-value, degrees of freedom, confidence interval, and Levene’s test for equality of variances are reported.
B7 – Independent Samples Effect Size: Max Heart Rate x Exercise Angina
This table highlights the effect size, namely Cohen’s d. This statistic shows how large the difference between the two groups is by expressing the distance between their means relative to the overall spread of the data.
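Cohen’s d, as described in B7, can be sketched in a few lines. The example below uses the pooled standard deviation as the measure of spread and hypothetical maximum heart rate values for the two groups, not the project’s data; the function name cohens_d is an assumption introduced here.

```python
import math
from statistics import mean, variance


def cohens_d(group1, group2):
    """Cohen's d using the pooled standard deviation of two independent groups."""
    n1, n2 = len(group1), len(group2)
    # variance() returns the sample variance (n - 1 in the denominator)
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / math.sqrt(pooled_var)


# Hypothetical maximum heart rates (beats per minute), illustration only
no_angina = [150, 160, 170]
angina = [130, 140, 150]
d = cohens_d(no_angina, angina)
print(f"Cohen's d = {d:.2f}")
```

Here the group means differ by 20 and the pooled standard deviation is 10, so the difference is two standard deviations.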
Appendix C: One Way ANOVA
C1 – Test For Normality: Different Levels of Chest Pain x Serum Cholesterol Levels
The table displays the Kolmogorov-Smirnov and Shapiro-Wilk tests as an assessment of normality for this one-way ANOVA. Most importantly, the Shapiro-Wilk test helps determine whether serum cholesterol levels are normally distributed within each category of chest pain (None, Typical, Atypical, Non-Anginal). This test was conducted prior to the ANOVA.
C2 – Histogram for Normality: Typical Chest Pain x Serum Cholesterol Levels
This histogram visually represents the distribution of serum cholesterol among patients with typical chest pain. It is not a visual representation of the one-way ANOVA but rather of the distribution of the data.
C3 – Histogram for Normality: No Chest Pain x Serum Cholesterol Levels
This histogram visually represents the distribution of serum cholesterol among patients with no chest pain. It is not a visual representation of the one-way ANOVA but rather of the distribution of the data.
C4 – Histogram for Normality: Non-Anginal Chest Pain x Serum Cholesterol Levels
This histogram visually represents the distribution of serum cholesterol among patients with non-anginal chest pain. It is not a visual representation of the one-way ANOVA but rather of the distribution of the data.
C5 – Q-Q Plots for Normality: No Chest Pain x Serum Cholesterol Levels
The observed Q-Q plot details the normality of serum cholesterol with no chest pain. The assessment depends on whether the plotted data follow an approximately straight line, which helps determine whether the data are normally distributed.
C6 – Q-Q Plots for Normality: Typical Chest Pain x Serum Cholesterol Levels
The observed Q-Q plot details the normality of serum cholesterol with typical chest pain. The assessment depends on whether the plotted data follow an approximately straight line, which helps determine whether the data are normally distributed.
C7 – Q-Q Plots for Normality: Atypical Chest Pain x Serum Cholesterol Levels
The observed Q-Q plot details the normality of serum cholesterol with atypical chest pain. The assessment depends on whether the plotted data follow an approximately straight line, which helps determine whether the data are normally distributed.
C8 – Q-Q Plots for Normality: Non-Anginal Chest Pain x Serum Cholesterol Levels
The observed Q-Q plot details the normality of serum cholesterol with non-anginal chest pain. The assessment depends on whether the plotted data follow an approximately straight line, which helps determine whether the data are normally distributed.
C9 – Test for Homogeneity of Variances: Chest Pain Types x Serum Cholesterol Levels
The Test of Homogeneity of Variances table highlights the Levene statistic. Levene’s statistic is needed to assess whether the variances of serum cholesterol are equal across all chest pain types. This test is used to support the one-way ANOVA.
C10 – ANOVA Output: Chest Pain Types x Serum Cholesterol Levels
This SPSS output highlights the between-groups and within-groups results for serum cholesterol across the different types of chest pain. It includes the sum of squares, F statistic, and significance value.
C11 – ANOVA Effect Size: Chest Pain Types x Serum Cholesterol Levels
This table details the effect size for the ANOVA. The eta-squared value gives the proportion of variance in serum cholesterol explained by chest pain type.
C12 – Games-Howell Multiple Comparisons Test
This table compares mean serum cholesterol levels between every pair of chest pain types. Games-Howell is a post hoc test suited to unequal variances and unequal sample sizes, as found in this dataset.
C13 – Tukey B: Chest Pain Types x Serum Cholesterol Levels
This table compares mean serum cholesterol levels across all types of chest pain. Tukey’s B is a post hoc test used to identify which specific group differences were statistically significant.
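The homogeneity of variances check described in C9 can also be illustrated outside SPSS. The sketch below computes Levene’s statistic in its classic mean-centered form, which is essentially a one-way ANOVA on each observation’s absolute deviation from its group mean. The two groups contain hypothetical cholesterol values, not the project’s data, and the function name levene_statistic is an assumption introduced here.

```python
from statistics import mean


def levene_statistic(groups):
    """Levene's W statistic using absolute deviations from each group's mean."""
    # Replace every observation by its absolute deviation from its group mean
    z = [[abs(x - mean(g)) for x in g] for g in groups]
    z_means = [mean(zi) for zi in z]
    z_grand = mean([v for zi in z for v in zi])

    n_total = sum(len(g) for g in groups)
    k = len(groups)

    # One-way ANOVA F ratio computed on the deviations
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within


# Hypothetical cholesterol values: the second group is more spread out than the first
groups = [[201, 202, 203], [200, 202, 204]]
w = levene_statistic(groups)
print(f"Levene W = {w:.2f}")
```

A large W relative to the F distribution with (k - 1, N - k) degrees of freedom indicates unequal variances, in which case a post hoc test such as Games-Howell is the more cautious choice.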
Appendix D: Two Way ANOVA
D1 – Test for Homogeneity of Variances: Serum Cholesterol Against Exercise Induced Angina and Fasting Blood Sugar
The Test of Homogeneity of Variances table highlights the Levene statistic. Levene’s statistic is needed to assess whether the variances of serum cholesterol are equal across all combinations of fasting blood sugar and exercise angina. This test is used to support the two-way ANOVA.
D2 – Test Between Subjects for Effect on Serum Cholesterol
This output table presents the main effects of exercise-induced angina and fasting blood glucose, as well as their interaction, on serum cholesterol level in the two-way ANOVA. The degrees of freedom, F statistics, partial eta squared, and p-values are all detailed.
D3 – Estimated Marginal Means of Serum Cholesterol
This table shows the average serum cholesterol level for each group based on whether someone has exercise-induced angina. The results from the table help clarify how factors such as these affect serum cholesterol as a whole.
Appendix E: Multiple Linear Regression
E1 – Variables Entered and Removed from the Model
This table highlights the predictors included in this multiple linear regression model. These variables include serum cholesterol, chest pain type, and resting blood pressure.
E2 – Model Summary for Multiple Linear Regression
This table presents the overall fit of the regression model predicting max heart rate. The important statistics included in this model summary are R (the multiple correlation coefficient) and R^2 (the proportion of variance explained by the model).
E3 – ANOVA Table for Multiple Linear Regression
This table shows whether the overall regression model significantly predicts max heart rate by testing whether the combination of all the predictors explains the variance.
E4 – Coefficients for Multiple Linear Regression
This table provides the individual effect of each predictor on maximum heart rate. Important statistics include unstandardized coefficients, standardized coefficients, the t-statistic, and VIF.
E5 – Collinearity Diagnostics for Multiple Linear Regression
This table also reports the VIF and tolerance for the predictors of maximum heart rate. This ensures that the predictors are not too highly correlated with one another.
E6 – Residual Statistics for Multiple Linear Regression
This table highlights information on the residuals, which are the differences between observed and predicted values. This helps assess the assumptions and quality of the regression model.
E7 – Case Processing Summary for Multiple Linear Regression
This table shows the number of cases for each variable and also notes whether any cases were missing. This ensures transparency about the sample size.
E8 – Test of Normality for Multiple Linear Regression
The table displays the Kolmogorov-Smirnov and Shapiro-Wilk tests as an assessment of normality for this multiple linear regression. Most importantly, the Shapiro-Wilk test helps determine whether the variables in the model are normally distributed. This test was conducted prior to the regression analysis.
E9 – Normal P-P Plot of Regression Standardized Residual
This plot details whether the residuals from the multiple linear regression model are normally distributed. Points closely following the diagonal line indicate that the normality assumption is met.
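The tolerance and VIF values mentioned in E4 and E5 follow from a simple formula: tolerance is 1 minus the R squared obtained when one predictor is regressed on the others, and VIF is the reciprocal of tolerance. The sketch below assumes the simplest case of only two predictors, where that R squared equals the squared Pearson correlation between them. The values are hypothetical, and the function names pearson_r and tolerance_and_vif are assumptions introduced here.

```python
import math
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def tolerance_and_vif(r_squared_j):
    """Tolerance and VIF given the R^2 of predictor j regressed on the other predictors."""
    tolerance = 1 - r_squared_j
    return tolerance, 1 / tolerance


# Hypothetical values for two predictors (illustration only)
cholesterol = [200, 210, 220, 230]
resting_bp = [120, 110, 140, 130]

r = pearson_r(cholesterol, resting_bp)
# With exactly two predictors, R^2 of one on the other is simply r squared
tolerance, vif = tolerance_and_vif(r ** 2)
print(f"r = {r:.2f}, tolerance = {tolerance:.2f}, VIF = {vif:.4f}")
```

A VIF close to 1 means a predictor shares little variance with the others; larger values signal growing collinearity.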
Appendix F: Multiple Logistic Regression
F1 – Dependent Coding For Logistic Regression
This details the coding for the presence of heart disease which will be utilized in the Logistic Regression. A code of 0 means no heart disease while a code of 1 means presence of heart disease. This allows for estimates based on predictor values.
F2 – Classification Table for Logistic Regression
This table summarizes how well the model predicts the presence or absence of heart disease. It reflects only the baseline prediction, before any of the predictors are included in the model.
F3 – Model Summary for Logistic Regression
This table details two pseudo R squared statistics, the Cox & Snell R Square and the Nagelkerke R Square. These indicate how much of the variation in the presence of heart disease is explained by the predictors in the logistic regression.
F4 – Variables in the Equation for Logistic Regression
This table depicts the baseline model with only the constant added. It provides the baseline log-odds of heart disease, which serves as the y-intercept.
F5 – Variables not in the Equation for Logistic Regression
This table displays the p-values for the variables not yet in the equation, indicating whether each is significant at baseline.
F6 – Classification Table for Logistic Regression (All Predictors Included)
The table presents the model’s predictive performance after all predictors have been included. It details how accurately the model classifies heart disease diagnosis based on the predictors. Sensitivity and specificity can be calculated from it.
F7 – Variables in Equation for Logistic Regression (All Predictors Included)
This table gives the B, S.E., Wald statistic, degrees of freedom, p-value, Exp(B), and confidence interval for each predictor in the final model. It details which variables significantly predict the likelihood of a heart disease diagnosis.
F8 – Correlation Matrix for Logistic Regression (All Predictors Included)
This matrix shows the Pearson correlation coefficients for all predictor variables included in the logistic regression model. It identifies correlations between the variables and helps detect potential multicollinearity.
F9 – Multicollinearity for Logistic Regression (All Predictors Included)
This table reports the VIF and tolerance for the predictors in the logistic regression model. This ensures that the predictors are not too highly correlated with one another.
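The columns described in F7 are linked by simple formulas: the Wald statistic is the squared ratio of B to its standard error, Exp(B) is the odds ratio obtained by exponentiating B, and the confidence interval for Exp(B) comes from exponentiating B plus or minus 1.96 standard errors. The sketch below uses a hypothetical coefficient and standard error, not values from this project’s output; the function names are assumptions introduced here.

```python
import math


def wald_statistic(b, se):
    """Wald chi-square statistic for a single logistic regression coefficient."""
    return (b / se) ** 2


def odds_ratio_with_ci(b, se, z=1.96):
    """Return (Exp(B), lower bound, upper bound) for an approximate 95% interval."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)


# Hypothetical coefficient and standard error (illustration only)
b, se = 0.5, 0.25
wald = wald_statistic(b, se)
odds_ratio, lower, upper = odds_ratio_with_ci(b, se)
print(f"Wald = {wald:.2f}, Exp(B) = {odds_ratio:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

An odds ratio above 1 means higher values of the predictor are associated with greater odds of a heart disease diagnosis, and a confidence interval that excludes 1 corresponds to a statistically significant predictor.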
Appendix G: Qualitative Descriptives
G1 – Descriptives & Frequencies for Chest Pain
This frequency table details the frequency of each type of chest pain across all cases in this dataset.
G2 – Descriptives & Frequencies for Fasting Blood Sugar
This frequency table details the frequency of high and low fasting blood sugar across all cases in this dataset.
G3 – Descriptives & Frequencies for Exercise Angina
This frequency table details the frequency of the presence and absence of exercise angina across all cases in this dataset.
G4 – Descriptives & Frequencies for Quantitative Variables
This descriptive table gives the mean, standard deviation, maximum, minimum, and total number of individuals for each quantitative variable.
G5 – Descriptives & Frequencies for Heart Disease Diagnosis

