Team Members:
The number of strokes in young adults has been rising in recent years, and it has become one of the most significant public health concerns. In the past, strokes mainly affected older adults, but they are now increasingly seen in young adults, raising questions about the causes. Therefore, the objective of this study is to answer the question: "How do lifestyle factors such as stress, alcohol consumption, smoking, and unhealthy eating behaviors correlate with the rising stroke incidence in young adults?" Stress significantly impacts heart health, leading to conditions that raise stroke risk. Similarly, alcohol consumption and smoking are known risk factors for many health problems, including stroke. Unhealthy eating habits are related to high average glucose levels and contribute to diabetes, a common stroke risk factor. Through this analysis, we hope to provide valuable insights into preventing and managing strokes in young adults.
The first step in our process will be to import several relevant Python libraries for this study.
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as seas
import scipy as spy
from scipy.stats import chi2_contingency
from scipy.stats import f_oneway
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve
According to "Improving the Health, Safety, and Well-Being of Young Adults: Workshop Summary", which is the summary of a workshop hosted by the Board on Children, Youth, and Families of the Institute of Medicine (IOM) and the National Research Council (NRC) in May, 2013, in the United States, young adulthood typically starts with high school graduation around age 18 and can extend into the late 20s or early 30s. Therefore, we will use data of people who are considered young adults (between the ages 18-30).
Source: https://www.kaggle.com/datasets/shashwatwork/depression-and-mental-health-data-analysis
This dataset contains 824 rows and 13 columns, offering a broad range of variables related to stress. It includes information on factors such as age, gender, occupation, the number of days the participant has stayed indoors, whether their stress is increasing daily, frustrations during the first two weeks of quarantine, significant changes in eating and sleeping habits, family history of mental disorders, changes in body weight during quarantine, extreme mood changes, difficulty coping with daily problems or stress, loss of interest in work, and feelings of mental weakness when interacting with others. These factors are crucial for our project because they help us examine whether young adults aged roughly 18 to 30 experience increasing stress levels.
Before exploring the stress data, we load the stress dataset into a variable called stress_df.
stress_df = pd.read_csv('/content/drive/MyDrive/CSV/mental_health_finaldata_1.csv')
print(stress_df)
[Output: preview of stress_df — 824 rows × 13 columns: Age, Gender, Occupation, Days_Indoors, Growing_Stress, Quarantine_Frustrations, Changes_Habits, Mental_Health_History, Weight_Change, Mood_Swings, Coping_Struggles, Work_Interest, Social_Weakness]
ANOVA is used to compare means among three or more groups and determine whether there are significant differences among them. This dataset contains four age groups: 16-20, 20-25, 25-30, and 30-Above. Therefore, we will apply an ANOVA test to the question of whether age group is correlated with growing stress, using an alpha value of 0.05. The null and alternative hypotheses for the ANOVA test are:
$H_{0}$: The age group does not have an effect on the likelihood of growing stress.
$H_{A}$: The age group does have an effect on the likelihood of growing stress.
stress_table2 = pd.crosstab(stress_df['Growing_Stress'], stress_df['Age'])
statistic, p_value = f_oneway(stress_table2['16-20'], stress_table2['20-25'], stress_table2['25-30'], stress_table2['30-Above'])
print(stress_table2)
print("P-Value:", p_value)
Age             16-20  20-25  25-30  30-Above
Growing_Stress
Maybe              60     63     66        78
No                 75     47     65        69
Yes                76     76     74        75
P-Value: 0.4821995988536243
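Since both "Growing_Stress" and "Age" are categorical, a chi-squared test of independence is the more standard check for association between them. As a supplementary cross-check (not part of the original analysis), it can be run on the same contingency table:

# Cross-check: chi-squared test of independence on the same contingency table
chi2_stat, chi2_p, dof, expected = chi2_contingency(stress_table2)
print("Chi-squared P-Value:", chi2_p)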
Besides the ANOVA test, we can use descriptive statistics to observe "Growing_Stress" for each age group.
stress_table3 = pd.crosstab(stress_df['Age'], stress_df['Growing_Stress'])
descriptive_stats = stress_table3.describe()
print(stress_table3)
print(descriptive_stats)
Growing_Stress  Maybe  No  Yes
Age
16-20              60  75   76
20-25              63  47   76
25-30              66  65   74
30-Above           78  69   75

Growing_Stress      Maybe         No        Yes
count            4.000000   4.000000   4.000000
mean            66.750000  64.000000  75.250000
std              7.889867  12.055428   0.957427
min             60.000000  47.000000  74.000000
25%             62.250000  60.500000  74.750000
50%             64.500000  67.000000  75.500000
75%             69.000000  70.500000  76.000000
max             78.000000  75.000000  76.000000
We create a graph using matplotlib showing the relation between the age groups and growing stress below.
stress_table2.plot(kind='bar', colormap='Paired')
plt.ylabel('Count')
[Figure: grouped bar chart of Growing_Stress counts by age group]
Looking at the descriptive statistics table above, we have:
Count:
Each stress category ("Maybe", "No", "Yes") has 4 data points (4 age groups).
Mean:
The average count of individuals reporting "Maybe" stress is 66.75.
The average count of individuals reporting "No" stress is 64.00.
The average count of individuals reporting "Yes" stress is 75.25.
Standard Deviation (std):
The standard deviation for "Maybe" stress is 7.89, indicating moderate variability.
The standard deviation for "No" stress is 12.06, indicating higher variability compared to "Maybe" and "Yes".
The standard deviation for "Yes" stress is 0.96, indicating very low variability.
Minimum (min):
The minimum count for "Maybe" stress is 60.
The minimum count for "No" stress is 47.
The minimum count for "Yes" stress is 74.
25%:
25% of the data for "Maybe" stress is less than or equal to 62.25.
25% of the data for "No" stress is less than or equal to 60.50.
25% of the data for "Yes" stress is less than or equal to 74.75.
Median (50%):
The median for "Maybe" stress is 64.50.
The median for "No" stress is 67.00.
The median for "Yes" stress is 75.50.
75%:
75% of the data for "Maybe" stress is less than or equal to 69.00.
75% of the data for "No" stress is less than or equal to 70.50.
75% of the data for "Yes" stress is less than or equal to 76.00.
Maximum (max):
The maximum count for "Maybe" stress is 78.
The maximum count for "No" stress is 75.
The maximum count for "Yes" stress is 76.
Summary statistics:
The "Maybe" and "Yes" stress levels have higher average counts compared to "No" stress. The "No" stress level shows the highest variability indicating that the counts are more spread out across the age groups. The "Yes" stress level has the highest average count and the lowest variability suggesting that a consistent number of individuals across age groups report high stress. These statistics provide a clear picture of how stress levels vary among different age groups.
Because the p-value is greater than the alpha value (0.4822 > 0.05), we fail to reject the null hypothesis. There is not enough evidence to suggest that age group has an effect on the likelihood of growing stress. We create a plot to visualize the relationship between age groups and growing stress:
plt.figure(figsize=(8, 6))
seas.violinplot(x='Growing_Stress', y='Age', data=stress_df, inner='quartile')
plt.xlabel('Growing Stress')
plt.ylabel('Age Group')
plt.title('Age Groups vs Growing Stress')
plt.grid(True)
plt.show()
Source: https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1310009611
This dataset provides an overview of heavy drinking rates across age groups in Canada, with heavy drinking defined as five or more drinks for males, or four or more drinks for females, on a single occasion at least once per month in the past year. More specifically, the data includes the age groups 12-17 and 18-34, which overlap with our focus group of people aged 18 to 30. For each age group, the data includes the number of people who reported heavy drinking in that year and that number expressed as a percentage of the age group. Because the data was sourced through the Canadian Community Health Survey, the sample size is sufficiently large and the participants are spread out geographically, which in turn increases the diversity of participants. Given that the data ranges from 2015 to 2022, we can analyze how drinking habits within each age group have changed over time, contributing to our overall analysis of how the risk of strokes in young people has changed in recent years. Finally, the data can be further divided between males and females, providing insight into whether the sex of a participant, combined with their alcohol use, impacts the likelihood of a stroke.
To begin, we create a variable called alcohol_df that will store the raw data from the alcohol_consumption.csv file.
alcohol_df = pd.read_csv('/content/drive/MyDrive/CSV/alcohol_consumption.csv')
print(alcohol_df)
[Output: preview of alcohol_df — 96 rows × 18 columns: REF_DATE, GEO, DGUID, Age group, Sex, Indicators, Characteristics, UOM, UOM_ID, SCALAR_FACTOR, SCALAR_ID, VECTOR, COORDINATE, VALUE, STATUS, SYMBOL, TERMINATED, DECIMALS]
Next, we will explore the data to determine if there are any columns that can be deleted.
alcohol_df['SCALAR_ID'].unique() # this output shows us that the entire column has the same value
array([0])
# Since the SCALAR_ID column doesn't provide information for the dataset, we can remove it
alcohol_df = alcohol_df.drop(columns=['SCALAR_ID'])
# Since the STATUS, SYMBOL, and TERMINATED columns are completely empty, we can remove them as well
alcohol_df = alcohol_df.drop(columns=['STATUS', 'SYMBOL', 'TERMINATED'])
# Since the UOM_ID (unit of measure ID), DGUID, SCALAR_FACTOR, VECTOR, DECIMALS, and COORDINATE
# columns don't provide data necessary to our analysis, we will remove them as well
alcohol_df = alcohol_df.drop(columns=['UOM_ID', 'DGUID', 'SCALAR_FACTOR', 'VECTOR', 'DECIMALS', 'COORDINATE'])
print(alcohol_df)
[Output: alcohol_df after dropping columns — 96 rows × 8 columns: REF_DATE, GEO, Age group, Sex, Indicators, Characteristics, UOM, VALUE]
# Upon further examination of the data, we notice that the Characteristics and UOM columns provide similar data
print(alcohol_df['Characteristics'].unique())
print(alcohol_df['UOM'].unique())
# Given that UOM is more concise, we will keep UOM and drop Characteristics
alcohol_df = alcohol_df.drop(columns=['Characteristics'])
print(alcohol_df)
['Number of persons' 'Percent']
['Number' 'Percent']
[Output: alcohol_df after dropping Characteristics — 96 rows × 7 columns: REF_DATE, GEO, Age group, Sex, Indicators, UOM, VALUE]
# Finally, we will check for duplicates
duplicated = alcohol_df[alcohol_df.duplicated()]
num_of_duplicates_alc = alcohol_df.duplicated().sum()
print(num_of_duplicates_alc)
0
Since we have removed any unnecessary columns and confirmed that there are no duplicates, we can begin our analysis of the data!
To analyze the data, we can first copy the dataframe and limit it to just the rows expressing the data as a percent of the age group. Then, we can visualize the relationship between the age group and the percent of that age group reporting heavy drinking across all years.
alcohol_df_percents = alcohol_df.copy()
alcohol_df_percents = alcohol_df_percents[alcohol_df_percents['UOM'] == 'Percent']
# To improve the graph, remove the rows with the age group as 'Total'
alcohol_df_percents = alcohol_df_percents[alcohol_df_percents['Age group'] != 'Total, 12 years and over']
df_pivot = alcohol_df_percents.pivot(index='REF_DATE', columns='Age group', values='VALUE')
for column in df_pivot.columns:
    plt.plot(df_pivot.index, df_pivot[column], marker='o', label=column)
plt.title('Percent of Age Group Reporting Heavy Drinking, Over Years')
plt.xlabel('Year')
plt.ylabel('Percent Reporting Heavy Drinking')
plt.legend(title='Age Group', bbox_to_anchor=(1.05, 1.0), loc='upper left')
plt.grid(True)
plt.show()
Hypothesis Testing:
ANOVA tests are used to compare means between three or more groups. In this case, there are five age groups to compare: 12-17, 18-34, 35-49, 50-64, and 65 and over. We can use the ANOVA test to determine whether there is a statistically significant difference in the mean heavy drinking rate reported across years. We will use a significance level, or alpha value, of 0.05. To conduct the test, our null and alternative hypotheses are as follows:
$H_{0}$: The age group does not have an effect on the mean reported rate of heavy drinking.
$H_{A}$: The age group does have an effect on the mean reported rate of heavy drinking.
# To begin, we will create arrays storing the numeric percents associated with each age group
age_12_17 = alcohol_df_percents[alcohol_df_percents['Age group'] == '12 to 17 years']['VALUE']
age_18_34 = alcohol_df_percents[alcohol_df_percents['Age group'] == '18 to 34 years']['VALUE']
age_35_49 = alcohol_df_percents[alcohol_df_percents['Age group'] == '35 to 49 years']['VALUE']
age_50_64 = alcohol_df_percents[alcohol_df_percents['Age group'] == '50 to 64 years']['VALUE']
age_65_above = alcohol_df_percents[alcohol_df_percents['Age group'] == '65 years and over']['VALUE']
results = f_oneway(age_12_17, age_18_34, age_35_49, age_50_64, age_65_above)
print(results.pvalue)
6.293707068743887e-24
Given that the resulting p-value is less than the value for alpha, we can reject the null hypothesis. As such, we can conclude that there is a difference among age groups in their average reported heavy drinking rates.
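Because rejecting the null hypothesis only tells us that at least one group mean differs, a post-hoc test could identify which pairs of age groups differ. A minimal sketch, assuming scipy >= 1.8 for tukey_hsd (this step is supplementary, not part of the original analysis):

# Post-hoc pairwise comparison of age-group means with Tukey's HSD
from scipy.stats import tukey_hsd
post_hoc = tukey_hsd(age_12_17, age_18_34, age_35_49, age_50_64, age_65_above)
print(post_hoc)  # pairwise p-values show which age groups differ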
Source: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset
This dataset has a large sample size of 5,110 people, reflecting a diverse population from rural and urban areas. The dataset contains various demographic, physiological, and behavioral factors that provide information about the samples' health problems and lifestyle choices. These attributes include gender, age, marital status, occupation, type of residence, medical conditions (such as hypertension and heart disease), BMI, and smoking status. The dataset's age range is 0.08 years (roughly one month) to 82 years, ensuring enough observations from the young adult age group (18-30 years). The smoking status variable is extensive, with the options "formerly smoked," "never smoked," "smokes," and "Unknown," providing helpful information about smoking frequency and history. This precise categorization will allow for an in-depth assessment of how different smoking habits affect the chance of a stroke in young adults.
Step 1. Load the initial dataset containing all the samples.
smoking_df = pd.read_csv('/content/drive/MyDrive/CSV/smoking_stroke.csv')
print(smoking_df)
[Output: preview of smoking_df — 5110 rows × 12 columns: id, gender, age, hypertension, heart_disease, ever_married, work_type, Residence_type, avg_glucose_level, bmi, smoking_status, stroke]
Step 2. Data cleaning -> delete duplicates and determine what to do with unknown data (is it MNAR or MCAR/MAR?)
# 1. Check for exact duplicates -> none exist (the shape is unchanged after dropping)
print(smoking_df.shape)
smoking_df = smoking_df.drop_duplicates()
print(smoking_df.shape, "\n")
# 2. Check for missing values -> BMI category is not used for this exploration thus can be ignored
missing_values = smoking_df.isnull().sum()
print(missing_values)
(5110, 12)
(5110, 12)

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64
Our dataset has 5,110 individual responses with no duplicates, and every response has a unique ID number. Since we want to explore the correlation between smoking status and stroke diagnosis, we will disregard the missing data in the bmi column.
Step 1. Define the objective: Does smoking status have an impact on the diagnosis of stroke? (For this dataset, all young adults between 18 and 30 had no stroke diagnosis, so we will explore the correlation across all ages.)
print(smoking_df)
[Output: preview of smoking_df — 5110 rows × 12 columns, unchanged from above]
Step 2. Explore the filtered dataset for enough comprehension of the data.
# 0. Explore the columns and rows of our filtered data
print("Num of Rows and Columns : " , smoking_df.shape)
# 1. Explore the columns of the data
print("\nColumns : ", smoking_df.columns)
# 2. Explore the data types of the columns -> age : float64 smoking_status : object (string)
print("\nData types :\n" , smoking_df.dtypes )
# 3. Explore the categories of smoking_status
print("\nCategories of smoking status: ", smoking_df["smoking_status"].unique())
# 4. Check for "Unknown" responses for smoking status -> roughly 30% did not respond (MNAR)
unknown_smoking_status = smoking_df[smoking_df["smoking_status"] == "Unknown"]
print("\nMissing Num of Rows and Columns :", unknown_smoking_status.shape)
Num of Rows and Columns : (5110, 12)

Columns : Index(['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'Residence_type', 'avg_glucose_level', 'bmi', 'smoking_status', 'stroke'], dtype='object')

Data types :
id                     int64
gender                object
age                  float64
hypertension           int64
heart_disease          int64
ever_married          object
work_type             object
Residence_type        object
avg_glucose_level    float64
bmi                  float64
smoking_status        object
stroke                 int64
dtype: object

Categories of smoking status: ['formerly smoked' 'never smoked' 'smokes' 'Unknown']

Missing Num of Rows and Columns : (1544, 12)
Our categorical values for the smoking_status column are "Unknown," "formerly smoked," "never smoked," and "smokes." We assume this data is missing not at random (MNAR), since some individuals did not wish to disclose their smoking behaviors. If these individuals made up less than 1% of our data, we would simply drop them; however, they are approximately 30% of our data. Thus, we have decided to treat "Unknown" as its own category when studying the correlation between smoking status and stroke diagnosis.
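As a quick sanity check on the figure quoted above, the share of "Unknown" responses can be computed directly:

# Share of "Unknown" smoking_status responses (1544 of 5110, roughly 30%)
unknown_share = (smoking_df["smoking_status"] == "Unknown").mean()
print(f"Share of Unknown responses: {unknown_share:.1%}")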
Step 3. Hypothesis testing using ANOVA to test the correlation between "smoking_status" and "stroke" (across all ages, as noted above).
< Hypothesis >
$H_{0}$: Smoking status does not have an effect on the likelihood of stroke.
$H_{A}$: Smoking status does have an effect on the likelihood of stroke.
Step 3.1 Since we have categorical values for smoking status, we will convert them to numeric values before applying ANOVA.
# Convert the smoking status categories to numeric codes
def convert_status(status):
    if status == "never smoked":
        return 0
    elif status == "formerly smoked":
        return 1
    elif status == "smokes":
        return 2
    elif status == "Unknown":
        return -1

smoking_df["smoking_status"] = smoking_df["smoking_status"].apply(convert_status)
print(smoking_df)
[Output: smoking_df with smoking_status encoded as -1/0/1/2 — 5110 rows × 12 columns]
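As a side note, the same encoding can be written in one line with Series.map and a dictionary. A sketch of the equivalent idiom, shown commented out because the column has already been converted above:

# Equivalent one-line encoding with a dictionary
status_codes = {"never smoked": 0, "formerly smoked": 1, "smokes": 2, "Unknown": -1}
# smoking_df["smoking_status"] = smoking_df["smoking_status"].map(status_codes)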
Step 3.2 Use ANOVA to conduct a test on our encoded dataset.
# Create and display a contingency table used for our testing
smoking_stroke_table2 = pd.crosstab(smoking_df['stroke'], smoking_df['smoking_status'])
print(smoking_stroke_table2)
# ANOVA Testing to find P-Value
result = f_oneway(smoking_stroke_table2[0],smoking_stroke_table2[1], smoking_stroke_table2[2], smoking_stroke_table2[-1])
print("P-Value: ", result.pvalue)
smoking_status    -1     0    1    2
stroke
0               1497  1802  815  747
1                 47    90   70   42
P-Value:  0.9018883799637181
Step 4. Visualize the statistic result.
contingency = pd.crosstab(smoking_df['smoking_status'], smoking_df['stroke'])
contingency.plot(kind= "bar")
[Figure: bar chart of stroke counts by encoded smoking status]
We fail to reject the null hypothesis because the p-value exceeds the significance level (0.9019 > 0.05). This means we cannot show that smoking status influences the likelihood of having a stroke. No post-hoc testing is needed, since there was no statistically significant difference between the four smoking status categories with respect to stroke diagnosis.
plt.figure(figsize=(8, 6))
seas.violinplot(x='smoking_status', y='stroke', data=smoking_df, inner='quartile')
plt.ylabel('Stroke')
plt.xlabel('Encoded Smoking Status (-1: Unknown, 0: never smoked, 1: formerly smoked, 2: smokes)')
plt.title('Smoking Status and Stroke')
plt.grid(True)
plt.show()
Source: https://www.kaggle.com/datasets/jillanisofttech/brain-stroke-dataset
The dataset consists of 4,981 entries and 11 features, which encompass patient demographics, medical history, average glucose levels, and stroke diagnosis. It includes both categorical and numerical data types. No features appear over-represented according to the summary statistics. We can observe that the correlation among most features is low, suggesting that each contributes independently to determining the stroke diagnosis. Boxplots and z-scores indicate that there are no significant outliers in age, average glucose level, or BMI. The binary nature of the target variable, stroke, suggests that a classification approach is appropriate for the primary analysis technique. These findings will inform the necessary preprocessing steps and help in selecting the appropriate analysis method to ensure thorough and accurate modeling.
In this step, we will take a look at the relationship between unhealthy eating behaviors (linked to increased average glucose levels) and the occurrence of stroke. First, we create a dataframe named "glucose_df", read the "brain_stroke.csv" file into it, and display the dataframe.
glucose_df = pd.read_csv('/content/drive/MyDrive/CSV/brain_stroke.csv')
print(glucose_df)
[Output: preview of glucose_df — 4981 rows × 11 columns: gender, age, hypertension, heart_disease, ever_married, work_type, Residence_type, avg_glucose_level, bmi, smoking_status, stroke]
As for the age feature, for this specific dataset, we will NOT filter the age. After applying the age filter to the dataset on unhealthy eating behaviors, we found that there were no stroke cases among young adults aged 18-30. This lack of data made it difficult to conduct a meaningful analysis within this specific age group. Despite this, we recognized the overall value of the dataset and decided to broaden our analysis to include all age groups. This approach allowed us to extract meaningful statistics and insights, which are still highly relevant to understanding the impact of dietary habits on stroke risk across a wider population.
Now, we will compute the Pearson correlation coefficient (r) between the two variables, "avg_glucose_level" and "stroke," to quantify their relationship. The Pearson correlation coefficient indicates the strength and direction of the linear relationship between an individual's average glucose level and their stroke status.
# Calculate the Pearson correlation coefficient (r) between "avg_glucose_level" and "stroke"
correlation_coefficient = glucose_df['avg_glucose_level'].corr(glucose_df['stroke'])
# Create a plot to visualize the relationship between "avg_glucose_level" and "stroke"
import seaborn as seas
plt.figure(figsize=(8, 6))
seas.violinplot(x='stroke', y='avg_glucose_level', data=glucose_df, inner='quartile')
#plt.scatter(glucose_df['avg_glucose_level'], glucose_df['stroke'], alpha=0.5)
plt.ylabel('Average Glucose Level')
plt.xlabel('Stroke Status')
plt.title(f'Average Glucose Level vs Stroke (Correlation Coefficient: {correlation_coefficient:.2f})')
plt.grid(True)
plt.show()
print("Pearson Correlation Coefficient (r):" , correlation_coefficient)
Pearson Correlation Coefficient (r): 0.13322732663313727
Unhealthy Eating Behaviors & Pearson Correlation Coefficient Conclusion:
The Pearson Correlation Coefficient (r) calculated for the relationship between 'avg_glucose_level' and 'stroke' is 0.133. This indicates a positive, but relatively weak, linear correlation between an individual's average glucose levels and their stroke status. Although there is a direct relationship, suggesting that higher glucose levels might be associated with an increased risk of stroke, the strength of this correlation is modest at best. This result implies that while average glucose levels are a factor in stroke risk, they are likely just one of multiple contributing factors. Therefore, further analysis involving additional variables and larger datasets may be necessary to fully understand the complex interactions that lead to stroke in young adults aged 18-30.
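As a supplementary check (not part of the original analysis), scipy's pearsonr returns the same coefficient together with a p-value for the null hypothesis of zero correlation:

# Pearson r with an accompanying p-value
from scipy.stats import pearsonr
r, r_pvalue = pearsonr(glucose_df['avg_glucose_level'], glucose_df['stroke'])
print("r:", r, "p-value:", r_pvalue)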
Source: https://www.kaggle.com/datasets/teamincribo/stroke-prediction
The dataset contains 15,000 entries and 22 features, including patient demographics, medical history, lifestyle factors, and stroke diagnosis. The features are a mix of categorical and numerical data. Based on the summary statistics, there is no indication of any feature being over-represented. Age has a moderate positive correlation with hypertension and heart disease, while average glucose level and BMI have very low correlations with the other features, indicating that they may contribute independently to the diagnosis. Boxplots and z-scores do not reveal outliers in age, average glucose level, BMI, or stress levels. The target variable, diagnosis, is binary, suggesting a classification approach for the primary analysis technique. Numerical features will require standardization or normalization, and categorical features will need appropriate encoding. Additionally, the "Symptoms" feature has missing values, which could be handled by imputation or exclusion, though that is not necessary for our analysis. These insights will help guide the preprocessing steps and the choice of analysis technique.
In this step, we create a dataframe named "stroke_df", read the "stroke_prediction_dataset.csv" file into it, and display the dataframe.
stroke_df = pd.read_csv('/content/drive/MyDrive/CSV/stroke_prediction_dataset.csv')
print(stroke_df)
[Output: preview of stroke_df — 15000 rows × 22 columns, including Patient ID, Patient Name, Age, Gender, Hypertension, Heart Disease, Marital Status, Work Type, Residence Type, Average Glucose Level, Body Mass Index (BMI), Smoking Status, Alcohol Intake, Physical Activity, Stroke History, Family History of Stroke, Dietary Habits, Stress Levels, Blood Pressure Levels, Cholesterol Levels, Symptoms, Diagnosis]
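As a preview of the preprocessing noted above, here is a minimal sketch of standardizing a few numeric features and one-hot encoding a few categorical ones. The column names come from this dataset, but the particular feature selection is only an illustrative assumption:

# Standardize numeric features and one-hot encode categorical ones (illustrative)
numeric_cols = ['Age', 'Average Glucose Level', 'Body Mass Index (BMI)', 'Stress Levels']
categorical_cols = ['Gender', 'Work Type', 'Residence Type', 'Smoking Status']
scaler = StandardScaler()                               # imported above
X_num = scaler.fit_transform(stroke_df[numeric_cols])   # zero mean, unit variance
X_cat = pd.get_dummies(stroke_df[categorical_cols])     # indicator columns per category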
We focus on young adults aged 18 to 30. By isolating this age group from the broader dataset, we aim to examine potential risk factors and stroke incidences in this age group.
young_adults = stroke_df[(stroke_df['Age'] >= 18) & (stroke_df['Age'] <= 30)].copy()  # .copy() avoids SettingWithCopyWarning when adding columns later
print(young_adults)
[Output: preview of young_adults — 2662 rows × 22 columns]
Let us clean the young_adults dataframe before analysis. First, check for any duplicates.
duplicate_rows = young_adults[young_adults.duplicated()]
num_duplicates = young_adults.duplicated().sum()
print(duplicate_rows)
Empty DataFrame — 0 rows × 22 columns (no duplicate rows found)
No duplicates were found. Detect any outliers.
# Function to flag rows containing numeric values with |z-score| above a threshold
def z_score(df, threshold=3.5):
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    z_scores = np.abs((df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std())
    outliers = (z_scores > threshold).any(axis=1)
    return outliers
# Detect outliers
outliers = z_score(young_adults)
# Show outliers
print("Outliers detected:")
print(young_adults[outliers])
Outliers detected:
Empty DataFrame — 0 rows × 22 columns (no outliers found)
No outliers were found. The young_adults dataframe is clean and ready for use!
Let's begin by using a chi-squared test on whether stress level is correlated with stroke diagnosis. The chi-squared test is used when we want to determine whether there is a significant association between two categorical variables. In this situation, we want to know whether there is a relationship between two categorical variables: the stress level and whether the patient was diagnosed with a stroke, so the chi-squared test is suitable. We assume an alpha value of 0.05. The following statements are the null and alternative hypotheses for the chi-squared test.
$H_{0}$: The stress level does not have an effect on the likelihood of stroke diagnosis in the patient.
$H_{A}$: The stress level does have an effect on the likelihood of stroke diagnosis in the patient.
Firstly, we categorize stress levels into 10 groups: 0-1, 1-2, 2-3, 3-4, 4-5, 5-6, 6-7, 7-8, 8-9, and 9-10, by creating a new column called "New Stress Levels".
conditions = [
(young_adults['Stress Levels'] >= 0.00) & (young_adults['Stress Levels'] <= 1.00),
(young_adults['Stress Levels'] >= 1.01) & (young_adults['Stress Levels'] <= 2.00),
(young_adults['Stress Levels'] >= 2.01) & (young_adults['Stress Levels'] <= 3.00),
(young_adults['Stress Levels'] >= 3.01) & (young_adults['Stress Levels'] <= 4.00),
(young_adults['Stress Levels'] >= 4.01) & (young_adults['Stress Levels'] <= 5.00),
(young_adults['Stress Levels'] >= 5.01) & (young_adults['Stress Levels'] <= 6.00),
(young_adults['Stress Levels'] >= 6.01) & (young_adults['Stress Levels'] <= 7.00),
(young_adults['Stress Levels'] >= 7.01) & (young_adults['Stress Levels'] <= 8.00),
(young_adults['Stress Levels'] >= 8.01) & (young_adults['Stress Levels'] <= 9.00),
(young_adults['Stress Levels'] >= 9.01) & (young_adults['Stress Levels'] <= 10.00)
]
# create a list of the values we want to assign for each condition
values = ['0-1', '1-2', '2-3', '3-4', '4-5', '5-6', '6-7', '7-8', '8-9', '9-10']
# create a new column and use np.select to assign values to it using our lists as arguments
young_adults['New Stress Levels'] = np.select(conditions, values)
print(young_adults)
[Output: young_adults with the added New Stress Levels column — 2662 rows × 23 columns]
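The interval conditions above can also be expressed more compactly with pd.cut. A sketch of the equivalent idiom, shown commented out because the column has already been created:

# Equivalent binning with pd.cut
bin_edges = list(range(11))  # 0, 1, ..., 10
bin_labels = ['0-1', '1-2', '2-3', '3-4', '4-5', '5-6', '6-7', '7-8', '8-9', '9-10']
# young_adults['New Stress Levels'] = pd.cut(young_adults['Stress Levels'],
#     bins=bin_edges, labels=bin_labels, include_lowest=True).astype(str)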
We create a contingency table and display it.
stress_table = pd.crosstab(young_adults['New Stress Levels'], young_adults['Diagnosis'])
print(stress_table)
Diagnosis          No Stroke  Stroke
New Stress Levels
0-1                      148     131
1-2                      111     155
2-3                      130     135
3-4                      126     124
4-5                      139     112
5-6                      143     138
6-7                      136     127
7-8                      153     148
8-9                      117     141
9-10                     129     119
Next, we create a plot showing the relationship between the stress levels and the counts of patients having a stroke.
stress_table.plot(kind='bar', colormap='Paired')
plt.ylabel('Count')
[Figure: grouped bar chart of stroke diagnosis counts by stress level group]
Now, we display the p-value by applying the chi-squared test using the chi2_contingency() function.
chi2, p_value, dof, expected = chi2_contingency(stress_table)
print("P-Value:", p_value)
P-Value: 0.10741461278569753
Because the p-value is greater than the alpha value (0.1074 > 0.05), we fail to reject the null hypothesis. There is not enough evidence to suggest that stress level has an effect on the likelihood of being diagnosed with a stroke. We create a plot to visualize the relationship between "New Stress Levels" and stroke diagnosis:
plt.figure(figsize=(8, 6))
seas.violinplot(x='Diagnosis', y='New Stress Levels', data=young_adults, inner='quartile')
plt.ylabel('New Stress Levels')
plt.xlabel('Stroke Diagnosis')
plt.title('Stress Levels vs Stroke Diagnosis')
plt.grid(True)
plt.show()
To examine the relationship between Alcohol Intake and a Stroke Diagnosis, we can use the chi-squared test, which allows us to compare the two categorical variables. The first variable, Alcohol Intake, is organized into four categories: 'Never', 'Rarely', 'Social Drinker', and 'Frequent Drinker'. The second variable, Stroke Diagnosis, is either 'No Stroke' or 'Stroke'.
For the Chi-Squared Test, we can assume a significance level, or alpha value, of 0.05. Our Null and Alternate hypotheses are as follows:
$H_{0}$: The alcohol intake category does not have an impact on stroke diagnosis.
$H_{A}$: The alcohol intake category does have an impact on stroke diagnosis.
Because the dataset already includes discrete categories to describe alcohol consumption levels, we can immediately create a contingency table.
alcohol_level_table = pd.crosstab(young_adults['Alcohol Intake'], young_adults['Diagnosis'])
print(alcohol_level_table)
Diagnosis         No Stroke  Stroke
Alcohol Intake
Frequent Drinker        338     324
Never                   336     296
Rarely                  328     366
Social Drinker          330     344
We can visualize the same relationship using a bar graph.
alcohol_level_table.plot(kind='bar')
plt.title('Alcohol Intake vs Stroke Diagnosis')
plt.legend(title='Diagnosis', bbox_to_anchor=(1.05, 1.0), loc='upper left')
plt.ylabel('Count')
[Figure: grouped bar chart of stroke diagnosis counts by alcohol intake category]
Using the chi2_contingency function, we can conduct the chi-squared test to determine if there is a relationship between alcohol intake categories and stroke diagnosis.
# Conduct the chi2 test on the contingency table
chi2, p_value, dof, expected = chi2_contingency(alcohol_level_table)
print("P-Value for Alcohol Consumption vs Diagnosis:", p_value)
P-Value for Alcohol Consumption vs Diagnosis: 0.15787917256302525
Because the p-value obtained is greater than alpha, we cannot reject the null hypothesis. As such, we conclude that we do not have sufficient evidence that alcohol intake has an impact on the likelihood of a stroke diagnosis.
We use the chi-squared test to determine the correlation between the "Formerly Smoked", "Non-smoker", and "Currently Smokes" categories of Smoking Status and the likelihood of stroke. The chi-squared test is suitable since our variables, "Smoking Status" and "Diagnosis", are both categorical.
Hypothesis:
$H_{0}$: Smoking status does not have an effect on the likelihood of stroke for young adults.
$H_{A}$: Smoking status does have an effect on the likelihood of stroke for young adults.
Step 1. Create a contingency table of the variables "Smoking Status" and "Diagnosis"
# Strip whitespace from the categorical columns for accurate results
young_adults["Smoking Status"] = young_adults["Smoking Status"].str.strip()
young_adults["Diagnosis"] = young_adults["Diagnosis"].str.strip()
# Contingency Table
smoking_stroke_table = pd.crosstab(young_adults["Smoking Status"],young_adults["Diagnosis"])
print(smoking_stroke_table)
Diagnosis         No Stroke  Stroke
Smoking Status
Currently Smokes        449     466
Formerly Smoked         415     408
Non-smoker              468     456
Step 2. Show a bar graph to visually compare the relationship between smoking status and the diagnosis of stroke.
# Data comparison - bar graph (well suited to comparing discrete categories)
smoking_stroke_table.plot(kind="bar")
[Figure: grouped bar chart of stroke diagnosis counts by smoking status]
Step 3. Find the p-value to determine whether the diagnosis differs significantly depending on smoking status. Our significance level is 0.05.
# P-Value
result = chi2_contingency(smoking_stroke_table)
print(result.pvalue)
0.7673106445186333
Step 4. Conclusion of Hypothesis Testing
Because the p-value is 0.7673 > 0.05, there is not enough statistical evidence to show an impact of smoking status on stroke diagnosis. There are no significant differences between the levels of smoking status with respect to the diagnosis of stroke.
Step 5. Visualize the correlation between the smoking status and the stroke diagnosis.
plt.figure(figsize=(8, 6))
seas.violinplot(x='Diagnosis', y='Smoking Status', data=young_adults, inner='quartile')
plt.ylabel('Smoking Status')
plt.xlabel('Stroke Diagnosis')
plt.title('Smoking Status vs Stroke Diagnosis')
plt.grid(True)
plt.show()
A chi-squared test was chosen here because it helps us examine the independence between a categorical independent variable (the average glucose level category) and a binary outcome (the presence of stroke). As we stated at the beginning, we are analyzing data from young adults between the ages of 18 and 30. Therefore, we will use the young_adults dataframe, which contains only the people in stroke_df (the main dataset) who are considered young adults.
print(young_adults)
[Output: preview of young_adults — 2662 rows × 23 columns, as shown above]
In preparation for conducting a Chi-square test, we will categorize the "Average Glucose Levels" of individuals into three distinct groups based on guidelines provided by the Cleveland Clinic. Specifically, glucose levels below 117.0 will be classified as 'Normal', levels from 117.0 to 137.0 will be considered 'Pre-diabetic', and any levels exceeding 137.0 will be labeled as 'Diabetic' (Cleveland Clinic, 2022). This categorization will allow us to systematically analyze the association between glucose levels and other variables in our dataset.
def categorize_glucose_level(agl):
    if agl < 117.0:
        return 'Normal'
    elif agl < 137.0:
        return 'Pre-diabetic'
    else:
        return 'Diabetic'

young_adults.loc[:, 'Category'] = young_adults['Average Glucose Level'].apply(categorize_glucose_level)
print(young_adults)
[Output: young_adults with the added Category column — 2662 rows × 24 columns]
Next, we will apply some hypothesis testing.
$H_{0}$: The category of the average glucose levels does not have an effect on the likelihood of stroke occurrence in the patient
$H_{A}$: The category of the average glucose levels does have an effect on the likelihood of stroke occurrence in the patient
Our plan is to apply the Chi-Squared Test. For that, we first need a contingency table. We create one and display it below.
cont = pd.crosstab(young_adults['Category'], young_adults['Diagnosis'])
print(cont)
Diagnosis     No Stroke  Stroke
Category
Diabetic            592     616
Normal              541     527
Pre-diabetic        199     187
Next, we create a plot showing the relationship between the average glucose level categories and the occurrence of stroke.
cont.plot(kind='bar', title='Average Glucose Level vs Stroke (Contingency Table)', xlabel='Average Glucose Level Category', ylabel='Number of Patients')
[Figure: bar chart "Average Glucose Level vs Stroke (Contingency Table)"; x-axis: Average Glucose Level Category, y-axis: Number of Patients]
Next, we conduct the chi-squared test using the chi2_contingency() function and display the resulting p-value.
ob = spy.stats.contingency.chi2_contingency(cont)
print(ob.pvalue)
0.5969342118907002
Based on the obtained P-Value, determine whether to reject or fail to reject the null hypothesis:
In hypothesis testing, we have set our class's significance level to 0.05, and this value acts as a threshold for determining whether the p-value indicates a statistically significant result. If the p-value is less than or equal to 0.05, we reject the null hypothesis; otherwise, we fail to reject it. Since 0.596934 > 0.05, the p-value is well above the threshold, which suggests that the observed data is quite likely under the null hypothesis of no association between the average glucose level category and the likelihood of stroke occurrence.
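As a minimal sketch of this decision rule (our addition, assuming the `ob` result object from the test above and the class's alpha of 0.05):
# Minimal sketch of the decision rule described above (alpha = 0.05 per the class convention)
alpha = 0.05
if ob.pvalue <= alpha:
    print("Reject the null hypothesis: evidence of an association.")
else:
    print("Fail to reject the null hypothesis: no significant association detected.")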
Unhealthy Eating Behaviors & Chi-Square Test Conclusion:
The obtained P-Value of 0.596934 implies that there is not enough statistical evidence to conclude that the category of the average glucose levels affects the likelihood of stroke occurrence, meaning we fail to reject the null hypothesis.
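To quantify just how weak this association is, one optional follow-up (a sketch, not part of the original analysis) is to compute an effect size such as Cramér's V from the chi-squared statistic:
# Optional sketch: Cramér's V effect size from the chi-squared result (ob) and table (cont)
chi2_stat = ob.statistic
n = cont.to_numpy().sum()
k = min(cont.shape) - 1
cramers_v = np.sqrt(chi2_stat / (n * k))
print(f"Cramér's V: {cramers_v:.4f}")
Values near 0 indicate a negligible association, which would be consistent with the high p-value obtained above.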
In this section, based on the results in the previous sections, we will apply machine learning techniques (classification and regression) to the Stroke Data to explore how lifestyle factors such as stress, alcohol consumption, smoking, and unhealthy eating behaviors correlate with the rising incidence of strokes in young adults.
First of all, we display our data again.
print(stress_table)
Diagnosis          No Stroke  Stroke
New Stress Levels
0-1                      148     131
1-2                      111     155
2-3                      130     135
3-4                      126     124
4-5                      139     112
5-6                      143     138
6-7                      136     127
7-8                      153     148
8-9                      117     141
9-10                     129     119
Looking at the above data, we can treat the stress levels (using the midpoint of each bin) and the stroke incidence counts as numeric variables. Therefore, linear regression is a reasonable choice for analyzing the relationship between stress levels and stroke incidence.
We will convert the provided data into a suitable format for regression analysis by creating a dataset with stress levels and stroke incidences. The independent variable (predictor) will be the stress levels, and the dependent variable (response) will be the number of stroke incidences.
data = {
    "Stress Levels": [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5],
    "Stroke Incidence": [131, 155, 135, 124, 112, 138, 127, 148, 141, 119]
}
stress2_df = pd.DataFrame(data)
Then, we will use linear regression to model the relationship between stress levels and stroke incidence.
X = stress2_df["Stress Levels"].values.reshape(-1, 1)
y = stress2_df["Stroke Incidence"].values
model = LinearRegression()
Next, we will train the model by fitting the regression model to the data.
model.fit(X, y)
LinearRegression()
In this step, we will interpret the regression coefficients to understand the correlation and display the results.
coef = model.coef_[0]
intercept = model.intercept_
r_squared = model.score(X, y)
print("Regression Coefficient (Slope):", coef)
print("\n Intercept:", intercept)
print("\n R-squared:", r_squared)
Regression Coefficient (Slope): -0.6424242424242423
Intercept: 136.21212121212122
R-squared: 0.02182595182595204
Finally, we will plot the data and the regression line.
plt.scatter(stress2_df["Stress Levels"], stress2_df["Stroke Incidence"], color='blue')
plt.plot(stress2_df["Stress Levels"], model.predict(X), color='red')
plt.xlabel('Stress Levels')
plt.ylabel('Stroke Incidence')
plt.title('Stress Levels vs Stroke Incidence')
plt.show()
Analysis of the Results
The negative slope (about -0.64) suggests that as stress levels increase, stroke incidence tends to decrease slightly.
The intercept (about 136.2) is the expected stroke incidence when the stress level is 0.
The R-squared value indicates that only about 2.2% of the variance in stroke incidence is explained by stress levels, which signals a very weak relationship.
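As a quick check (a sketch; linregress is our addition, not part of the original analysis), scipy can report a p-value for the slope, which would confirm whether this weak fit is statistically significant:
# Optional sketch: significance of the slope via scipy.stats.linregress
from scipy.stats import linregress
res = linregress(stress2_df["Stress Levels"], stress2_df["Stroke Incidence"])
print(f"slope={res.slope:.3f}, r^2={res.rvalue**2:.4f}, p-value={res.pvalue:.3f}")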
# Create a new dataframe
alcohol_stroke_ml_data = young_adults[["Alcohol Intake", "Diagnosis"]].copy()  # copy to avoid modifying a slice
# One-hot encode the alcohol intake status
all_consumption_levels = alcohol_stroke_ml_data['Alcohol Intake'].str.get_dummies(sep=', ')
alcohol_stroke_ml_data = pd.concat([alcohol_stroke_ml_data, all_consumption_levels], axis=1).drop(columns=['Alcohol Intake'])
# Change the Stroke / No Stroke labels to 1 and 0
alcohol_stroke_ml_data.replace({"Stroke":1, "No Stroke":0}, inplace=True)
# Create a decision tree model
alcohol_model = DecisionTreeClassifier()
X_alc = alcohol_stroke_ml_data.drop('Diagnosis', axis=1)
Y_alc = alcohol_stroke_ml_data['Diagnosis']
x_alc_train, x_alc_test, y_alc_train, y_alc_test = train_test_split(X_alc, Y_alc, test_size= 0.2, random_state = 42)
alcohol_model.fit(x_alc_train, y_alc_train)
DecisionTreeClassifier()
# Evaluate the Performance
predict_al = alcohol_model.predict(x_alc_test)
accuracy = accuracy_score(y_alc_test, predict_al)
print(f"Accuracy of predictions using DecisionTree: {accuracy}")
print(classification_report(y_alc_test, predict_al))
Accuracy of predictions using DecisionTree: 0.4896810506566604
              precision    recall  f1-score   support

           0       0.49      0.49      0.49       267
           1       0.49      0.49      0.49       266

    accuracy                           0.49       533
   macro avg       0.49      0.49      0.49       533
weighted avg       0.49      0.49      0.49       533
al_matrix = confusion_matrix(y_alc_test, predict_al)
plt.figure(figsize=(8, 6))
seas.heatmap(al_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['No Stroke', 'Stroke'], yticklabels=['No Stroke', 'Stroke'])
plt.title('Confusion Matrix of Decision Tree Predictions')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
Analysis of the Results
The accuracy falls below 50%, meaning the model performs worse than chance at predicting a stroke. This demonstrates that drinking patterns alone are not enough data for a model to predict a stroke.
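To make "worse than chance" concrete, one optional comparison (a sketch; DummyClassifier is our addition) is a majority-class baseline trained on the same split:
# Optional sketch: majority-class baseline for comparison
from sklearn.dummy import DummyClassifier
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(x_alc_train, y_alc_train)
print(f"Baseline accuracy: {baseline.score(x_alc_test, y_alc_test):.3f}")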
We will use a Random Forest model to examine the correlation between Smoking Status and Stroke Diagnosis and whether predictions can be made from this feature. Since predicting how smoking impacts diagnosis is a binary classification task, Random Forest is a good method for the purpose of ensembling.
# Feature : Smoking Status (categorical categories)
print (young_adults['Smoking Status'].unique())
# Label(prediction) : Diagnosis (categorical categories)
print (young_adults['Diagnosis'].unique())
#print(young_adults)
# Create a DataFrame for our model with the smoking feature and the label
# (more features would be added later to help prevent underfitting)
smoking_stroke_ml_df = young_adults[["Smoking Status", "Diagnosis"]].copy()
# Convert the categorical values to numeric values
def convert_status2(status):
    if status == "Non-smoker":
        return 0
    elif status == "Formerly Smoked":
        return 1
    elif status == "Currently Smokes":
        return 2

def convert_diagnosis2(status):
    if status == "No Stroke":
        return 0
    elif status == "Stroke":
        return 1

# Apply the conversions and display the converted values
smoking_stroke_ml_df.loc[:, "Smoking Status"] = smoking_stroke_ml_df["Smoking Status"].apply(convert_status2)
smoking_stroke_ml_df.loc[:, "Diagnosis"] = smoking_stroke_ml_df["Diagnosis"].apply(convert_diagnosis2).astype(int)
print(smoking_stroke_ml_df)
# Display the unique values of the converted feature
print(smoking_stroke_ml_df['Diagnosis'].unique())
print(smoking_stroke_ml_df['Smoking Status'].unique())
['Formerly Smoked' 'Non-smoker' 'Currently Smokes']
['Stroke' 'No Stroke']
       Smoking Status  Diagnosis
2                   1          1
12                  1          1
19                  0          1
25                  2          1
37                  1          1
...               ...        ...
14972               2          0
14974               1          1
14981               2          1
14983               2          1
14991               0          0

[2662 rows x 2 columns]
[1 0]
[1 0 2]
# Split the Data to training and testing data with testing being 0.2
X = smoking_stroke_ml_df[["Smoking Status"]]
y = smoking_stroke_ml_df['Diagnosis'] >= 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state = 42)
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  # apply the scaling fit on the training data (do not refit on test data)
# Create our model
smoking_model = RandomForestClassifier()
# Train the data using RandomForestClassifier -> ensembling
smoking_model.fit(X_train, y_train)
RandomForestClassifier()
# Evaluate the Performance
predictions = smoking_model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy of prediction: {accuracy}")
print(classification_report(y_test, predictions))
print("The accuracy of this model is not overall ideal")
Accuracy of prediction: 0.4521575984990619
              precision    recall  f1-score   support

       False       0.47      0.66      0.55       267
        True       0.42      0.24      0.31       266

    accuracy                           0.45       533
   macro avg       0.44      0.45      0.43       533
weighted avg       0.44      0.45      0.43       533

The accuracy of this model is not overall ideal
Analysis of the Results
This model does not perform well in distinguishing the diagnosis of stroke. Accuracy is near 0.5, meaning its performance is similar to random guessing. The f1-scores for both classes are also low, indicating that further investigation is needed into why this is happening.
# Use a confusion matrix to visualize the true/false positives and negatives
smoking_conf_matrix = confusion_matrix(y_test, predictions)
plt.figure(figsize=(12, 6))
plt.xlabel('Prediction')
plt.ylabel('Actual')
plt.title('Confusion Matrix of Smoking Status VS Diagnosis of Stroke')
seas.heatmap(smoking_conf_matrix, annot=True, fmt="d", xticklabels=["No Stroke", "Stroke"], yticklabels=["No Stroke", "Stroke"])
plt.show()
# Use the correlation matrix to comprehend why the model resulted in having poor performance
smoking_corr_matrix = smoking_stroke_ml_df.corr()
plt.figure(figsize = (12,6))
plt.title('Correlation Matrix')
seas.heatmap(smoking_corr_matrix, annot = True, fmt = '.4f', linewidths = 0.5)
plt.show()
Analysis of the Results
The likely reason for the poor performance of the machine learning model is that the correlation coefficient between smoking status and stroke diagnosis is close to 0, meaning the two variables are not linearly related. Additionally, more features are needed to prevent underfitting, as our model cannot generalize patterns well enough to predict the diagnosis.
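Since Pearson correlation only captures linear association, one optional sanity check (a sketch, our addition) is a chi-squared test of independence on the smoking/diagnosis contingency table, reusing the chi2_contingency function imported earlier:
# Optional sketch: chi-squared test of independence between smoking status and diagnosis
smoke_cont = pd.crosstab(young_adults['Smoking Status'], young_adults['Diagnosis'])
print(chi2_contingency(smoke_cont).pvalue)
A large p-value here would indicate that the lack of relationship is not merely a limitation of the linear correlation measure.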
print(young_adults)
(Output: the same young_adults DataFrame shown above, 2662 rows x 24 columns.)
Looking at the above data, we can see that glucose levels are continuous variables, i.e., they can take on any value within a range rather than being limited to discrete categories. It is therefore suitable to treat glucose levels as continuous predictors for regression analysis. This lets us model the relationship between glucose levels and stroke occurrence using logistic regression, which is well suited to analyzing the impact of continuous predictors on a binary outcome.
We will convert the provided data into a suitable format for logistic regression analysis by creating a dataset with glucose levels and stroke occurrences. The independent variable (predictor) will be the glucose levels, and the dependent variable (response) will be stroke occurrence, coded as 0 for no stroke and 1 for stroke. This will enable us to analyze how changes in glucose levels influence the likelihood of experiencing a stroke.
# Select the relevant columns; copy so we modify a new frame rather than a slice of young_adults
glucose_ml_df = young_adults[["Age", "Average Glucose Level", "Diagnosis"]].copy()

def stroke_conversion(status):
    if status == "No Stroke":
        return 0
    elif status == "Stroke":
        return 1

glucose_ml_df.loc[:, "Diagnosis"] = glucose_ml_df["Diagnosis"].apply(stroke_conversion).astype(int)
print(glucose_ml_df)
       Age  Average Glucose Level  Diagnosis
2       26                 189.00          1
12      30                 163.15          1
19      25                  71.38          1
25      24                  79.89          1
37      23                 164.72          1
...    ...                    ...        ...
14972   30                 126.94          0
14974   20                 101.36          1
14981   25                  77.64          1
14983   18                  68.26          1
14991   26                 145.05          0

[2662 rows x 3 columns]
Xg = glucose_ml_df[['Average Glucose Level', 'Age']]
Yg = glucose_ml_df['Diagnosis'] >= 1
Xg_train, Xg_test, Yg_train, Yg_test = train_test_split(Xg, Yg, test_size= 0.2, random_state = 42)
scaler = StandardScaler()
Xg_train = scaler.fit_transform(Xg_train)
Xg_test = scaler.transform(Xg_test)  # apply the scaling fit on the training data
glucose_ml_model = LogisticRegression()
glucose_ml_model.fit(Xg_train, Yg_train)
LogisticRegression()
pred_g = glucose_ml_model.predict(Xg_test)
accu_g = accuracy_score(Yg_test, pred_g)
print(classification_report(Yg_test, pred_g))
print(f"Accuracy Score: {accu_g}")
              precision    recall  f1-score   support

       False       0.52      0.52      0.52       267
        True       0.52      0.53      0.52       266

    accuracy                           0.52       533
   macro avg       0.52      0.52      0.52       533
weighted avg       0.52      0.52      0.52       533

Accuracy Score: 0.5215759849906192
Analysis of the Results (Precision, Recall, F1 Score, & Accuracy)
Given an accuracy of approximately 0.522 (52.2%), the precision, recall, and F1-scores for both classes (False and True) are very similar, indicating that the model performs equally well (or equally poorly) across both classes. The main takeaways: for precision, the model is correct on about 52% of the instances it predicts for each class; for recall, it correctly identifies about 52% of actual False (no stroke) instances and 53% of actual True (stroke) instances; and the F1-scores, as the harmonic mean of the two, show a consistent balance between precision and recall across both classes.
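Since roc_curve was imported at the start, one optional visualization (a sketch, our addition) is an ROC curve, which makes the weak class separation visible:
# Optional sketch: ROC curve for the logistic regression model
probs = glucose_ml_model.predict_proba(Xg_test)[:, 1]  # predicted probability of stroke
fpr, tpr, _ = roc_curve(Yg_test, probs)
plt.plot(fpr, tpr, label='Logistic Regression')
plt.plot([0, 1], [0, 1], linestyle='--', label='Chance')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve: Average Glucose Level and Age vs Stroke')
plt.legend()
plt.show()
A curve hugging the diagonal would confirm the near-chance discrimination suggested by the metrics above.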
glucose_correlation_matrix = glucose_ml_df.corr()
plt.figure(figsize = (12,6))
plt.title('Correlation Matrix for Logistic Regression Model on Average Glucose Levels vs Stroke Occurrence Based On Age')
seas.heatmap(glucose_correlation_matrix, annot = True, fmt = '.4f', xticklabels=["Age", "Average Glucose Level", "Diagnosis"], yticklabels=["Age", "Average Glucose Level", "Diagnosis"], linewidths = 0.5)
plt.show()
Analysis of the Results (Correlation Matrix)
For the correlation matrix, this is what the results depict:
1. Age vs. Average Glucose Level: the correlation coefficient is -0.0039, a very weak negative correlation. Essentially, there is almost no linear relationship between these two variables.
2. Age vs. Diagnosis: the correlation coefficient is -0.0197, a very weak negative correlation. This suggests that as age increases, there is a very slight decrease in the likelihood of having a stroke, but the relationship is mostly negligible.
3. Average Glucose Level vs. Diagnosis: the correlation coefficient is 0.0022, a very weak positive correlation. This suggests that higher glucose levels are slightly associated with an increased likelihood of having a stroke, but again the relationship is almost negligible.
4. The diagonal elements all have a value of 1.0000, the perfect positive correlation of each variable with itself, as expected.
Let us interpret the results of the correlation matrix to reach a conclusion. The values in the correlation matrix are close to zero for all pairings of the variables, indicating that there is no strong linear relationship between age, average glucose level, and the diagnosis of stroke in the dataset being analyzed. The weak correlations suggest that other factors might be more significant in predicting stroke occurrence and that age and average glucose levels alone are not strong predictors in this context. For building a more predictive model, we might need to include additional features or consider interactions between variables.
Interpretation
The logistic regression model appears to have limited predictive power based on blood glucose levels alone, as indicated by the similar performance metrics across both classes (False and True). This could imply that blood glucose levels alone might not be sufficiently predictive of stroke occurrence, or that additional features or model improvements are necessary to achieve higher accuracy and better differentiation between the classes.
Our analysis above demonstrated that we could not find a correlation between stroke diagnosis and the individual health factors. However, we can use the same data to test whether the individual factors, considered together, can predict the occurrence of a stroke. To do so, we will use four ML models to compare the features and generate predictions.
# Begin with the young_adults dataset
# We don't need the patient names or IDs for the analysis, so we can remove those
main_dataset = young_adults.copy()
main_dataset = main_dataset.drop(['Patient Name', 'Patient ID'], axis=1)
# In our earlier analysis, we reformatted the Stress Levels into discrete ranges
# For this, we can use the raw stress levels instead
main_dataset = main_dataset.drop(['New Stress Levels'], axis=1)
# We also do not need the diabetes category, so we can drop this as well
main_dataset = main_dataset.drop(['Category'], axis=1)
# To aid the analysis, we can split Cholesterol Levels into HDL and LDL levels
main_dataset[['Cholesterol Levels HDL', 'Cholesterol Levels LDL']] = main_dataset['Cholesterol Levels'].str.split(',', expand=True)
main_dataset = main_dataset.drop(['Cholesterol Levels'], axis=1)
# To make sure the values are treated as numbers, we can remove the "HDL: " and "LDL: " text
# and convert the columns to numerical values
main_dataset['Cholesterol Levels HDL'] = main_dataset['Cholesterol Levels HDL'].apply(lambda x: int(x[4:]))
main_dataset['Cholesterol Levels LDL'] = main_dataset['Cholesterol Levels LDL'].apply(lambda x: int(x[5:]))
# Next, to treat blood pressure levels as numbers, we can split it into the systolic and diastolic numbers
main_dataset[['Blood Pressure Systolic', 'Blood Pressure Diastolic']] = main_dataset['Blood Pressure Levels'].str.split('/', expand=True)
main_dataset = main_dataset.drop(['Blood Pressure Levels'], axis=1)
main_dataset['Blood Pressure Systolic'] = main_dataset['Blood Pressure Systolic'].apply(lambda x: int(x))
main_dataset['Blood Pressure Diastolic'] = main_dataset['Blood Pressure Diastolic'].apply(lambda x: int(x))
# To consider the impact of individual symptoms, we must one-hot encode the symptoms.
# To do so, we can split the Symptoms column into distinct symptoms, then run an encoder
all_symptoms = main_dataset['Symptoms'].str.get_dummies(sep=', ')
main_dataset = pd.concat([main_dataset, all_symptoms], axis=1).drop(columns=['Symptoms'])
# Finally, to feed our data to the random forest model, we must one-hot encode our remaining categorical variables
# Sex of the patient:
genders = main_dataset['Gender'].str.get_dummies()
main_dataset = pd.concat([main_dataset, genders], axis=1).drop(columns=['Gender'])
# Marital Status
marital_statuses = main_dataset['Marital Status'].str.get_dummies()
main_dataset = pd.concat([main_dataset, marital_statuses], axis=1).drop(columns=['Marital Status'])
# Work Type
work_types = main_dataset['Work Type'].str.get_dummies()
main_dataset = pd.concat([main_dataset, work_types], axis=1).drop(columns=['Work Type'])
# Residence Type
residence_types = main_dataset['Residence Type'].str.get_dummies()
main_dataset = pd.concat([main_dataset, residence_types], axis=1).drop(columns=['Residence Type'])
# Smoking Statuses
smoking_statuses = main_dataset['Smoking Status'].str.get_dummies()
main_dataset = pd.concat([main_dataset, smoking_statuses], axis=1).drop(columns=['Smoking Status'])
# Alcohol Intakes
alcohol_intakes = main_dataset['Alcohol Intake'].str.get_dummies()
main_dataset = pd.concat([main_dataset, alcohol_intakes], axis=1).drop(columns=['Alcohol Intake'])
# Dietary Habits
diets = main_dataset['Dietary Habits'].str.get_dummies()
main_dataset = pd.concat([main_dataset, diets], axis=1).drop(columns=['Dietary Habits'])
# Physical Activity
physical_activity_levels = main_dataset['Physical Activity'].str.get_dummies()
main_dataset = pd.concat([main_dataset, physical_activity_levels], axis=1).drop(columns=['Physical Activity'])
# Change column names to be more specific
main_dataset = main_dataset.rename(columns={"High": "High Physical Activity", "Low": "Low Physical Activity", "Moderate":"Moderate Physical Activity"})
# Family History of Stroke - replace Yes with 1 and No with 0
main_dataset['Family History of Stroke'].replace({'Yes': 1, 'No': 0}, inplace=True)
# Final cleaning to treat numerical values as integers/floats
main_dataset['Age'] = main_dataset['Age'].apply(lambda x: int(x))
main_dataset['Average Glucose Level'] = main_dataset['Average Glucose Level'].apply(lambda x: float(x))
main_dataset['Body Mass Index (BMI)'] = main_dataset['Body Mass Index (BMI)'].apply(lambda x: float(x))
main_dataset['Stress Levels'] = main_dataset['Stress Levels'].apply(lambda x: float(x))
#main_dataset.columns
main_dataset
|       | Age | Hypertension | Heart Disease | Average Glucose Level | Body Mass Index (BMI) | Stroke History | Family History of Stroke | Stress Levels | Diagnosis | Cholesterol Levels HDL | ... | Gluten-Free | Keto | Non-Vegetarian | Paleo | Pescatarian | Vegan | Vegetarian | High Physical Activity | Low Physical Activity | Moderate Physical Activity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 26 | 1 | 1 | 189.00 | 20.32 | 0 | 1 | 7.31 | Stroke | 59 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 12 | 30 | 0 | 1 | 163.15 | 19.36 | 0 | 1 | 9.19 | Stroke | 80 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 19 | 25 | 0 | 0 | 71.38 | 39.00 | 0 | 1 | 0.46 | Stroke | 72 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 25 | 24 | 0 | 0 | 79.89 | 17.58 | 1 | 0 | 6.48 | Stroke | 73 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 37 | 23 | 0 | 0 | 164.72 | 31.56 | 1 | 1 | 7.86 | Stroke | 30 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14972 | 30 | 0 | 1 | 126.94 | 36.08 | 0 | 0 | 9.51 | No Stroke | 55 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 14974 | 20 | 0 | 0 | 101.36 | 21.15 | 0 | 1 | 2.26 | Stroke | 42 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 14981 | 25 | 0 | 0 | 77.64 | 23.88 | 0 | 1 | 2.69 | Stroke | 58 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 14983 | 18 | 0 | 0 | 68.26 | 36.48 | 1 | 0 | 6.79 | Stroke | 59 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 14991 | 26 | 0 | 1 | 145.05 | 35.94 | 1 | 0 | 0.71 | No Stroke | 33 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |

2662 rows × 51 columns
# Now we can begin preparing the ML model
X_main_data = main_dataset.drop('Diagnosis', axis=1)
Y_main_data = main_dataset['Diagnosis']
x_main_train, x_main_test, y_main_train, y_main_test = train_test_split(X_main_data, Y_main_data, test_size=.2, random_state=4)
# Features of the model
print(x_main_train.columns)
Index(['Age', 'Hypertension', 'Heart Disease', 'Average Glucose Level', 'Body Mass Index (BMI)', 'Stroke History', 'Family History of Stroke', 'Stress Levels', 'Cholesterol Levels HDL', 'Cholesterol Levels LDL', 'Blood Pressure Systolic', 'Blood Pressure Diastolic', 'Blurred Vision', 'Confusion', 'Difficulty Speaking', 'Dizziness', 'Headache', 'Loss of Balance', 'Numbness', 'Seizures', 'Severe Fatigue', 'Weakness', 'Female', 'Male', 'Divorced', 'Married', 'Single', 'Government Job', 'Never Worked', 'Private', 'Self-employed', 'Rural', 'Urban', 'Currently Smokes', 'Formerly Smoked', 'Non-smoker', 'Frequent Drinker', 'Never', 'Rarely', 'Social Drinker', 'Gluten-Free', 'Keto', 'Non-Vegetarian', 'Paleo', 'Pescatarian', 'Vegan', 'Vegetarian', 'High Physical Activity', 'Low Physical Activity', 'Moderate Physical Activity'], dtype='object')
# Standardize the features
scaler = StandardScaler()
x_main_train = scaler.fit_transform(x_main_train)
x_main_test = scaler.transform(x_main_test)  # apply the scaling fit on the training data
# Models
models = {
'KNN': KNeighborsClassifier(),
'DecisionTree': DecisionTreeClassifier(),
'LogisticRegression':LogisticRegression(max_iter=1000000),
'RandomForest': RandomForestClassifier()
}
# Apply stratified K-fold cross-validation for a more reliable performance estimate
skf = StratifiedKFold(n_splits= 5, shuffle=True, random_state=42)
for model_name, model in models.items():
    accuracy = cross_val_score(model, x_main_train, y_main_train, cv=skf)
    # Display the mean and standard deviation of the cross-validation accuracy
    print(f"{model_name} \nMean: {accuracy.mean()} \nStandard Deviation: {accuracy.std()}\n")
KNN
Mean: 0.4894194973764153
Standard Deviation: 0.023114698595755705

DecisionTree
Mean: 0.5025893399613366
Standard Deviation: 0.02539913400661669

LogisticRegression
Mean: 0.5049400718033693
Standard Deviation: 0.024654173650678273

RandomForest
Mean: 0.512448494890914
Standard Deviation: 0.016121716175018355
# Train each model using the training data
for model_name, model in models.items():
    model.fit(x_main_train, y_main_train)

# Evaluate the performance of each model
for model_name, model in models.items():
    predicted = model.predict(x_main_test)
    print(f"Accuracy of {model_name}: {accuracy_score(y_main_test, predicted)}")
    print(
        f"Classification report for classifier {model_name}:\n"
        f"{classification_report(y_main_test, predicted)}\n"
    )
Accuracy of KNN: 0.46904315196998125
Classification report for classifier KNN:
              precision    recall  f1-score   support

   No Stroke       0.47      0.51      0.49       263
      Stroke       0.47      0.43      0.45       270

    accuracy                           0.47       533
   macro avg       0.47      0.47      0.47       533
weighted avg       0.47      0.47      0.47       533

Accuracy of DecisionTree: 0.4727954971857411
Classification report for classifier DecisionTree:
              precision    recall  f1-score   support

   No Stroke       0.46      0.44      0.45       263
      Stroke       0.48      0.51      0.49       270

    accuracy                           0.47       533
   macro avg       0.47      0.47      0.47       533
weighted avg       0.47      0.47      0.47       533

Accuracy of LogisticRegression: 0.46904315196998125
Classification report for classifier LogisticRegression:
              precision    recall  f1-score   support

   No Stroke       0.46      0.50      0.48       263
      Stroke       0.47      0.44      0.45       270

    accuracy                           0.47       533
   macro avg       0.47      0.47      0.47       533
weighted avg       0.47      0.47      0.47       533

Accuracy of RandomForest: 0.5196998123827392
Classification report for classifier RandomForest:
              precision    recall  f1-score   support

   No Stroke       0.51      0.54      0.53       263
      Stroke       0.53      0.50      0.51       270

    accuracy                           0.52       533
   macro avg       0.52      0.52      0.52       533
weighted avg       0.52      0.52      0.52       533
Analysis of the Results
Although the Random Forest model performed a little better than the other models, overall all models performed poorly at predicting the diagnosis of stroke, with accuracy close to 0.5. Additionally, each model performed equally poorly on precision, recall, and the overall f1-scores.
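Even with near-chance accuracy, it can be informative to see which inputs the best-performing model leaned on. A sketch (our addition; it assumes the RandomForest in `models` has been fit in the loop above):
# Optional sketch: feature importances from the fitted Random Forest
importances = pd.Series(models['RandomForest'].feature_importances_, index=X_main_data.columns)
print(importances.sort_values(ascending=False).head(10))
This kind of inspection would also inform the feature-pruning option discussed in the conclusions below.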
# Visualize the performance using a confusion matrix
# (`predicted` holds the predictions of the last model in the loop above, the Random Forest)
maindata_confusion_matrix = confusion_matrix(y_main_test, predicted)
plt.figure(figsize=(12, 6))
plt.xlabel('Prediction')
plt.ylabel('Actual')
plt.title('Confusion Matrix for the prediction of the main dataset on the diagnosis of stroke')
seas.heatmap(maindata_confusion_matrix, annot=True, fmt="d", xticklabels=["No Stroke", "Stroke"], yticklabels=["No Stroke", "Stroke"])
plt.show()
Based on the linear regression analysis, there is a very weak negative correlation between stress levels and stroke incidence in young adults. The R-squared value is very low, implying that stress levels do not significantly explain the variation in stroke incidence based on the given data.
The Decision Tree model performed poorly when given only the alcohol consumption levels to analyze. This demonstrates that alcohol consumption alone is not enough to predict a stroke. As such, to improve an ML model's ability to predict the occurrence of a stroke, more features must be considered and more training must be conducted. This is in alignment with the statistical analysis performed on the dataset, which was unable to find a statistical relationship between alcohol consumption and the occurrence of a stroke.
The Random Forest model did not perform well at predicting stroke from smoking status. Upon further examination, the correlation coefficient between these two variables was close to 0, indicating no linear relationship between them. Additionally, more features are needed to prevent underfitting, as our model is too simple and cannot generalize the patterns needed to predict the diagnosis. As a result, smoking status alone cannot predict the diagnosis of stroke.
Based on the logistic regression analysis, the model demonstrates only weak performance in predicting stroke occurrence from blood glucose levels. The precision, recall, and F1-score metrics consistently hover around 0.52 for both the True and False classes, suggesting the model's ability to distinguish between instances of stroke and no stroke is limited. The balanced performance metrics indicate that blood glucose levels alone may not be sufficiently predictive of stroke occurrence.
Overall, the KNN, Decision Tree, Logistic Regression, and Random Forest classifiers all show balanced but weak performance, with accuracies around 0.5. As such, we can conclude that a new approach may be needed to generate more accurate predictions. This could mean including more features, more precise features (such as the exact type of physical activity, rather than the Low/Moderate/High scale), or, conversely, feature pruning to determine the most important features and discard the rest.
In conclusion, our machine learning models performed poorly overall, with low accuracy in predicting the diagnosis of stroke. The evaluation results suggest that our approach of focusing on four significant individual features does not adequately reflect the complex relationships between the various lifestyle factors that influence stroke risk in young adults. This poor performance suggests two possibilities. First, these factors may not be the main causes of increased stroke risk in young people, and another factor, such as family history of stroke, might be more significant. Second, we need a more comprehensive approach, considering more features, to better understand and predict stroke risk in young adults.
Although our study didn't provide definitive answers, it has raised important questions and highlighted areas needing further investigation. To address the shortcomings of our models, we suggest introducing new features that represent the interactions between the existing features. Interaction features can better capture the complexity of diverse lifestyles, allowing the machine learning models to learn the combined effects of the individual features, which should be more effective in predicting the diagnosis of stroke.
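As a hedged sketch of this suggestion (PolynomialFeatures is one possible tool, our addition; the exact feature set would need tuning and pruning), pairwise interaction terms could be generated before retraining the models:
# Optional sketch: generate pairwise interaction features before retraining the models
from sklearn.preprocessing import PolynomialFeatures
interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interactions = interactions.fit_transform(X_main_data)
print(X_interactions.shape)  # the original features plus all pairwise products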
Board on Children, Youth, and Families; Institute of Medicine; National Research Council. Improving the Health, Safety, and Well-Being of Young Adults: Workshop Summary. Washington (DC): National Academies Press (US); 2013 Sep 27. Available from: https://www.ncbi.nlm.nih.gov/books/NBK202207/ doi: 10.17226/18340
Cleveland Clinic. (n.d.). A1C. Retrieved June 15, 2024, from https://my.clevelandclinic.org/health/diagnostics/9731-a1c