**How do lifestyle factors such as stress, alcohol consumption, smoking, and unhealthy eating behaviors correlate with the rising stroke incidence in young adults?**

¶

Summer 2024 Data Science Project

¶

Team Members:

  1. Manasi Dixit - U_ID: 120376153
  2. Simran Sidhu - U_ID: 115424869
  3. Soomin Joh - U_ID: 120412141
  4. Kristen Nguyen (Thi Nguyen) - U_ID: 119404096

**Table of Contents:**

¶
  1. Introduction
  2. Data Curation and Exploratory Data Analysis
    1. Stress Data
    2. Alcohol Consumption Data
    3. Smoking Data
    4. Unhealthy Eating Behaviors Data
    5. Stroke Data
  3. Machine Learning Analysis and Visualization
  4. Insights and Conclusions
  5. References

**I. Introduction**

¶

The number of strokes in young adults has been rising in recent years, and it recently becomes one of the most significant public health concerns. In the past, strokes mainly affected older adults, but strokes are now increasingly seen in young adults. This raises questions about the causes. Therefore, the objective of this study is to answer the question: "How do lifestyle factors such as stress, alcohol consumption, smoking, and unhealthy eating behaviors correlate with the rising stroke incidence in young adults?". Stress significantly impacts heart health leading to conditions that raise stroke risk. Similarly, alcohol consumption and smoking are also known risk factors for many health problems including stroke. As for unhealthy eating habits, this factor is related to high average glucose levels, and it contributes to diabetes, which is a common stroke risk factor. Through this analysis, we hope to provide valuable insights into preventing and managing strokes in young adults.

**II. Data Curation and Exploratory Data Analysis**

¶

The first step in our process will be to import several relevant Python libraries for this study.

In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as seas
import scipy as spy
from scipy.stats import chi2_contingency
from scipy.stats import f_oneway
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve

According to "Improving the Health, Safety, and Well-Being of Young Adults: Workshop Summary", which is the summary of a workshop hosted by the Board on Children, Youth, and Families of the Institute of Medicine (IOM) and the National Research Council (NRC) in May, 2013, in the United States, young adulthood typically starts with high school graduation around age 18 and can extend into the late 20s or early 30s. Therefore, we will use data of people who are considered young adults (between the ages 18-30).

**1. Stress Data:**

¶

Source: https://www.kaggle.com/datasets/shashwatwork/depression-and-mental-health-data-analysis

**Overall Summary For This Dataset:**
¶

This dataset contains 824 rows and 13 columns offering a broad range of variables related to stress. It includes information on factors such as age, gender, occupation, number of days the participant has stayed indoors, whether their stress is increasing daily, frustrations during the first two weeks of quarantine, significant changes in eating and sleeping habits, their history of mental disorders in the previous generation, changes in body weight during quarantine, extreme mood changes, difficulty coping with daily problems or stress, loss of interest in work, and feelings of mental weakness when interacting with others. These factors are crucial for our project as they help us demonstrate that young adults aged approximately 18 to 30 experience increasing stress levels.

**1A. Data Preprocessing**

¶

Before exploring stress data, we need to load in stress dataset into a variable called stress_df.

In [ ]:
stress_df = pd.read_csv('/content/drive/MyDrive/CSV/mental_health_finaldata_1.csv')
print(stress_df)
          Age  Gender Occupation        Days_Indoors Growing_Stress  \
0       20-25  Female  Corporate           1-14 days            Yes   
1    30-Above    Male     Others          31-60 days            Yes   
2    30-Above  Female    Student    Go out Every day             No   
3       25-30    Male     Others           1-14 days            Yes   
4       16-20  Female    Student  More than 2 months            Yes   
..        ...     ...        ...                 ...            ...   
819     20-25    Male  Corporate    Go out Every day             No   
820     20-25    Male     Others           1-14 days            Yes   
821     20-25    Male    Student  More than 2 months            Yes   
822     16-20    Male   Business          15-30 days             No   
823  30-Above  Female     Others          15-30 days             No   

    Quarantine_Frustrations Changes_Habits Mental_Health_History  \
0                       Yes             No                   Yes   
1                       Yes          Maybe                    No   
2                        No            Yes                    No   
3                        No          Maybe                    No   
4                       Yes            Yes                    No   
..                      ...            ...                   ...   
819                     Yes             No                   Yes   
820                     Yes             No                   Yes   
821                   Maybe          Maybe                    No   
822                      No          Maybe                    No   
823                      No             No                    No   

    Weight_Change Mood_Swings Coping_Struggles Work_Interest Social_Weakness  
0             Yes      Medium               No            No             Yes  
1              No        High               No            No             Yes  
2              No      Medium              Yes         Maybe              No  
3           Maybe      Medium               No         Maybe             Yes  
4             Yes      Medium              Yes         Maybe              No  
..            ...         ...              ...           ...             ...  
819           Yes      Medium               No           Yes           Maybe  
820         Maybe         Low               No         Maybe           Maybe  
821           Yes        High              Yes           Yes           Maybe  
822         Maybe         Low              Yes            No           Maybe  
823           Yes         Low              Yes            No           Maybe  

[824 rows x 13 columns]

**1B. Data Exploration**

¶

ANOVA is used when comparing means among three or more groups to determine if there are significant differences among them. This dataset contains four age groups: 16-20, 20-25, 25-30, and 30-Above. Therefore, we will apply ANOVA Test on "whether the age group has a correlation with growing stress". We also assume that the alpha value is 0.05. These are Null Hypothesis and Alternate Hypothesis for the Anova Test:

$H_{0}$: The age group does not have an effect on the likelihood of growing stress.

$H_{A}$: The age group does have an effect on the likelihood of growing stress.

In [ ]:
stress_table2 = pd.crosstab(stress_df['Growing_Stress'], stress_df['Age'])
statistic, p_value = f_oneway(stress_table2['16-20'], stress_table2['20-25'], stress_table2['25-30'], stress_table2['30-Above'])
print(stress_table2)
print("P-Value:", p_value)
Age             16-20  20-25  25-30  30-Above
Growing_Stress                               
Maybe              60     63     66        78
No                 75     47     65        69
Yes                76     76     74        75
P-Value: 0.4821995988536243

Besides ANOVA Test, we can use Descriptive Statistics method to observe "Growing_Stress" for each age group.

In [ ]:
stress_table3 = pd.crosstab(stress_df['Age'], stress_df['Growing_Stress'])
descriptive_stats = stress_table3.describe()
print(stress_table3)
print(descriptive_stats)
Growing_Stress  Maybe  No  Yes
Age                           
16-20              60  75   76
20-25              63  47   76
25-30              66  65   74
30-Above           78  69   75
Growing_Stress      Maybe         No        Yes
count            4.000000   4.000000   4.000000
mean            66.750000  64.000000  75.250000
std              7.889867  12.055428   0.957427
min             60.000000  47.000000  74.000000
25%             62.250000  60.500000  74.750000
50%             64.500000  67.000000  75.500000
75%             69.000000  70.500000  76.000000
max             78.000000  75.000000  76.000000

We create a graph using the matplotlib showing the realtion between the age groups and growing stress below.

In [ ]:
stress_table2.plot(kind='bar', colormap='Paired')
plt.ylabel('Count')
Out[ ]:
Text(0, 0.5, 'Count')

Look at the above Descriptive Statistics table, we have:

Count:

Each stress category ("Maybe", "No", "Yes") has 4 data points (4 age groups).

Mean:

The average count of individuals reporting "Maybe" stress is 66.75.

The average count of individuals reporting "No" stress is 64.00.

The average count of individuals reporting "Yes" stress is 75.25.

Standard Deviation (std):

The standard deviation for "Maybe" stress is 7.89 indicating moderate variability.

The standard deviation for "No" stress is 12.06 indicating higher variability compared to "Maybe" and "Yes".

The standard deviation for "Yes" stress is 0.96 indicating very low variability.

Minimum (min):

The minimum count for "Maybe" stress is 60.

The minimum count for "No" stress is 47.

The minimum count for "Yes" stress is 74.

25%:

25% of the data for "Maybe" stress is less than or equal to 62.25.

25% of the data for "No" stress is less than or equal to 60.50.

25% of the data for "Yes" stress is less than or equal to 74.75.

Median (50%):

The median for "Maybe" stress is 64.50.

The median for "No" stress is 67.00.

The median for "Yes" stress is 75.50.

75%:

75% of the data for "Maybe" stress is less than or equal to 69.00.

75% of the data for "No" stress is less than or equal to 70.50.

75% of the data for "Yes" stress is less than or equal to 76.00.

Maximum (max):

The maximum count for "Maybe" stress is 78.

The maximum count for "No" stress is 75.

The maximum count for "Yes" stress is 76.

Summary statistics:

The "Maybe" and "Yes" stress levels have higher average counts compared to "No" stress. The "No" stress level shows the highest variability indicating that the counts are more spread out across the age groups. The "Yes" stress level has the highest average count and the lowest variability suggesting that a consistent number of individuals across age groups report high stress. These statistics provide a clear picture of how stress levels vary among different age groups.

Because p-value > alpha value (0.4821995988536243 > 0.05), we fail to reject the null hypothesis. There is not enough evidence to suggest that the age group has an effect on the likelihood of growing stress. We create a plot to visualize the relationship between "Age Groups" and "Growing Stress":

In [ ]:
plt.figure(figsize=(8, 6))
seas.violinplot(x='Growing_Stress', y='Age', data=stress_df, inner='quartile')
plt.ylabel('Growing Stress')
plt.xlabel('Age Group')
plt.title(f'Age Groups vs Growing Stress (Contingency Table)')
plt.grid(True)
plt.show()

**2. Alcohol Consumption Data:**

¶

Source: https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1310009611

**Overall Summary For This Dataset:**
¶

This dataset provides an overview of heavy drinking rates across age groups in Canada, with heavy drinking defined as 5 or more drinks in males or 4 or more drinks in females on a single occasion at least once per month in the past year. More specifically, the data includes the age groups 12-17 years old and 18-34 years old, which overlaps with our focus group of people aged 15 to 30 years old. For each age group, the data includes the number of people who reported heavy drinking in that year and that number expressed as a percentage of that age group. Because the data was sourced through the Canadian Community Health Survey, the sample size is sufficiently large, and the participants are spread out geographically, which in turn increases the diversity of participants. Given that the data ranges from 2015 to 2022, we can analyze how drinking habits within each age group have changed over time, contributing to our overall analysis of how the risk of strokes in young people has changed in recent years. Finally, the data can be further divided between males and females, providing insight into whether the sex of a participant, combined with their alcohol use, impacts the likelihood of a stroke.

**2A. Data Preprocessing**

¶

To begin, we create a variable called alcohol_df that will store the raw data from the alcohol_consumption.csv file.

In [ ]:
alcohol_df = pd.read_csv('/content/drive/MyDrive/CSV/alcohol_consumption.csv')
print(alcohol_df)
    REF_DATE                             GEO  DGUID                 Age group  \
0       2015  Canada (excluding territories)    NaN  Total, 12 years and over   
1       2016  Canada (excluding territories)    NaN  Total, 12 years and over   
2       2017  Canada (excluding territories)    NaN  Total, 12 years and over   
3       2018  Canada (excluding territories)    NaN  Total, 12 years and over   
4       2019  Canada (excluding territories)    NaN  Total, 12 years and over   
..       ...                             ...    ...                       ...   
91      2018  Canada (excluding territories)    NaN         65 years and over   
92      2019  Canada (excluding territories)    NaN         65 years and over   
93      2020  Canada (excluding territories)    NaN         65 years and over   
94      2021  Canada (excluding territories)    NaN         65 years and over   
95      2022  Canada (excluding territories)    NaN         65 years and over   

           Sex      Indicators    Characteristics      UOM  UOM_ID  \
0   Both sexes  Heavy drinking  Number of persons   Number     223   
1   Both sexes  Heavy drinking  Number of persons   Number     223   
2   Both sexes  Heavy drinking  Number of persons   Number     223   
3   Both sexes  Heavy drinking  Number of persons   Number     223   
4   Both sexes  Heavy drinking  Number of persons   Number     223   
..         ...             ...                ...      ...     ...   
91  Both sexes  Heavy drinking            Percent  Percent     239   
92  Both sexes  Heavy drinking            Percent  Percent     239   
93  Both sexes  Heavy drinking            Percent  Percent     239   
94  Both sexes  Heavy drinking            Percent  Percent     239   
95  Both sexes  Heavy drinking            Percent  Percent     239   

   SCALAR_FACTOR  SCALAR_ID      VECTOR  COORDINATE      VALUE STATUS  SYMBOL  \
0          units          0  v110787655  1.1.1.17.1  5782800.0    NaN     NaN   
1          units          0  v110787655  1.1.1.17.1  5770900.0    NaN     NaN   
2          units          0  v110787655  1.1.1.17.1  6015500.0    NaN     NaN   
3          units          0  v110787655  1.1.1.17.1  5946400.0    NaN     NaN   
4          units          0  v110787655  1.1.1.17.1  5802200.0    NaN     NaN   
..           ...        ...         ...         ...        ...    ...     ...   
91         units          0  v110790388  1.6.1.17.4        7.4    NaN     NaN   
92         units          0  v110790388  1.6.1.17.4        7.6    NaN     NaN   
93         units          0  v110790388  1.6.1.17.4        7.4    NaN     NaN   
94         units          0  v110790388  1.6.1.17.4        7.9    NaN     NaN   
95         units          0  v110790388  1.6.1.17.4       10.0    NaN     NaN   

    TERMINATED  DECIMALS  
0          NaN         0  
1          NaN         0  
2          NaN         0  
3          NaN         0  
4          NaN         0  
..         ...       ...  
91         NaN         1  
92         NaN         1  
93         NaN         1  
94         NaN         1  
95         NaN         1  

[96 rows x 18 columns]

Next, we will explore the data to determine if there are any columns that can be deleted.

In [ ]:
alcohol_df['SCALAR_ID'].unique() # this output shows us that the entire column has the same value
Out[ ]:
array([0])
In [ ]:
# Since the SCALAR_ID column doesn't provide information for the dataset, we can remove it
alcohol_df = alcohol_df.drop(columns=['SCALAR_ID'])

# Since the STATUS, SYMBOL, AND TERMINATED columns are completely empty, we can remove them as well
alcohol_df = alcohol_df.drop(columns=['STATUS', 'SYMBOL', 'TERMINATED'])

# Since the UOM_ID (unit of measure ID), DGUID , SCALAR_FACTOR, VECTOR, DECIMALS and COORDINATE
# columns don't provide data necessary to our analysis, we will remove them as well
alcohol_df = alcohol_df.drop(columns=['UOM_ID', 'DGUID', 'SCALAR_FACTOR', 'VECTOR', 'DECIMALS','COORDINATE'])
In [ ]:
print(alcohol_df)
    REF_DATE                             GEO                 Age group  \
0       2015  Canada (excluding territories)  Total, 12 years and over   
1       2016  Canada (excluding territories)  Total, 12 years and over   
2       2017  Canada (excluding territories)  Total, 12 years and over   
3       2018  Canada (excluding territories)  Total, 12 years and over   
4       2019  Canada (excluding territories)  Total, 12 years and over   
..       ...                             ...                       ...   
91      2018  Canada (excluding territories)         65 years and over   
92      2019  Canada (excluding territories)         65 years and over   
93      2020  Canada (excluding territories)         65 years and over   
94      2021  Canada (excluding territories)         65 years and over   
95      2022  Canada (excluding territories)         65 years and over   

           Sex      Indicators    Characteristics      UOM      VALUE  
0   Both sexes  Heavy drinking  Number of persons   Number  5782800.0  
1   Both sexes  Heavy drinking  Number of persons   Number  5770900.0  
2   Both sexes  Heavy drinking  Number of persons   Number  6015500.0  
3   Both sexes  Heavy drinking  Number of persons   Number  5946400.0  
4   Both sexes  Heavy drinking  Number of persons   Number  5802200.0  
..         ...             ...                ...      ...        ...  
91  Both sexes  Heavy drinking            Percent  Percent        7.4  
92  Both sexes  Heavy drinking            Percent  Percent        7.6  
93  Both sexes  Heavy drinking            Percent  Percent        7.4  
94  Both sexes  Heavy drinking            Percent  Percent        7.9  
95  Both sexes  Heavy drinking            Percent  Percent       10.0  

[96 rows x 8 columns]
In [ ]:
# Upon further examination of the data, we notice that the Characteristics and UOM columns provide similar data
print(alcohol_df['Characteristics'].unique())
print(alcohol_df['UOM'].unique())

# Given that UOM is more concise, we will keep UOM and drop Characteristics
alcohol_df = alcohol_df.drop(columns=['Characteristics'])
print(alcohol_df)
['Number of persons' 'Percent']
['Number' 'Percent']
    REF_DATE                             GEO                 Age group  \
0       2015  Canada (excluding territories)  Total, 12 years and over   
1       2016  Canada (excluding territories)  Total, 12 years and over   
2       2017  Canada (excluding territories)  Total, 12 years and over   
3       2018  Canada (excluding territories)  Total, 12 years and over   
4       2019  Canada (excluding territories)  Total, 12 years and over   
..       ...                             ...                       ...   
91      2018  Canada (excluding territories)         65 years and over   
92      2019  Canada (excluding territories)         65 years and over   
93      2020  Canada (excluding territories)         65 years and over   
94      2021  Canada (excluding territories)         65 years and over   
95      2022  Canada (excluding territories)         65 years and over   

           Sex      Indicators      UOM      VALUE  
0   Both sexes  Heavy drinking   Number  5782800.0  
1   Both sexes  Heavy drinking   Number  5770900.0  
2   Both sexes  Heavy drinking   Number  6015500.0  
3   Both sexes  Heavy drinking   Number  5946400.0  
4   Both sexes  Heavy drinking   Number  5802200.0  
..         ...             ...      ...        ...  
91  Both sexes  Heavy drinking  Percent        7.4  
92  Both sexes  Heavy drinking  Percent        7.6  
93  Both sexes  Heavy drinking  Percent        7.4  
94  Both sexes  Heavy drinking  Percent        7.9  
95  Both sexes  Heavy drinking  Percent       10.0  

[96 rows x 7 columns]
In [ ]:
# Finally, we will check for duplicates
duplicated = alcohol_df[alcohol_df.duplicated()]
num_of_duplicates_alc = alcohol_df.duplicated().sum()

print(num_of_duplicates_alc)
0

Since we have removed any unnecessary columns and confirmed that there are no duplicates, we can begin our analysis of the data!

**2B. Data Exploration**

¶

To analyze the data, we can first copy the dataframe and limit it to just the rows expressing the data as a percent of the age group. Then, we can visualize the relationship between the age group and the percent of that age group reporting heavy drinking across all years.

In [ ]:
alcohol_df_percents = alcohol_df.copy()
alcohol_df_percents = alcohol_df_percents[alcohol_df_percents['UOM'] == 'Percent']

# To improve the graph, remove the rows with the age group as 'Total'
alcohol_df_percents = alcohol_df_percents[alcohol_df_percents['Age group'] != 'Total, 12 years and over']


df_pivot = alcohol_df_percents.pivot(index='REF_DATE', columns='Age group', values='VALUE')

for column in df_pivot.columns:
    plt.plot(df_pivot.index, df_pivot[column], marker='o', label=column)

plt.title('Percent of Age Group Reporting Heavy Drinking, Over Years')
plt.xlabel('Year')
plt.ylabel('Percent Reporting Heavy Drinking')
plt.legend(title='Age Group', bbox_to_anchor=(1.05, 1.0), loc='upper left')
plt.grid(True)
plt.show()

Hypothesis Testing:

ANOVA tests are used to compare means between 3 or more groups. In this case, there are 5 different age groups to compare: people aged 12-17 years old, 18-34 years old, 35-49 years old, 50-64 years old, and 65 years old and older. We can use the ANOVA test to determine if there is a statistically significant difference in the mean heavy drinking rate reported across years. We will use a significance value, or alpha value, of 0.05. To conduct the test, our Null and Alternative Hypotheses are as follows:

$H_{0}$: The age group does not have an effect on the mean reported rate of heavy drinking.

$H_{A}$: The age group does have an effect on the mean reported rate of heavy drinking.

In [ ]:
# To begin, we will create arrays storing the numeric percents associated with each age group
age_12_17 = alcohol_df_percents[alcohol_df_percents['Age group'] == '12 to 17 years']['VALUE']
age_18_34 = alcohol_df_percents[alcohol_df_percents['Age group'] == '18 to 34 years']['VALUE']
age_35_49 = alcohol_df_percents[alcohol_df_percents['Age group'] == '35 to 49 years']['VALUE']
age_50_64 = alcohol_df_percents[alcohol_df_percents['Age group'] == '50 to 64 years']['VALUE']
age_65_above = alcohol_df_percents[alcohol_df_percents['Age group'] == '65 years and over']['VALUE']

results = f_oneway(age_12_17,age_18_34,age_35_49, age_50_64, age_65_above)
print(results.pvalue)
6.293707068743887e-24

Given that the resulting p-value is less than the value for alpha, we can reject the null hypothesis. As such, we can conclude that there exists a difference among age groups regarding their average reported drinking rates.

**3. Smoking Data:**

¶

Source: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

**Overall Summary For This Dataset:**
¶

This dataset has a large sample size of 5,110 people, reflecting a diverse population from rural and urban areas. The dataset contains various demographic, physiological, and behavioral factors that provide information about the samples' health problems and lifestyle choices. These attributes include gender, age, marital status, occupation, type of residence, medical conditions (such as hypertension and heart disease), BMI, and smoking status. The dataset's age range is 0.08 years (roughly eight months) to 82 years, ensuring enough observations from the young adult age group (18-30 years). The smoking status variable is extensive, with options such as "formerly smoked," "never smoked," "smokes," and "unknown," providing helpful information about smoking frequency and history. This precise categorization will allow for an in-depth assessment of how different smoking habits affect the chance of a stroke in young adults.

**3A. Data Preprocessing**

¶

Step 1. Load the initial dataset containing all the samples.

In [ ]:
smoking_df = pd.read_csv('/content/drive/MyDrive/CSV/smoking_stroke.csv')
print(smoking_df)
         id  gender   age  hypertension  heart_disease ever_married  \
0      9046    Male  67.0             0              1          Yes   
1     51676  Female  61.0             0              0          Yes   
2     31112    Male  80.0             0              1          Yes   
3     60182  Female  49.0             0              0          Yes   
4      1665  Female  79.0             1              0          Yes   
...     ...     ...   ...           ...            ...          ...   
5105  18234  Female  80.0             1              0          Yes   
5106  44873  Female  81.0             0              0          Yes   
5107  19723  Female  35.0             0              0          Yes   
5108  37544    Male  51.0             0              0          Yes   
5109  44679  Female  44.0             0              0          Yes   

          work_type Residence_type  avg_glucose_level   bmi   smoking_status  \
0           Private          Urban             228.69  36.6  formerly smoked   
1     Self-employed          Rural             202.21   NaN     never smoked   
2           Private          Rural             105.92  32.5     never smoked   
3           Private          Urban             171.23  34.4           smokes   
4     Self-employed          Rural             174.12  24.0     never smoked   
...             ...            ...                ...   ...              ...   
5105        Private          Urban              83.75   NaN     never smoked   
5106  Self-employed          Urban             125.20  40.0     never smoked   
5107  Self-employed          Rural              82.99  30.6     never smoked   
5108        Private          Rural             166.29  25.6  formerly smoked   
5109       Govt_job          Urban              85.28  26.2          Unknown   

      stroke  
0          1  
1          1  
2          1  
3          1  
4          1  
...      ...  
5105       0  
5106       0  
5107       0  
5108       0  
5109       0  

[5110 rows x 12 columns]

Step 2. Data Cleaning process -> delete duplicates and determine what to do with unknown data (is it MNAR or MCAR/MAR)

In [ ]:
# 1. Check for exact duplicates -> None Existing
print(smoking_df.shape)
smoking_df.drop_duplicates()
print(smoking_df.shape, "\n")

# 2. Check for missing values -> BMI category is not used for this exploration thus can be ignored
missing_values = smoking_df.isnull().sum()
print(missing_values)
(5110, 12)
(5110, 12) 

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

Our data set has 5110 individual responses with no duplicates and all answers with unique ID numbers. Since for we want to explore the correlation between the smoking status and stroke diagnosis we will disregard the missing data in the bmi column.

**3B. Data Exploration**

¶

Step 1. Define the objective: Does smoking status have an impact on the diagnosis of stroke? (For this dataset all young adults between 18 - 30 had no diagnosis of stroke thus we will explore the correlation for all ages)

In [ ]:
print(smoking_df)
         id  gender   age  hypertension  heart_disease ever_married  \
0      9046    Male  67.0             0              1          Yes   
1     51676  Female  61.0             0              0          Yes   
2     31112    Male  80.0             0              1          Yes   
3     60182  Female  49.0             0              0          Yes   
4      1665  Female  79.0             1              0          Yes   
...     ...     ...   ...           ...            ...          ...   
5105  18234  Female  80.0             1              0          Yes   
5106  44873  Female  81.0             0              0          Yes   
5107  19723  Female  35.0             0              0          Yes   
5108  37544    Male  51.0             0              0          Yes   
5109  44679  Female  44.0             0              0          Yes   

          work_type Residence_type  avg_glucose_level   bmi   smoking_status  \
0           Private          Urban             228.69  36.6  formerly smoked   
1     Self-employed          Rural             202.21   NaN     never smoked   
2           Private          Rural             105.92  32.5     never smoked   
3           Private          Urban             171.23  34.4           smokes   
4     Self-employed          Rural             174.12  24.0     never smoked   
...             ...            ...                ...   ...              ...   
5105        Private          Urban              83.75   NaN     never smoked   
5106  Self-employed          Urban             125.20  40.0     never smoked   
5107  Self-employed          Rural              82.99  30.6     never smoked   
5108        Private          Rural             166.29  25.6  formerly smoked   
5109       Govt_job          Urban              85.28  26.2          Unknown   

      stroke  
0          1  
1          1  
2          1  
3          1  
4          1  
...      ...  
5105       0  
5106       0  
5107       0  
5108       0  
5109       0  

[5110 rows x 12 columns]

Step 2. Explore the filtered dataset for enough comprehension of the data.

In [ ]:
# 0. Explore the columns and rows of our filtered data
print("Num of Rows and Columns : " , smoking_df.shape)

# 1. Explore the columns of the data
print("\nColumns : ", smoking_df.columns)

# 2. Explore the data types of the columns -> age : float64 smoking_status : object (string)
print("\nData types :\n" , smoking_df.dtypes )

# 3. Explore the categories of smoking_status
print("\nCategories of smoking status: ", smoking_df["smoking_status"].unique())

# 4. Check for unknown response for smoking status -> more than 25% refused to respond MNAR
unknown_smoking_status = smoking_df[smoking_df["smoking_status"] == "Unknown"]
unknown_smoking_status
print("\nMissing Num of Rows and Columns :", unknown_smoking_status.shape)
Num of Rows and Columns :  (5110, 12)

Columns :  Index(['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
       'smoking_status', 'stroke'],
      dtype='object')

Data types :
 id                     int64
gender                object
age                  float64
hypertension           int64
heart_disease          int64
ever_married          object
work_type             object
Residence_type        object
avg_glucose_level    float64
bmi                  float64
smoking_status        object
stroke                 int64
dtype: object

Categories of smoking status:  ['formerly smoked' 'never smoked' 'smokes' 'Unknown']

Missing Num of Rows and Columns : (1544, 12)

Our categorical values for the smoking_status column are "Unknown," "formerly smoked," "never smoked," and "smokes." This data is assumed to be data missing not at random or MNAR, since some of the individuals did not wish to disclose their smoking behaviors. If these individuals were less than 1% of our data, we would simply drop them; however, they were approximately 30% of our data. Thus, we have decided to consider Unknown as a new category to study the correlation between smoking status and stroke diagnosis.

Step 3. Hypothesis Testing Using ANOVA to test the correlation between "smoking_status" and "stroke" for young adults.

< Hypothesis >

  • $H_{0}$: Smoking status does not have an effect on the likelihood of stroke.

  • $H_{A}$: Smoking status does have an effect on the likelihood of stroke.

Step 3.1 Since we have categorical values for smoking status we will convert them to numeric values before using ANOVA

In [ ]:
# Convert the smoking status to numeric values
def convert_status (status) :
  if status == "never smoked" :
    return 0
  elif status == "formerly smoked" :
    return 1
  elif status == "smokes" :
    return 2
  elif status == "Unknown" :
    return -1

smoking_df["smoking_status"] = smoking_df["smoking_status"].apply(convert_status)
print(smoking_df)
         id  gender   age  hypertension  heart_disease ever_married  \
0      9046    Male  67.0             0              1          Yes   
1     51676  Female  61.0             0              0          Yes   
2     31112    Male  80.0             0              1          Yes   
3     60182  Female  49.0             0              0          Yes   
4      1665  Female  79.0             1              0          Yes   
...     ...     ...   ...           ...            ...          ...   
5105  18234  Female  80.0             1              0          Yes   
5106  44873  Female  81.0             0              0          Yes   
5107  19723  Female  35.0             0              0          Yes   
5108  37544    Male  51.0             0              0          Yes   
5109  44679  Female  44.0             0              0          Yes   

          work_type Residence_type  avg_glucose_level   bmi  smoking_status  \
0           Private          Urban             228.69  36.6               1   
1     Self-employed          Rural             202.21   NaN               0   
2           Private          Rural             105.92  32.5               0   
3           Private          Urban             171.23  34.4               2   
4     Self-employed          Rural             174.12  24.0               0   
...             ...            ...                ...   ...             ...   
5105        Private          Urban              83.75   NaN               0   
5106  Self-employed          Urban             125.20  40.0               0   
5107  Self-employed          Rural              82.99  30.6               0   
5108        Private          Rural             166.29  25.6               1   
5109       Govt_job          Urban              85.28  26.2              -1   

      stroke  
0          1  
1          1  
2          1  
3          1  
4          1  
...      ...  
5105       0  
5106       0  
5107       0  
5108       0  
5109       0  

[5110 rows x 12 columns]

Step 3.2 Use ANOVA to conduct a test for our encoded dataset.

In [ ]:
# Create and display a contingency table used for our testing
smoking_stroke_table2 = pd.crosstab(smoking_df['stroke'], smoking_df['smoking_status'])
print(smoking_stroke_table2)

# ANOVA Testing to find P-Value
result = f_oneway(smoking_stroke_table2[0],smoking_stroke_table2[1], smoking_stroke_table2[2], smoking_stroke_table2[-1])
print("P-Value: ", result.pvalue)
smoking_status    -1     0    1    2
stroke                              
0               1497  1802  815  747
1                 47    90   70   42
P-Value:  0.9018883799637181

Step 4. Visualize the statistic result.

In [ ]:
contingency = pd.crosstab(smoking_df['smoking_status'], smoking_df['stroke'])
contingency.plot(kind= "bar")
Out[ ]:
<Axes: xlabel='smoking_status'>

We fail to reject the null hypothesis because the P-value exceeds the significance level 0.9019 > 0.05. This means that we fail to prove that smoking status influences the likelihood of having a stroke. No Post-Hoc testing is needed since there was not enough statistical proof of convincing significant differences between the four categories of smoking status and the diagnosis of stroke.

In [ ]:
plt.figure(figsize=(8, 6))
seas.violinplot(x='smoking_status', y='stroke', data=smoking_df, inner='quartile')
plt.ylabel('Stroke')
plt.xlabel('Encoded Smoking Status (-1: Unknown, 0: never smoked, 1: formerly smoked, 2: smokes)')
plt.title(f'Smoking Status and Stroke')
plt.grid(True)
plt.show()

**4. Unhealthy Eating Behaviors Data:**

¶

Source: https://www.kaggle.com/datasets/jillanisofttech/brain-stroke-dataset

**Overall Summary For This Dataset:**
¶

The dataset consists of 4,981 entries and 11 features, which encompass patient demographics, medical history, average glucose levels, and stroke diagnosis. It includes both categorical and numerical data types. No features appear over-represented according to the summary statistics provided. We can observe the correlation among most features is low, suggesting that each contributes independently to determining the stroke diagnosis. Boxplots and z-scores indicate that there are no significant outliers in age, average glucose level, and BMI. The binary nature of the target variable, stroke, suggests that a classification approach is appropriate for the primary analysis technique. These findings will inform the necessary preprocessing steps and help in selecting the appropriate analysis method to ensure thorough and accurate modeling.

**4A. Data Preprocessing**

¶

In the last step, we will take a look at the relationship between unhealthy eating behaviors (linked to increased average glucose levels) and occurence of stroke. First, create a dataframe named "glucose_df", read the "brain_stroke(avg_glucose_lvl).csv" file into it and display the dataframe.

In [ ]:
glucose_df = pd.read_csv('/content/drive/MyDrive/CSV/brain_stroke.csv')
print(glucose_df)
      gender   age  hypertension  heart_disease ever_married      work_type  \
0       Male  67.0             0              1          Yes        Private   
1       Male  80.0             0              1          Yes        Private   
2     Female  49.0             0              0          Yes        Private   
3     Female  79.0             1              0          Yes  Self-employed   
4       Male  81.0             0              0          Yes        Private   
...      ...   ...           ...            ...          ...            ...   
4976    Male  41.0             0              0           No        Private   
4977    Male  40.0             0              0          Yes        Private   
4978  Female  45.0             1              0          Yes       Govt_job   
4979    Male  40.0             0              0          Yes        Private   
4980  Female  80.0             1              0          Yes        Private   

     Residence_type  avg_glucose_level   bmi   smoking_status  stroke  
0             Urban             228.69  36.6  formerly smoked       1  
1             Rural             105.92  32.5     never smoked       1  
2             Urban             171.23  34.4           smokes       1  
3             Rural             174.12  24.0     never smoked       1  
4             Urban             186.21  29.0  formerly smoked       1  
...             ...                ...   ...              ...     ...  
4976          Rural              70.15  29.8  formerly smoked       0  
4977          Urban             191.15  31.1           smokes       0  
4978          Rural              95.02  31.8           smokes       0  
4979          Rural              83.94  30.0           smokes       0  
4980          Urban              83.75  29.1     never smoked       0  

[4981 rows x 11 columns]

As for the age feature, for this specific dataset, we will NOT filter the age. After applying the age filter to the dataset on unhealthy eating behaviors, we found that there were no stroke cases among young adults aged 18-30. This lack of data made it difficult to conduct a meaningful analysis within this specific age group. Despite this, we recognized the overall value of the dataset and decided to broaden our analysis to include all age groups. This approach allowed us to extract meaningful statistics and insights, which are still highly relevant to understanding the impact of dietary habits on stroke risk across a wider population.

**4B. Data Exploration**

¶

Now, we will conduct a Pearson Correlation Coefficient (r) between two variables, "avg_glucose_level" and "stroke," in hopes of visualizing their relationship. Generating a Pearson Correlation Coefficient will help to indicate the strength and direction of the linear relationship between an individual's avergage glucose levels and their stroke status.

In [ ]:
# Calculate the Pearson correlation coefficient (r) between "avg_glucose_level" and "stroke"
correlation_coefficient = glucose_df['avg_glucose_level'].corr(glucose_df['stroke'])

# Create a plot to visualize the relationship between "avg_glucose_level" and "stroke"
import seaborn as seas
plt.figure(figsize=(8, 6))
seas.violinplot(x='stroke', y='avg_glucose_level', data=glucose_df, inner='quartile')
#plt.scatter(glucose_df['avg_glucose_level'], glucose_df['stroke'], alpha=0.5)
plt.ylabel('Average Glucose Level')
plt.xlabel('Stroke Status')
plt.title(f'Average Glucose Level vs Stroke (Correlation Coefficient: {correlation_coefficient:.2f})')
plt.grid(True)
plt.show()

print("Pearson Correlation Coefficient (r):" , correlation_coefficient)
Pearson Correlation Coefficient (r): 0.13322732663313727

Unhealthy Eating Behaviors & Pearson Correlation Coefficient Conclusion:

The Pearson Correlation Coefficient (r) calculated for the relationship between 'avg_glucose_level' and 'stroke' is 0.133. This indicates a positive, but relatively weak, linear correlation between an individual's average glucose levels and their stroke status. Although there is a direct relationship, suggesting that higher glucose levels might be associated with an increased risk of stroke, the strength of this correlation is modest at best. This result implies that while average glucose levels are a factor in stroke risk, they are likely just one of multiple contributing factors. Therefore, further analysis involving additional variables and larger datasets may be necessary to fully understand the complex interactions that lead to stroke in young adults aged 18-30.

**5. Stroke Data:**

¶

Source: https://www.kaggle.com/datasets/teamincribo/stroke-prediction

**Overall Summary For This Dataset:**
¶

The dataset contains 15,000 entries and 22 features, including patient demographics, medical history, lifestyle factors, and stroke diagnosis. The features are a mix of categorical and numerical data. Based on the summary statistics, there is no indication of any feature being over-represented. We can observe that age has a moderate positive correlation with hypertension and heart disease, while average glucose level and BMI have very low correlation with other features, indicating they may independently contribute to the diagnosis. Boxplots and z-scores do not reveal potential outliers in age, average glucose level, BMI, and stress levels. The target variable, diagnosis, is binary, suggesting a classification approach for the primary analysis technique. Numerical features will require standardization or normalization, and categorical features will need appropriate encoding. Additionally, the "Symptoms" feature has missing values, which can be handled by imputation or exclusion strategies, but are not necessary. These insights will help guide the preprocessing steps and the choice of analysis technique.

5A. Data Preprocessing</h2>

In this step, we are creating a dataframe named "stroke_df", read the "stroke_prediction_dataset.csv" file into it and display the dataframe.

In [ ]:
stroke_df = pd.read_csv('/content/drive/MyDrive/CSV/stroke_prediction_dataset.csv')
print(stroke_df)
       Patient ID       Patient Name  Age  Gender  Hypertension  \
0           18153    Mamooty Khurana   56    Male             0   
1           62749  Kaira Subramaniam   80    Male             0   
2           32145      Dhanush Balan   26    Male             1   
3            6154        Ivana Baral   73    Male             0   
4           48973  Darshit Jayaraman   51    Male             1   
...           ...                ...  ...     ...           ...   
14995       13981          Keya Iyer   88  Female             1   
14996       87707       Anahita Virk   47  Female             0   
14997       33174         Ivana Kaur   35    Male             0   
14998       22343        Anvi Mannan   73    Male             0   
14999       11066      Gokul Trivedi   64  Female             0   

       Heart Disease Marital Status       Work Type Residence Type  \
0                  1        Married   Self-employed          Rural   
1                  0         Single   Self-employed          Urban   
2                  1        Married    Never Worked          Rural   
3                  0        Married    Never Worked          Urban   
4                  1       Divorced   Self-employed          Urban   
...              ...            ...             ...            ...   
14995              1       Divorced   Self-employed          Urban   
14996              0        Married         Private          Urban   
14997              0        Married  Government Job          Rural   
14998              0         Single   Self-employed          Urban   
14999              0         Single    Never Worked          Urban   

       Average Glucose Level  ...    Alcohol Intake Physical Activity  \
0                     130.91  ...    Social Drinker          Moderate   
1                     183.73  ...             Never               Low   
2                     189.00  ...            Rarely              High   
3                     185.29  ...  Frequent Drinker          Moderate   
4                     177.34  ...            Rarely               Low   
...                      ...  ...               ...               ...   
14995                 160.22  ...    Social Drinker              High   
14996                 107.58  ...             Never               Low   
14997                 134.90  ...            Rarely              High   
14998                 169.42  ...             Never              High   
14999                 186.88  ...            Rarely          Moderate   

      Stroke History Family History of Stroke  Dietary Habits Stress Levels  \
0                  0                      Yes           Vegan          3.48   
1                  0                       No           Paleo          1.73   
2                  0                      Yes           Paleo          7.31   
3                  0                       No           Paleo          5.35   
4                  0                      Yes     Pescatarian          6.84   
...              ...                      ...             ...           ...   
14995              0                       No           Paleo          1.12   
14996              1                       No     Gluten-Free          1.47   
14997              1                       No           Paleo          0.51   
14998              0                      Yes           Paleo          1.53   
14999              0                       No           Vegan          4.57   

      Blood Pressure Levels  Cholesterol Levels  \
0                   140/108   HDL: 68, LDL: 133   
1                    146/91    HDL: 63, LDL: 70   
2                    154/97    HDL: 59, LDL: 95   
3                    174/81   HDL: 70, LDL: 137   
4                    121/95    HDL: 65, LDL: 68   
...                     ...                 ...   
14995                171/92   HDL: 44, LDL: 153   
14996                155/71   HDL: 35, LDL: 183   
14997               121/110   HDL: 57, LDL: 159   
14998                157/74    HDL: 79, LDL: 91   
14999                133/81   HDL: 78, LDL: 179   

                                                Symptoms  Diagnosis  
0                          Difficulty Speaking, Headache     Stroke  
1        Loss of Balance, Headache, Dizziness, Confusion     Stroke  
2                                    Seizures, Dizziness     Stroke  
3      Seizures, Blurred Vision, Severe Fatigue, Head...  No Stroke  
4                                    Difficulty Speaking     Stroke  
...                                                  ...        ...  
14995                                                NaN  No Stroke  
14996                                Difficulty Speaking  No Stroke  
14997      Difficulty Speaking, Severe Fatigue, Headache     Stroke  
14998  Severe Fatigue, Numbness, Confusion, Dizziness...  No Stroke  
14999                                           Headache     Stroke  

[15000 rows x 22 columns]

We focus on young adults aged 18 to 30. By isolating this age group from the broader dataset, we aim to examine potential risk factors and stroke incidences in this age group.

In [ ]:
young_adults = stroke_df[(stroke_df['Age']>=18)&(stroke_df['Age']<=30)]
print(young_adults)
       Patient ID     Patient Name  Age  Gender  Hypertension  Heart Disease  \
2           32145    Dhanush Balan   26    Male             1              1   
12          66924     Ahana  Lalla   30  Female             0              1   
19          23954     Taran Khatri   25    Male             0              0   
25          36975      Jhanvi Brar   24  Female             0              0   
37          94512       Anvi Salvi   23  Female             0              0   
...           ...              ...  ...     ...           ...            ...   
14972       11839    Chirag Kurian   30    Male             0              1   
14974       30150  Alisha Banerjee   20  Female             0              0   
14981       12323        Pari Ravi   25    Male             0              0   
14983       40381        Sana Goel   18  Female             0              0   
14991       90658      Samaira Raj   26    Male             0              1   

      Marital Status       Work Type Residence Type  Average Glucose Level  \
2            Married    Never Worked          Rural                 189.00   
12          Divorced  Government Job          Urban                 163.15   
19           Married         Private          Urban                  71.38   
25           Married   Self-employed          Urban                  79.89   
37            Single  Government Job          Rural                 164.72   
...              ...             ...            ...                    ...   
14972        Married   Self-employed          Rural                 126.94   
14974       Divorced    Never Worked          Rural                 101.36   
14981         Single         Private          Rural                  77.64   
14983         Single   Self-employed          Urban                  68.26   
14991        Married   Self-employed          Rural                 145.05   

       ...    Alcohol Intake Physical Activity Stroke History  \
2      ...            Rarely              High              0   
12     ...  Frequent Drinker          Moderate              0   
19     ...            Rarely          Moderate              0   
25     ...    Social Drinker              High              1   
37     ...    Social Drinker               Low              1   
...    ...               ...               ...            ...   
14972  ...            Rarely          Moderate              0   
14974  ...             Never              High              0   
14981  ...  Frequent Drinker               Low              0   
14983  ...    Social Drinker          Moderate              1   
14991  ...    Social Drinker               Low              1   

      Family History of Stroke  Dietary Habits Stress Levels  \
2                          Yes           Paleo          7.31   
12                         Yes  Non-Vegetarian          9.19   
19                         Yes     Gluten-Free          0.46   
25                          No      Vegetarian          6.48   
37                         Yes     Gluten-Free          7.86   
...                        ...             ...           ...   
14972                       No     Pescatarian          9.51   
14974                      Yes     Pescatarian          2.26   
14981                      Yes           Paleo          2.69   
14983                       No      Vegetarian          6.79   
14991                       No     Pescatarian          0.71   

      Blood Pressure Levels  Cholesterol Levels  \
2                    154/97    HDL: 59, LDL: 95   
12                   114/67    HDL: 80, LDL: 83   
19                   170/64   HDL: 72, LDL: 174   
25                   151/65   HDL: 73, LDL: 111   
37                   148/74    HDL: 30, LDL: 62   
...                     ...                 ...   
14972                113/65   HDL: 55, LDL: 179   
14974                159/94    HDL: 42, LDL: 99   
14981                135/66   HDL: 58, LDL: 161   
14983                136/66   HDL: 59, LDL: 172   
14991               180/110    HDL: 33, LDL: 99   

                                                Symptoms  Diagnosis  
2                                    Seizures, Dizziness     Stroke  
12                             Loss of Balance, Numbness     Stroke  
19                                              Seizures     Stroke  
25     Numbness, Loss of Balance, Numbness, Blurred V...     Stroke  
37                    Blurred Vision, Seizures, Weakness     Stroke  
...                                                  ...        ...  
14972                                                NaN  No Stroke  
14974                           Seizures, Severe Fatigue     Stroke  
14981  Blurred Vision, Headache, Severe Fatigue, Loss...     Stroke  
14983  Severe Fatigue, Severe Fatigue, Headache, Seiz...     Stroke  
14991                               Confusion, Confusion  No Stroke  

[2662 rows x 22 columns]

Let us clean the young_adults dataframe to ensure proper data traversal. Drop any duplicates.

In [ ]:
duplicate_rows = young_adults[young_adults.duplicated()]
num_duplicates = young_adults.duplicated().sum()

print(duplicate_rows)
Empty DataFrame
Columns: [Patient ID, Patient Name, Age, Gender, Hypertension, Heart Disease, Marital Status, Work Type, Residence Type, Average Glucose Level, Body Mass Index (BMI), Smoking Status, Alcohol Intake, Physical Activity, Stroke History, Family History of Stroke, Dietary Habits, Stress Levels, Blood Pressure Levels, Cholesterol Levels, Symptoms, Diagnosis]
Index: []

[0 rows x 22 columns]

No duplicates were found. Detect any outliers.

In [ ]:
# Function to calculate Z-scores
def z_score(df, threshold=3.5):
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    z_scores = np.abs((df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std())
    outliers = (z_scores > threshold).any(axis=1)
    return outliers

# Detect outliers
outliers = z_score(young_adults)

# Show outliers
print("Outliers detected:")
print(young_adults[outliers])
Outliers detected:
Empty DataFrame
Columns: [Patient ID, Patient Name, Age, Gender, Hypertension, Heart Disease, Marital Status, Work Type, Residence Type, Average Glucose Level, Body Mass Index (BMI), Smoking Status, Alcohol Intake, Physical Activity, Stroke History, Family History of Stroke, Dietary Habits, Stress Levels, Blood Pressure Levels, Cholesterol Levels, Symptoms, Diagnosis]
Index: []

[0 rows x 22 columns]

No outliers were found. The young_adults dataframe is clean and ready for use!

**5B. Data Exploration**

¶

**5B1. Stress Levels vs Stroke Diagnosis**
¶

Lets begin by using Chi-Squared Test on whether the stress level has a correlation with stroke diagnosis. The Chi-Squared test is used when we want to determine if there is a significant association between two categorical variables. In this situation, we want to know whether there is a relationship between two categorical variables: the stress level and whether the patient was diagnosed with a stroke. Therefore, chi-squared test is suitable for this situation. We assume that the alpha value is 0.05. The following statement are Null Hypothesis and Alternate Hypothesis for the Chi-Squared Test.

  • $H_{0}$: The stress level does not have an effect on the likelihood of stroke diagnosis in the patient.

  • $H_{A}$: The stress level does have an effect on the likelihood of stroke diagnosis in the patient.

Firstly, we categorize stress levels into 10 groups: 0-1, 1-2, 2-3, 3-4, 4-5, 5-6, 6-7, 7-8, 8-9, and 9-10 by creating a new column that is called "New Stress Levels".

In [ ]:
conditions = [
    (young_adults['Stress Levels'] >= 0.00) & (young_adults['Stress Levels'] <= 1.00),
    (young_adults['Stress Levels'] >= 1.01) & (young_adults['Stress Levels'] <= 2.00),
    (young_adults['Stress Levels'] >= 2.01) & (young_adults['Stress Levels'] <= 3.00),
    (young_adults['Stress Levels'] >= 3.01) & (young_adults['Stress Levels'] <= 4.00),
    (young_adults['Stress Levels'] >= 4.01) & (young_adults['Stress Levels'] <= 5.00),
    (young_adults['Stress Levels'] >= 5.01) & (young_adults['Stress Levels'] <= 6.00),
    (young_adults['Stress Levels'] >= 6.01) & (young_adults['Stress Levels'] <= 7.00),
    (young_adults['Stress Levels'] >= 7.01) & (young_adults['Stress Levels'] <= 8.00),
    (young_adults['Stress Levels'] >= 8.01) & (young_adults['Stress Levels'] <= 9.00),
    (young_adults['Stress Levels'] >= 9.01) & (young_adults['Stress Levels'] <= 10.00)
]
# create a list of the values we want to assign for each condition
values = ['0-1', '1-2', '2-3', '3-4', '4-5', '5-6', '6-7', '7-8', '8-9', '9-10']

# create a new column and use np.select to assign values to it using our lists as arguments
young_adults['New Stress Levels'] = np.select(conditions, values)
print(young_adults)
       Patient ID     Patient Name  Age  Gender  Hypertension  Heart Disease  \
2           32145    Dhanush Balan   26    Male             1              1   
12          66924     Ahana  Lalla   30  Female             0              1   
19          23954     Taran Khatri   25    Male             0              0   
25          36975      Jhanvi Brar   24  Female             0              0   
37          94512       Anvi Salvi   23  Female             0              0   
...           ...              ...  ...     ...           ...            ...   
14972       11839    Chirag Kurian   30    Male             0              1   
14974       30150  Alisha Banerjee   20  Female             0              0   
14981       12323        Pari Ravi   25    Male             0              0   
14983       40381        Sana Goel   18  Female             0              0   
14991       90658      Samaira Raj   26    Male             0              1   

      Marital Status       Work Type Residence Type  Average Glucose Level  \
2            Married    Never Worked          Rural                 189.00   
12          Divorced  Government Job          Urban                 163.15   
19           Married         Private          Urban                  71.38   
25           Married   Self-employed          Urban                  79.89   
37            Single  Government Job          Rural                 164.72   
...              ...             ...            ...                    ...   
14972        Married   Self-employed          Rural                 126.94   
14974       Divorced    Never Worked          Rural                 101.36   
14981         Single         Private          Rural                  77.64   
14983         Single   Self-employed          Urban                  68.26   
14991        Married   Self-employed          Rural                 145.05   

       ...  Physical Activity Stroke History Family History of Stroke  \
2      ...               High              0                      Yes   
12     ...           Moderate              0                      Yes   
19     ...           Moderate              0                      Yes   
25     ...               High              1                       No   
37     ...                Low              1                      Yes   
...    ...                ...            ...                      ...   
14972  ...           Moderate              0                       No   
14974  ...               High              0                      Yes   
14981  ...                Low              0                      Yes   
14983  ...           Moderate              1                       No   
14991  ...                Low              1                       No   

       Dietary Habits  Stress Levels Blood Pressure Levels Cholesterol Levels  \
2               Paleo           7.31                154/97   HDL: 59, LDL: 95   
12     Non-Vegetarian           9.19                114/67   HDL: 80, LDL: 83   
19        Gluten-Free           0.46                170/64  HDL: 72, LDL: 174   
25         Vegetarian           6.48                151/65  HDL: 73, LDL: 111   
37        Gluten-Free           7.86                148/74   HDL: 30, LDL: 62   
...               ...            ...                   ...                ...   
14972     Pescatarian           9.51                113/65  HDL: 55, LDL: 179   
14974     Pescatarian           2.26                159/94   HDL: 42, LDL: 99   
14981           Paleo           2.69                135/66  HDL: 58, LDL: 161   
14983      Vegetarian           6.79                136/66  HDL: 59, LDL: 172   
14991     Pescatarian           0.71               180/110   HDL: 33, LDL: 99   

                                                Symptoms  Diagnosis  \
2                                    Seizures, Dizziness     Stroke   
12                             Loss of Balance, Numbness     Stroke   
19                                              Seizures     Stroke   
25     Numbness, Loss of Balance, Numbness, Blurred V...     Stroke   
37                    Blurred Vision, Seizures, Weakness     Stroke   
...                                                  ...        ...   
14972                                                NaN  No Stroke   
14974                           Seizures, Severe Fatigue     Stroke   
14981  Blurred Vision, Headache, Severe Fatigue, Loss...     Stroke   
14983  Severe Fatigue, Severe Fatigue, Headache, Seiz...     Stroke   
14991                               Confusion, Confusion  No Stroke   

      New Stress Levels  
2                   7-8  
12                 9-10  
19                  0-1  
25                  6-7  
37                  7-8  
...                 ...  
14972              9-10  
14974               2-3  
14981               2-3  
14983               6-7  
14991               0-1  

[2662 rows x 23 columns]
<ipython-input-1683-f09478bf7528>:17: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  young_adults['New Stress Levels'] = np.select(conditions, values)

We create a contingency table and display it.

In [ ]:
stress_table = pd.crosstab(young_adults['New Stress Levels'], young_adults['Diagnosis'])
print(stress_table)
Diagnosis          No Stroke  Stroke
New Stress Levels                   
0-1                      148     131
1-2                      111     155
2-3                      130     135
3-4                      126     124
4-5                      139     112
5-6                      143     138
6-7                      136     127
7-8                      153     148
8-9                      117     141
9-10                     129     119

Next, we create a plot showing the relationship between the stress levels and the counts of patients having stroke.

In [ ]:
stress_table.plot(kind='bar', colormap='Paired')
plt.ylabel('Count')
Out[ ]:
Text(0, 0.5, 'Count')

Now, we display P-Value by applying the chi-squared test using the chi2_contingency() function.

In [ ]:
chi2, p_value, dof, expected = chi2_contingency(stress_table)
print("P-Value:", p_value)
P-Value: 0.10741461278569753

Because p-value > alpha value (0.10741461278569753 > 0.05), we fail to reject the null hypothesis. There is not enough evidence to suggest that the stress level has an effect on the likelihood of being diagnosed with a stroke. We create a plot to visualize the relationship between "New Stress Levels" and "Stroke Diagnosis":

In [ ]:
plt.figure(figsize=(8, 6))
seas.violinplot(x='Diagnosis', y='New Stress Levels', data=young_adults, inner='quartile')
plt.ylabel('New Stress Levels')
plt.xlabel('Stroke Diagnosis')
plt.title(f'Stress Levels vs Stroke (Contingency Table)')
plt.grid(True)
plt.show()

**5B2. Alcohol Intake vs Stroke Diagnosis**
¶

To examine the relationship bewteen Alcohol Intake and a Stroke Diagnosis, we can use the Chi-Squared Test, which will allow us to compare the two categorical values. The first variable, Alcohol Intake, is organized into four categories: 'Never', 'Rarely', 'Social Drinker', and 'Frequent Drinker'. The second variable, Stroke Diagnosis, is organized into either 'No Stroke' or 'Stroke'.

For the Chi-Squared Test, we can assume a significance level, or alpha value, of 0.05. Our Null and Alternate hypotheses are as follows:

$H_{0}$: The alcohol intake category does not have an impact on stroke diagnosis.

$H_{A}$: The alcohol intake category does have an impact on stroke diagnosis.

Because the dataset already includes discrete categories to describe alcohol consumption levels, we can immediately create a contingency table.

In [ ]:
alcohol_level_table = pd.crosstab(young_adults['Alcohol Intake'], young_adults['Diagnosis'])
print(alcohol_level_table)
Diagnosis         No Stroke  Stroke
Alcohol Intake                     
Frequent Drinker        338     324
Never                   336     296
Rarely                  328     366
Social Drinker          330     344

We can visualize the same relationship using a bar graph.

In [ ]:
alcohol_level_table.plot(kind='bar')
plt.title('Alcohol Intake vs Stroke Diagnosis')
plt.legend(title='Diagnosis', bbox_to_anchor=(1.05, 1.0), loc='upper left')
plt.ylabel('Count')
Out[ ]:
Text(0, 0.5, 'Count')

Using the chi2_contingency function, we can conduct the chi-squared test to determine if there is a relationship between alcohol intake categories and stroke diagnosis.

In [ ]:
# Conduct the chi2 test on the contingency table
chi2, p_value, dof, expected = chi2_contingency(alcohol_level_table)
print("P-Value for Alcohol Consumption vs Diagnosis:", p_value)
P-Value for Alcohol Consumption vs Diagnosis: 0.15787917256302525

Because the p-value obtained was greater than alpha, we cannot reject the null hypothesis. As such, we can conclude that we do not have sufficient evidence that alcohol intake has an impact on the liklihood of stroke diagnosis.

**5B3. Smoking Status vs Stroke Diagnosis**
¶

We use the Chi-Squared Test to determine the correlation between the categorical values "Formerly Smoked", "Non-smoker", and "Currently Smokes" of the Smoking Status and the likelyhood of stroke. Chi-Squared Test is suitable since our variables "Smoking Status" and "Diagnosis" are both categorical data.

Hypothesis:

  • $H_{0}$: Smoking status does not have an effect on the likelihood of stroke for young adults.

  • $H_{A}$: Smoking status does have an effect on the likelihood of stroke for young adults.

Step 1. Create a contingency table of the variables "Smoking Status" and "Diagnosis"

In [ ]:
# Clean the columns for accurate result
young_adults["Smoking Status"].str.strip()
young_adults["Diagnosis"].str.strip()

# Contingency Table
smoking_stroke_table = pd.crosstab(young_adults["Smoking Status"],young_adults["Diagnosis"])
print(smoking_stroke_table)
Diagnosis         No Stroke  Stroke
Smoking Status                     
Currently Smokes        449     466
Formerly Smoked         415     408
Non-smoker              468     456

Step 2. Show a bar graph to visually able to compare the relationship between smoking status and the diagnosis of stroke.

In [ ]:
# Data Comparison - Bar Graph (adequate for comparison between discrete values)
smoking_stroke_table.plot(kind = "bar")
Out[ ]:
<Axes: xlabel='Smoking Status'>

Step 3. Find P-Value to determine whether there are significant differences of diagnosis depending on the status of smoking. Our significant value is 0.05.

In [ ]:
# P-Value
result = chi2_contingency(smoking_stroke_table)
print(result.pvalue)
0.7673106445186333

Step 4. Conclusion of Hypothesis Testing

Due to the p-value being 0.7673 > 0.05. There isn't enough statistical proof to prove the impact of smoking status on stroke diagnosis. There aren't significant differences between the levels of the smoking status impacting the diagnosis of stroke.

Step 5. Visualize the correlation between the smoking status and the stroke diagnosis.

In [ ]:
plt.figure(figsize=(8, 6))
seas.violinplot(x='Diagnosis', y='Smoking Status', data=young_adults, inner='quartile')
plt.ylabel('Smoking Status')
plt.xlabel('Stroke Diagnosis')
plt.title(f'Smoking Status impacting Stroke Diagnosis')
plt.grid(True)
plt.show()

**5B4. Average Glucose Level vs Stroke Diagnosis**
¶

A Chi-Square Test was chosen here because it will help us examine the independence between a categorical independent variable (average glucose levels) and the binary outcome (presence of stroke). As we stated in the beginning, we will be analyzing data amongst young adults who are between the ages of 18-30. Therefore, we will use a dataframe called young_adults that will only contain people who are considered young adults in stroke_df (the main dataset).

In [ ]:
print(young_adults)
       Patient ID     Patient Name  Age  Gender  Hypertension  Heart Disease  \
2           32145    Dhanush Balan   26    Male             1              1   
12          66924     Ahana  Lalla   30  Female             0              1   
19          23954     Taran Khatri   25    Male             0              0   
25          36975      Jhanvi Brar   24  Female             0              0   
37          94512       Anvi Salvi   23  Female             0              0   
...           ...              ...  ...     ...           ...            ...   
14972       11839    Chirag Kurian   30    Male             0              1   
14974       30150  Alisha Banerjee   20  Female             0              0   
14981       12323        Pari Ravi   25    Male             0              0   
14983       40381        Sana Goel   18  Female             0              0   
14991       90658      Samaira Raj   26    Male             0              1   

      Marital Status       Work Type Residence Type  Average Glucose Level  \
2            Married    Never Worked          Rural                 189.00   
12          Divorced  Government Job          Urban                 163.15   
19           Married         Private          Urban                  71.38   
25           Married   Self-employed          Urban                  79.89   
37            Single  Government Job          Rural                 164.72   
...              ...             ...            ...                    ...   
14972        Married   Self-employed          Rural                 126.94   
14974       Divorced    Never Worked          Rural                 101.36   
14981         Single         Private          Rural                  77.64   
14983         Single   Self-employed          Urban                  68.26   
14991        Married   Self-employed          Rural                 145.05   

       ...  Physical Activity Stroke History Family History of Stroke  \
2      ...               High              0                      Yes   
12     ...           Moderate              0                      Yes   
19     ...           Moderate              0                      Yes   
25     ...               High              1                       No   
37     ...                Low              1                      Yes   
...    ...                ...            ...                      ...   
14972  ...           Moderate              0                       No   
14974  ...               High              0                      Yes   
14981  ...                Low              0                      Yes   
14983  ...           Moderate              1                       No   
14991  ...                Low              1                       No   

       Dietary Habits  Stress Levels Blood Pressure Levels Cholesterol Levels  \
2               Paleo           7.31                154/97   HDL: 59, LDL: 95   
12     Non-Vegetarian           9.19                114/67   HDL: 80, LDL: 83   
19        Gluten-Free           0.46                170/64  HDL: 72, LDL: 174   
25         Vegetarian           6.48                151/65  HDL: 73, LDL: 111   
37        Gluten-Free           7.86                148/74   HDL: 30, LDL: 62   
...               ...            ...                   ...                ...   
14972     Pescatarian           9.51                113/65  HDL: 55, LDL: 179   
14974     Pescatarian           2.26                159/94   HDL: 42, LDL: 99   
14981           Paleo           2.69                135/66  HDL: 58, LDL: 161   
14983      Vegetarian           6.79                136/66  HDL: 59, LDL: 172   
14991     Pescatarian           0.71               180/110   HDL: 33, LDL: 99   

                                                Symptoms  Diagnosis  \
2                                    Seizures, Dizziness     Stroke   
12                             Loss of Balance, Numbness     Stroke   
19                                              Seizures     Stroke   
25     Numbness, Loss of Balance, Numbness, Blurred V...     Stroke   
37                    Blurred Vision, Seizures, Weakness     Stroke   
...                                                  ...        ...   
14972                                                NaN  No Stroke   
14974                           Seizures, Severe Fatigue     Stroke   
14981  Blurred Vision, Headache, Severe Fatigue, Loss...     Stroke   
14983  Severe Fatigue, Severe Fatigue, Headache, Seiz...     Stroke   
14991                               Confusion, Confusion  No Stroke   

      New Stress Levels  
2                   7-8  
12                 9-10  
19                  0-1  
25                  6-7  
37                  7-8  
...                 ...  
14972              9-10  
14974               2-3  
14981               2-3  
14983               6-7  
14991               0-1  

[2662 rows x 23 columns]

In preparation for conducting a Chi-square test, we will categorize the "Average Glucose Levels" of individuals into three distinct groups based on guidelines provided by the Cleveland Clinic. Specifically, glucose levels below 117.0 will be classified as 'Normal', levels from 117.0 to 137.0 will be considered 'Pre-diabetic', and any levels exceeding 137.0 will be labeled as 'Diabetic' (Cleveland Clinic, 2022). This categorization will allow us to systematically analyze the association between glucose levels and other variables in our dataset.

In [ ]:
def categorize_glucose_level(agl):
    if agl < 117.0:
        return 'Normal'
    elif agl < 137.0:
        return 'Pre-diabetic'
    else:
        return 'Diabetic'

young_adults.loc[:, 'Category'] = young_adults['Average Glucose Level'].apply(categorize_glucose_level)
print(young_adults)
       Patient ID     Patient Name  Age  Gender  Hypertension  Heart Disease  \
2           32145    Dhanush Balan   26    Male             1              1   
12          66924     Ahana  Lalla   30  Female             0              1   
19          23954     Taran Khatri   25    Male             0              0   
25          36975      Jhanvi Brar   24  Female             0              0   
37          94512       Anvi Salvi   23  Female             0              0   
...           ...              ...  ...     ...           ...            ...   
14972       11839    Chirag Kurian   30    Male             0              1   
14974       30150  Alisha Banerjee   20  Female             0              0   
14981       12323        Pari Ravi   25    Male             0              0   
14983       40381        Sana Goel   18  Female             0              0   
14991       90658      Samaira Raj   26    Male             0              1   

      Marital Status       Work Type Residence Type  Average Glucose Level  \
2            Married    Never Worked          Rural                 189.00   
12          Divorced  Government Job          Urban                 163.15   
19           Married         Private          Urban                  71.38   
25           Married   Self-employed          Urban                  79.89   
37            Single  Government Job          Rural                 164.72   
...              ...             ...            ...                    ...   
14972        Married   Self-employed          Rural                 126.94   
14974       Divorced    Never Worked          Rural                 101.36   
14981         Single         Private          Rural                  77.64   
14983         Single   Self-employed          Urban                  68.26   
14991        Married   Self-employed          Rural                 145.05   

       ...  Stroke History Family History of Stroke  Dietary Habits  \
2      ...               0                      Yes           Paleo   
12     ...               0                      Yes  Non-Vegetarian   
19     ...               0                      Yes     Gluten-Free   
25     ...               1                       No      Vegetarian   
37     ...               1                      Yes     Gluten-Free   
...    ...             ...                      ...             ...   
14972  ...               0                       No     Pescatarian   
14974  ...               0                      Yes     Pescatarian   
14981  ...               0                      Yes           Paleo   
14983  ...               1                       No      Vegetarian   
14991  ...               1                       No     Pescatarian   

      Stress Levels  Blood Pressure Levels Cholesterol Levels  \
2              7.31                 154/97   HDL: 59, LDL: 95   
12             9.19                 114/67   HDL: 80, LDL: 83   
19             0.46                 170/64  HDL: 72, LDL: 174   
25             6.48                 151/65  HDL: 73, LDL: 111   
37             7.86                 148/74   HDL: 30, LDL: 62   
...             ...                    ...                ...   
14972          9.51                 113/65  HDL: 55, LDL: 179   
14974          2.26                 159/94   HDL: 42, LDL: 99   
14981          2.69                 135/66  HDL: 58, LDL: 161   
14983          6.79                 136/66  HDL: 59, LDL: 172   
14991          0.71                180/110   HDL: 33, LDL: 99   

                                                Symptoms  Diagnosis  \
2                                    Seizures, Dizziness     Stroke   
12                             Loss of Balance, Numbness     Stroke   
19                                              Seizures     Stroke   
25     Numbness, Loss of Balance, Numbness, Blurred V...     Stroke   
37                    Blurred Vision, Seizures, Weakness     Stroke   
...                                                  ...        ...   
14972                                                NaN  No Stroke   
14974                           Seizures, Severe Fatigue     Stroke   
14981  Blurred Vision, Headache, Severe Fatigue, Loss...     Stroke   
14983  Severe Fatigue, Severe Fatigue, Headache, Seiz...     Stroke   
14991                               Confusion, Confusion  No Stroke   

      New Stress Levels      Category  
2                   7-8      Diabetic  
12                 9-10      Diabetic  
19                  0-1        Normal  
25                  6-7        Normal  
37                  7-8      Diabetic  
...                 ...           ...  
14972              9-10  Pre-diabetic  
14974               2-3        Normal  
14981               2-3        Normal  
14983               6-7        Normal  
14991               0-1      Diabetic  

[2662 rows x 24 columns]
<ipython-input-1696-ad58dde6d0c3>:9: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  young_adults.loc[:, 'Category'] = young_adults['Average Glucose Level'].apply(categorize_glucose_level)

Next, we will apply some hypothesis testing.

  • $H_{0}$: The category of the average glucose levels does not have an effect on the likelihood of stroke occurrence in the patient

  • $H_{A}$: The category of the average glucose levels does have an effect on the likelihood of stroke occurrence in the patient

Our plan is to apply Chi-Squared Test. For that first, we will probably want a contingency table and a way to create one. Create a contingency table and display it.

In [ ]:
cont = pd.crosstab(young_adults['Category'], young_adults['Diagnosis'])
print(cont)
Diagnosis     No Stroke  Stroke
Category                       
Diabetic            592     616
Normal              541     527
Pre-diabetic        199     187

Create a plot showing the relationship between the average glucose level categories, and the occurrence of stroke.

In [ ]:
cont.plot(kind = 'bar', title='Average Glucose Level vs Stroke (Contingency Table))', xlabel='Average Glucose Level Category', ylabel='Number of Patients')
Out[ ]:
<Axes: title={'center': 'Average Glucose Level vs Stroke (Contingency Table))'}, xlabel='Average Glucose Level Category', ylabel='Number of Patients'>

Next, we want to conduct the chi-squared test using the chi2_contingency() function. Display P-Value by applying the Chi-Squared Test using the chi2_contingency() function.

In [ ]:
ob = spy.stats.contingency.chi2_contingency(cont)
print(ob.pvalue)
0.5969342118907002

Based on the obtained P-Value, determine whether to reject or fail to reject the null hypothesis:

In hypothesis testing, we have set our class's significance level to be 0.05, and this value will act as a threshold to determine whether the p-value indicates a statistically significant result. If the p-value is less than or equal to the significance level (0.05), we reject the null hypothesis. Otherwise, we fail to reject the null hypothesis. Since 0.596934 > 0.05, the p-value is higher than the threshold set for statistical significance. Since the p-value is much greater than 0.05, it suggests that the observed data is quite likely under the null hypothesis of no association between the average glucose levels category and the likelihood of stroke occurring.

Unhealthy Eating Behaviors & Chi-Square Test Conclusion:

The obtained P-Value of 0.596934 implies that there is not enough statistical evidence to conclude that the category of the average glucose levels affects the likelihood of stroke occurrence, meaning we fail to reject the null hypothesis.

**III. Machine Learning Analysis and Visualization**

¶

In this section, based on the results in the previous sections, we will apply machine learning techniques (e.g., classification, regression, etc.) to the Stroke Data to explore how lifestyle factors such as stress, alcohol consumption, smoking, and unhealthy eating behaviors correlate with the increasing incidence of strokes in young adults.

**1. Stress Levels vs Stroke Diagnosis**

¶

First of all, we display our data again.

In [ ]:
print(stress_table)
Diagnosis          No Stroke  Stroke
New Stress Levels                   
0-1                      148     131
1-2                      111     155
2-3                      130     135
3-4                      126     124
4-5                      139     112
5-6                      143     138
6-7                      136     127
7-8                      153     148
8-9                      117     141
9-10                     129     119

Look at the above data, we can see that the stress levels and stroke incidence are continuous variables. Therefore, linear regression is a suitable choice for the analysis of the relationship between stress levels and stroke incidence.

We will convert the provided data into a suitable format for regression analysis by creating a dataset with stress levels and stroke incidences. The independent variable (predictor) will be the stress levels, and the dependent variable (response) will be the number of stroke incidences.

In [ ]:
data = {
    "Stress Levels": [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5],
    "Stroke Incidence": [131, 155, 135, 124, 112, 138, 127, 148, 141, 119]
}
stress2_df = pd.DataFrame(data)

Then, we will use linear regression to model the relationship between stress levels and stroke incidence.

In [ ]:
X = stress2_df["Stress Levels"].values.reshape(-1, 1)
y = stress2_df["Stroke Incidence"].values

model = LinearRegression()

Next, we will train the model by fitting the regression model to the data.

In [ ]:
model.fit(X, y)
Out[ ]:
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()

In this step, we will interpret the regression coefficients to understand the correlation and display the results.

In [ ]:
coef = model.coef_[0]
intercept = model.intercept_
r_squared = model.score(X, y)

print("Regression Coefficient (Slope):", coef)
print("\n Intercept:", intercept)
print("\n R-squared:", r_squared)
Regression Coefficient (Slope): -0.6424242424242423

 Intercept: 136.21212121212122

 R-squared: 0.02182595182595204

Finally, we will plot the data and the regression line.

In [ ]:
plt.scatter(stress2_df["Stress Levels"], stress2_df["Stroke Incidence"], color='blue')
plt.plot(stress2_df["Stress Levels"], model.predict(X), color='red')
plt.xlabel('Stress Levels')
plt.ylabel('Stroke Incidence')
plt.title('Stress Levels vs Stroke Incidence')
plt.show()

Analysis of the Results

  • Regression Coefficient (Slope): -0.6424242424242423

A negative coefficient suggests that as stress levels increase, stroke incidence tends to decrease slightly.

  • Intercept: 136.21212121212122

This is the expected stroke incidence when the stress level is 0.

  • R_squared: 0.02182595182595204

The R-squared value suggests that only about 2.2% of the variance in stroke incidence is explained by stress levels. This indicates a weak correlation.

**2. Alcohol Intake vs Stroke Diagnosis**

¶
  1. Feature Engineering our data
In [ ]:
# Create a new dataframe
alcohol_stroke_ml_data = young_adults[["Alcohol Intake", "Diagnosis"]]

# One-hot encode the alcohol intake status
all_consumption_levels = alcohol_stroke_ml_data['Alcohol Intake'].str.get_dummies(sep=', ')
alcohol_stroke_ml_data = pd.concat([alcohol_stroke_ml_data, all_consumption_levels], axis=1).drop(columns=['Alcohol Intake'])

# Change the Stroke / No Stroke labels to 1 and 0
alcohol_stroke_ml_data.replace({"Stroke":1, "No Stroke":0}, inplace=True)
  1. Create our model and set up the training and testing data
In [ ]:
# Create a decision tree model
alcohol_model = DecisionTreeClassifier()

X_alc = alcohol_stroke_ml_data.drop('Diagnosis', axis=1)
Y_alc = alcohol_stroke_ml_data['Diagnosis']

x_alc_train, x_alc_test, y_alc_train, y_alc_test = train_test_split(X_alc, Y_alc, test_size= 0.2, random_state = 42)

alcohol_model.fit(x_alc_train, y_alc_train)
Out[ ]:
DecisionTreeClassifier()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier()
  1. Evaluate the performance of our model
In [ ]:
# Evaluate the Performance
predict_al = alcohol_model.predict(x_alc_test)


accuracy = accuracy_score(y_alc_test, predict_al)

print(f"Accuracy of predictions using DecisionTree: {accuracy}")
print(classification_report(y_alc_test, predict_al))
Accuracy of predictions using DecisionTree: 0.4896810506566604
              precision    recall  f1-score   support

           0       0.49      0.49      0.49       267
           1       0.49      0.49      0.49       266

    accuracy                           0.49       533
   macro avg       0.49      0.49      0.49       533
weighted avg       0.49      0.49      0.49       533

  1. Display the evaluation results
In [ ]:
al_matrix = confusion_matrix(y_alc_test, predict_al)
plt.figure(figsize=(8, 6))
seas.heatmap(al_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['No Stroke', 'Stroke'], yticklabels=['No Stroke', 'Stroke'])
plt.title('Confusion Matrix of Decision Tree Predictions')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

Analysis of the Results

The accuracy falls below 50%, meaning the model is performing worse than chance at predicting a stroke. This demonstrates that drinking patterns alone is not enough data for a model to predict a stroke

**3. Smoking Status vs Stroke Diagnosis**

¶

We will use the Random Forest model to find the correlation between the Smoking Status and Stroke Diagnosis and how predictions can be made through the feature. As predicting how smoking impacts diagnosis is a binary classification, using Random Forest for the purpose of ensembling is a good method.

  1. Apply Feature Engineering to our dataset
In [ ]:
# Feature : Smoking Status (categorical categories)
print (young_adults['Smoking Status'].unique())
# Label(prediction) : Diagnosis (categorical categories)
print (young_adults['Diagnosis'].unique())

#print(young_adults)
# Create a Data frame for our model : add additional features (not including other factors) to prevent underfitting
smoking_stroke_ml_df = young_adults[["Smoking Status", "Diagnosis"]]


# Convert the categorical values to numeric values
def convert_status2 (status) :
  if status == "Non-smoker" :
    return 0
  elif status == "Formerly Smoked" :
    return 1
  elif status == "Currently Smokes" :
    return 2

def convert_diagnosis2 (status) :
  if status == "No Stroke" :
    return 0
  elif status == "Stroke" :
    return 1

# Display the converted values
smoking_stroke_ml_df.loc[:,"Smoking Status"] = smoking_stroke_ml_df["Smoking Status"].apply(convert_status2)
smoking_stroke_ml_df.loc[:,"Diagnosis"] = smoking_stroke_ml_df["Diagnosis"].apply(convert_diagnosis2).astype(int)


print(smoking_stroke_ml_df)

# Display the unique values of the converted feature
print(smoking_stroke_ml_df['Diagnosis'].unique())
print(smoking_stroke_ml_df['Smoking Status'].unique())
['Formerly Smoked' 'Non-smoker' 'Currently Smokes']
['Stroke' 'No Stroke']
      Smoking Status Diagnosis
2                  1         1
12                 1         1
19                 0         1
25                 2         1
37                 1         1
...              ...       ...
14972              2         0
14974              1         1
14981              2         1
14983              2         1
14991              0         0

[2662 rows x 2 columns]
[1 0]
[1 0 2]
  1. Split the feature engineered data
In [ ]:
# Split the Data to training and testing data with testing being  0.2
X = smoking_stroke_ml_df[["Smoking Status"]]
y = smoking_stroke_ml_df['Diagnosis'] >= 1

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state = 42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)
  1. Train the training data using Random Forest Model
In [ ]:
# Create our model
smoking_model = RandomForestClassifier()
# Train the data using RandomForestClassifier -> ensembling
smoking_model.fit(X_train, y_train)
Out[ ]:
RandomForestClassifier()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier()
  1. Evaluate the performance of the Model
In [ ]:
# Evaluate the Performance
predictions = smoking_model.predict(X_test)

accuracy = accuracy_score(y_test, predictions)

print(f"Accuracy of prediction: {accuracy}")
print(classification_report(y_test, predictions))

print("The accuracy of this model is not overall ideal")
Accuracy of prediction: 0.4521575984990619
              precision    recall  f1-score   support

       False       0.47      0.66      0.55       267
        True       0.42      0.24      0.31       266

    accuracy                           0.45       533
   macro avg       0.44      0.45      0.43       533
weighted avg       0.44      0.45      0.43       533

The accuracy of this model is not overall ideal

Analysis of the Results

This model does not perform well in distinguishing the diagnosis of stroke. Accuracy scores are near 0.5 meaning that its performance is similar to random guessing. Looking at the f1-score for both classes is low indicating that we need further investigation on why this is happening.

  1. Show the confusion matrix to visualize the performance of the model.
In [ ]:
# Use the confusion matrix to visualize the positives (true false) and negatives (true false)
smoking_conf_matrix = confusion_matrix(y_test, predictions)

plt.figure(figsize=(12, 6))
plt.xlabel('Prediction')
plt.ylabel('Actual')
plt.title('Confusion Matrix of Smoking Status VS Diagnosis of Stroke')
seas.heatmap(smoking_conf_matrix, annot=True, fmt="d", xticklabels=["No Stroke", "Stroke"], yticklabels=["No Stroke", "Stroke"])
plt.show()
  1. Visualize the Correlation Matrix to show the relationship of Smoking Status and Diagosis of Stroke
In [ ]:
# Use the correlation matrix to comprehend why the model resulted in having poor performance
smoking_corr_matrix = smoking_stroke_ml_df.corr()

plt.figure(figsize = (12,6))
plt.title('Correlation Matrix')
seas.heatmap(smoking_corr_matrix, annot = True, fmt = '.4f', linewidths = 0.5)
plt.show()

Analysis of the Results

It is likely that the reason for the poor performance of the machine learning model is that the correlation coefficient is close to 0 for smoking status and diagnosis of stroke which means that these two variables are not related linearly. Additionally, more features are needed to prevent underfitting as our model cannot generalize the patterns and predict the diagnosis.

**4. Average Glucose Level vs Stroke Diagnosis**

¶
  1. Let us display our data table again.
In [ ]:
print(young_adults)
       Patient ID     Patient Name  Age  Gender  Hypertension  Heart Disease  \
2           32145    Dhanush Balan   26    Male             1              1   
12          66924     Ahana  Lalla   30  Female             0              1   
19          23954     Taran Khatri   25    Male             0              0   
25          36975      Jhanvi Brar   24  Female             0              0   
37          94512       Anvi Salvi   23  Female             0              0   
...           ...              ...  ...     ...           ...            ...   
14972       11839    Chirag Kurian   30    Male             0              1   
14974       30150  Alisha Banerjee   20  Female             0              0   
14981       12323        Pari Ravi   25    Male             0              0   
14983       40381        Sana Goel   18  Female             0              0   
14991       90658      Samaira Raj   26    Male             0              1   

      Marital Status       Work Type Residence Type  Average Glucose Level  \
2            Married    Never Worked          Rural                 189.00   
12          Divorced  Government Job          Urban                 163.15   
19           Married         Private          Urban                  71.38   
25           Married   Self-employed          Urban                  79.89   
37            Single  Government Job          Rural                 164.72   
...              ...             ...            ...                    ...   
14972        Married   Self-employed          Rural                 126.94   
14974       Divorced    Never Worked          Rural                 101.36   
14981         Single         Private          Rural                  77.64   
14983         Single   Self-employed          Urban                  68.26   
14991        Married   Self-employed          Rural                 145.05   

       ...  Stroke History Family History of Stroke  Dietary Habits  \
2      ...               0                      Yes           Paleo   
12     ...               0                      Yes  Non-Vegetarian   
19     ...               0                      Yes     Gluten-Free   
25     ...               1                       No      Vegetarian   
37     ...               1                      Yes     Gluten-Free   
...    ...             ...                      ...             ...   
14972  ...               0                       No     Pescatarian   
14974  ...               0                      Yes     Pescatarian   
14981  ...               0                      Yes           Paleo   
14983  ...               1                       No      Vegetarian   
14991  ...               1                       No     Pescatarian   

      Stress Levels  Blood Pressure Levels Cholesterol Levels  \
2              7.31                 154/97   HDL: 59, LDL: 95   
12             9.19                 114/67   HDL: 80, LDL: 83   
19             0.46                 170/64  HDL: 72, LDL: 174   
25             6.48                 151/65  HDL: 73, LDL: 111   
37             7.86                 148/74   HDL: 30, LDL: 62   
...             ...                    ...                ...   
14972          9.51                 113/65  HDL: 55, LDL: 179   
14974          2.26                 159/94   HDL: 42, LDL: 99   
14981          2.69                 135/66  HDL: 58, LDL: 161   
14983          6.79                 136/66  HDL: 59, LDL: 172   
14991          0.71                180/110   HDL: 33, LDL: 99   

                                                Symptoms  Diagnosis  \
2                                    Seizures, Dizziness     Stroke   
12                             Loss of Balance, Numbness     Stroke   
19                                              Seizures     Stroke   
25     Numbness, Loss of Balance, Numbness, Blurred V...     Stroke   
37                    Blurred Vision, Seizures, Weakness     Stroke   
...                                                  ...        ...   
14972                                                NaN  No Stroke   
14974                           Seizures, Severe Fatigue     Stroke   
14981  Blurred Vision, Headache, Severe Fatigue, Loss...     Stroke   
14983  Severe Fatigue, Severe Fatigue, Headache, Seiz...     Stroke   
14991                               Confusion, Confusion  No Stroke   

      New Stress Levels      Category  
2                   7-8      Diabetic  
12                 9-10      Diabetic  
19                  0-1        Normal  
25                  6-7        Normal  
37                  7-8      Diabetic  
...                 ...           ...  
14972              9-10  Pre-diabetic  
14974               2-3        Normal  
14981               2-3        Normal  
14983               6-7        Normal  
14991               0-1      Diabetic  

[2662 rows x 24 columns]
  1. Here, we will complete some feature engineering.

Looking at the above data, we can see that glucose levels are continuous variables. Continuous variables are those that can take on an infinite number of values within a given range. In this case, glucose levels can vary continuously over a spectrum of values, rather than being limited to specific categories or discrete numbers. Therefore, it is suitable to classify glucose levels as continuous values for the purpose of regression analysis. This classification will allow us to model the relationship between glucose levels and stroke occurrence using logistic regression, which is ideal for analyzing the impact of one continuous variable on a binary dependent variable.

We will convert the provided data into a suitable format for logistic regression analysis by creating a dataset with glucose levels and stroke occurrences. The independent variable (predictor) will be the glucose levels, and the dependent variable (response) will be stroke occurrence, coded as 0 for no stroke and 1 for stroke. This will enable us to analyze how changes in glucose levels influence the likelihood of experiencing a stroke.

In [ ]:
glucose_ml_df = young_adults[["Age", "Average Glucose Level", "Diagnosis"]]

def stroke_conversion (status) :
  if status == "No Stroke" :
    return 0
  elif status == "Stroke" :
    return 1

glucose_ml_df.loc[:,"Diagnosis"] = glucose_ml_df["Diagnosis"].apply(stroke_conversion).astype(int)

print(glucose_ml_df)
       Age  Average Glucose Level Diagnosis
2       26                 189.00         1
12      30                 163.15         1
19      25                  71.38         1
25      24                  79.89         1
37      23                 164.72         1
...    ...                    ...       ...
14972   30                 126.94         0
14974   20                 101.36         1
14981   25                  77.64         1
14983   18                  68.26         1
14991   26                 145.05         0

[2662 rows x 3 columns]
  1. Next, we will now split the data into training and testing sets.
In [ ]:
Xg = glucose_ml_df[['Average Glucose Level', 'Age']]
Yg = glucose_ml_df['Diagnosis'] >= 1

Xg_train, Xg_test, Yg_train, Yg_test = train_test_split(Xg, Yg, test_size= 0.2, random_state = 42)

scaler = StandardScaler()
Xg_train = scaler.fit_transform(Xg_train)
Xg_test = scaler.fit_transform(Xg_test)
  1. After splitting the dataset, we can now train our Logistic Regression Model.
In [ ]:
glucose_ml_model = LogisticRegression()
glucose_ml_model.fit(Xg_train, Yg_train)
Out[ ]:
LogisticRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression()
  1. Now, we can assess the model's performance in relation to our potential risk factor and stroke occurrence.
In [ ]:
pred_g = glucose_ml_model.predict(Xg_test)
accu_g = accuracy_score(Yg_test, pred_g)

print(classification_report(Yg_test, pred_g))
print(f"Accuracy Score: {accu_g}")
              precision    recall  f1-score   support

       False       0.52      0.52      0.52       267
        True       0.52      0.53      0.52       266

    accuracy                           0.52       533
   macro avg       0.52      0.52      0.52       533
weighted avg       0.52      0.52      0.52       533

Accuracy Score: 0.5215759849906192

Analysis of the Results (Precision, Recall, F1 Score, & Accuracy)

Given an accuracy of approximately 0.522 (52.2%), the precision, recall, and F1-score for both classes (False and True) are very similar, indicating that the model performs equally well (or equally poorly) across both classes. Here are the main takeaways from evaluating these results: For precision, the model correctly identifies about 52% of instances for both classes out of all instances it predicts as such. For recall, the model correctly identifies about 52% of instances for False (no stroke) and 53% for True (stroke) out of all actual instances of each class. And for the F1-Score, this harmonic mean suggests a balance between precision and recall, with the model showing consistent performance across both classes.

  1. Finally, to illustrate the model's performance, we will display the correlation matrix.
In [ ]:
glucose_correlation_matrix = glucose_ml_df.corr()

plt.figure(figsize = (12,6))
plt.title('Correlation Matrix for Logisitic Regression Model on Average Glucose Levels vs Stroke Occurrence Based On Age')
seas.heatmap(glucose_correlation_matrix, annot = True, fmt = '.4f', xticklabels=["Age", "Average Glucose Level", "Diagnosis"], yticklabels=["Age", "Average Glucose Level", "Diagnosis"], linewidths = 0.5)
plt.show()

Analysis of the Results (Correlation Matrix)

For the correlation matrix, this is what the results are depicting: 1. For the Age vs. Average Glucose Level, the correlation coefficient is -0.0039. This indicates a very weak negative correlation between age and average glucose level. Essentially, there is almost no linear relationship between these two variables. 2. For Age vs. Diagnosis, the correlation coefficient is -0.0197. This indicates a very weak negative correlation between age and the diagnosis of stroke. This suggests that as age increases, there is a very slight decrease in the likelihood of having a stroke, but the relationship is mostly negligible. 3. For the Average Glucose Level vs. Diagnosis, the correlation coefficient is 0.0022. This indicates a very weak positive correlation between average glucose level and the diagnosis of stroke. This suggests that higher glucose levels are slightly associated with an increased likelihood of having a stroke, but again, the relationship is almost negligible. 4. The diagonal elements all have a value of 1.0000, which indicates a perfect positive correlation of each variable with itself, as expected.

Let us interpret the results of the correlation matrix to reach a conclusion. The values in the correlation matrix are close to zero for all pairings of the variables, indicating that there is no strong linear relationship between age, average glucose level, and the diagnosis of stroke in the dataset being analyzed. The weak correlations suggest that other factors might be more significant in predicting stroke occurrence and that age and average glucose levels alone are not strong predictors in this context. For building a more predictive model, we might need to include additional features or consider interactions between variables.

Interpretation

The logistic regression model appears to have limited biased power based on blood glucose levels alone, as indicated by the similar performance metrics across both classes (False and True). This could imply that blood glucose levels alone might not be sufficiently predictive of stroke occurrence, or additional features or model improvements might be necessary to achieve higher accuracy and differentiation between the classes.

**5. Main Dataset Analysis Using KNN, Decision Tree, Logistic Regression and Random Forest**

¶

Our analysis of the main demonstrated that we could not find a correlation between stroke diagnosis and individual health factors. However, we can use the same data to understand if the individual factors can be considered together to predict the occurence of a stroke. To do so, we can use four ML models to compare the features and generate predictions.

  1. Feature Engineering : encode and clean our data
In [ ]:
# Begin with the young_adults dataset
# We don't need the patient names or IDs for the analysis, so we can remove those
main_dataset = young_adults.copy()
main_dataset = main_dataset.drop(['Patient Name', 'Patient ID'], axis=1)
# In our earlier analysis, we reformatted the Stress Levels into discrete ranges
# For this, we can use the raw stress levels instead
main_dataset = main_dataset.drop(['New Stress Levels'], axis=1)

# We also do not need the diabetes category, so we can drop this as well
main_dataset = main_dataset.drop(['Category'], axis=1)


# To aid the analysis, we can split Cholesterol Levels into HDL and LDL levels
main_dataset[['Cholesterol Levels HDL', 'Cholesterol Levels LDL']] = main_dataset['Cholesterol Levels'].str.split(',', expand=True)
main_dataset = main_dataset.drop(['Cholesterol Levels'], axis=1)

# To make sure the values are treated as numbers, we can remove the "HDL: " and "LDL: " text
# and convert the columns to numerical values
main_dataset['Cholesterol Levels HDL'] = main_dataset['Cholesterol Levels HDL'].apply(lambda x: int(x[4:]))
main_dataset['Cholesterol Levels LDL'] = main_dataset['Cholesterol Levels LDL'].apply(lambda x: int(x[5:]))

# Next, to treat blood pressure levels as numbers, we can split it into the systolic and diastolic numbers
main_dataset[['Blood Pressure Systolic', 'Blood Pressure Diastolic']] = main_dataset['Blood Pressure Levels'].str.split('/', expand=True)
main_dataset = main_dataset.drop(['Blood Pressure Levels'], axis=1)
main_dataset['Blood Pressure Systolic'] = main_dataset['Blood Pressure Systolic'].apply(lambda x: int(x))
main_dataset['Blood Pressure Diastolic'] = main_dataset['Blood Pressure Diastolic'].apply(lambda x: int(x))


# To consider the impact of individual symptoms, we must one-hot encode the symptoms.
# To do so, we can split the Symptoms column into distinct symptoms, then run an encoder
all_symptoms = main_dataset['Symptoms'].str.get_dummies(sep=', ')
main_dataset = pd.concat([main_dataset, all_symptoms], axis=1).drop(columns=['Symptoms'])

# Finally, to feed our data to the random forest model, we must one-hot encode our remaining categorical variables
# Sex of the patient:
genders = main_dataset['Gender'].str.get_dummies()
main_dataset = pd.concat([main_dataset, genders], axis=1).drop(columns=['Gender'])

# Marital Status
marital_statuses = main_dataset['Marital Status'].str.get_dummies()
main_dataset = pd.concat([main_dataset, marital_statuses], axis=1).drop(columns=['Marital Status'])

# Work Type
work_types = main_dataset['Work Type'].str.get_dummies()
main_dataset = pd.concat([main_dataset, work_types], axis=1).drop(columns=['Work Type'])

# Residence Type
residence_types = main_dataset['Residence Type'].str.get_dummies()
main_dataset = pd.concat([main_dataset, residence_types], axis=1).drop(columns=['Residence Type'])

# Smoking Statuses
smoking_statuses = main_dataset['Smoking Status'].str.get_dummies()
main_dataset = pd.concat([main_dataset, smoking_statuses], axis=1).drop(columns=['Smoking Status'])

# Alcohol Intakes
alcohol_intakes = main_dataset['Alcohol Intake'].str.get_dummies()
main_dataset = pd.concat([main_dataset, alcohol_intakes], axis=1).drop(columns=['Alcohol Intake'])

# Dietary Habits
diets = main_dataset['Dietary Habits'].str.get_dummies()
main_dataset = pd.concat([main_dataset, diets], axis=1).drop(columns=['Dietary Habits'])

# Physical Activity
physical_activity_levels = main_dataset['Physical Activity'].str.get_dummies()
main_dataset = pd.concat([main_dataset, physical_activity_levels], axis=1).drop(columns=['Physical Activity'])
# Change column names to be more specific
main_dataset = main_dataset.rename(columns={"High": "High Physical Activity", "Low": "Low Physical Activity", "Moderate":"Moderate Physical Activity"})

# Family History of Stroke - replace Yes with 1 and No with 0
main_dataset['Family History of Stroke'].replace({'Yes': 1, 'No': 0}, inplace=True)

# Final cleaning to treat numerical values as integers/floats
main_dataset['Age'] = main_dataset['Age'].apply(lambda x: int(x))
main_dataset['Average Glucose Level'] = main_dataset['Average Glucose Level'].apply(lambda x: float(x))
main_dataset['Body Mass Index (BMI)'] = main_dataset['Body Mass Index (BMI)'].apply(lambda x: float(x))
main_dataset['Stress Levels'] = main_dataset['Stress Levels'].apply(lambda x: float(x))

#main_dataset.columns
main_dataset
Out[ ]:
Age Hypertension Heart Disease Average Glucose Level Body Mass Index (BMI) Stroke History Family History of Stroke Stress Levels Diagnosis Cholesterol Levels HDL ... Gluten-Free Keto Non-Vegetarian Paleo Pescatarian Vegan Vegetarian High Physical Activity Low Physical Activity Moderate Physical Activity
2 26 1 1 189.00 20.32 0 1 7.31 Stroke 59 ... 0 0 0 1 0 0 0 1 0 0
12 30 0 1 163.15 19.36 0 1 9.19 Stroke 80 ... 0 0 1 0 0 0 0 0 0 1
19 25 0 0 71.38 39.00 0 1 0.46 Stroke 72 ... 1 0 0 0 0 0 0 0 0 1
25 24 0 0 79.89 17.58 1 0 6.48 Stroke 73 ... 0 0 0 0 0 0 1 1 0 0
37 23 0 0 164.72 31.56 1 1 7.86 Stroke 30 ... 1 0 0 0 0 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14972 30 0 1 126.94 36.08 0 0 9.51 No Stroke 55 ... 0 0 0 0 1 0 0 0 0 1
14974 20 0 0 101.36 21.15 0 1 2.26 Stroke 42 ... 0 0 0 0 1 0 0 1 0 0
14981 25 0 0 77.64 23.88 0 1 2.69 Stroke 58 ... 0 0 0 1 0 0 0 0 1 0
14983 18 0 0 68.26 36.48 1 0 6.79 Stroke 59 ... 0 0 0 0 0 0 1 0 0 1
14991 26 0 1 145.05 35.94 1 0 0.71 No Stroke 33 ... 0 0 0 0 1 0 0 0 1 0

2662 rows × 51 columns

  1. Split our training and testing data
In [ ]:
# Now we can begin preparing the ML model
X_main_data = main_dataset.drop('Diagnosis', axis=1)
Y_main_data = main_dataset['Diagnosis']
x_main_train, x_main_test, y_main_train, y_main_test = train_test_split(X_main_data, Y_main_data, test_size=.2, random_state=4)

# Features of the model -> added interaction features
print(x_main_train.columns)
Index(['Age', 'Hypertension', 'Heart Disease', 'Average Glucose Level',
       'Body Mass Index (BMI)', 'Stroke History', 'Family History of Stroke',
       'Stress Levels', 'Cholesterol Levels HDL', 'Cholesterol Levels LDL',
       'Blood Pressure Systolic', 'Blood Pressure Diastolic', 'Blurred Vision',
       'Confusion', 'Difficulty Speaking', 'Dizziness', 'Headache',
       'Loss of Balance', 'Numbness', 'Seizures', 'Severe Fatigue', 'Weakness',
       'Female', 'Male', 'Divorced', 'Married', 'Single', 'Government Job',
       'Never Worked', 'Private', 'Self-employed', 'Rural', 'Urban',
       'Currently Smokes', 'Formerly Smoked', 'Non-smoker', 'Frequent Drinker',
       'Never', 'Rarely', 'Social Drinker', 'Gluten-Free', 'Keto',
       'Non-Vegetarian', 'Paleo', 'Pescatarian', 'Vegan', 'Vegetarian',
       'High Physical Activity', 'Low Physical Activity',
       'Moderate Physical Activity'],
      dtype='object')
  1. Use k-fold cross validation for the purpose of alleviating any overfitting
In [ ]:
# Standardize the features
scaler = StandardScaler()
x_main_train = scaler.fit_transform(x_main_train)
x_main_test = scaler.fit_transform(x_main_test)

# Models
models = {
    'KNN': KNeighborsClassifier(),
    'DecisionTree': DecisionTreeClassifier(),
    'LogisticRegression':LogisticRegression(max_iter=1000000),
    'RandomForest': RandomForestClassifier()
}

# Apply K-fold cross validation and evaluate to alleviate overfitting
skf = StratifiedKFold(n_splits= 5, shuffle=True, random_state=42)

for model_name, model in models.items():
  accuracy = cross_val_score(model, x_main_train, y_main_train, cv = skf)

  # Display the mean and standard deviation of the cross validation accuracy
  print(f"{model_name} \nMean: {accuracy.mean()} \nStandard Deviation: {accuracy.std()}\n")
KNN 
Mean: 0.4894194973764153 
Standard Deviation: 0.023114698595755705

DecisionTree 
Mean: 0.5025893399613366 
Standard Deviation: 0.02539913400661669

LogisticRegression 
Mean: 0.5049400718033693 
Standard Deviation: 0.024654173650678273

RandomForest 
Mean: 0.512448494890914 
Standard Deviation: 0.016121716175018355

  1. Train and Evaluate our model
In [ ]:
# Train each model using the training data
for model_name, model in models.items():
  model.fit(x_main_train, y_main_train)

# Evaluate the performance of each model
for model_name, model in models.items():
  predicted = model.predict(x_main_test)
  print(f"Accuracy of {model_name}: {accuracy_score(y_main_test, predicted)}")
  print(
    f"Classification report for classifier {model_name}:\n"
    f"{classification_report(y_main_test, predicted)}\n"
  )
Accuracy of KNN: 0.46904315196998125
Classification report for classifier KNN:
              precision    recall  f1-score   support

   No Stroke       0.47      0.51      0.49       263
      Stroke       0.47      0.43      0.45       270

    accuracy                           0.47       533
   macro avg       0.47      0.47      0.47       533
weighted avg       0.47      0.47      0.47       533


Accuracy of DecisionTree: 0.4727954971857411
Classification report for classifier DecisionTree:
              precision    recall  f1-score   support

   No Stroke       0.46      0.44      0.45       263
      Stroke       0.48      0.51      0.49       270

    accuracy                           0.47       533
   macro avg       0.47      0.47      0.47       533
weighted avg       0.47      0.47      0.47       533


Accuracy of LogisticRegression: 0.46904315196998125
Classification report for classifier LogisticRegression:
              precision    recall  f1-score   support

   No Stroke       0.46      0.50      0.48       263
      Stroke       0.47      0.44      0.45       270

    accuracy                           0.47       533
   macro avg       0.47      0.47      0.47       533
weighted avg       0.47      0.47      0.47       533


Accuracy of RandomForest: 0.5196998123827392
Classification report for classifier RandomForest:
              precision    recall  f1-score   support

   No Stroke       0.51      0.54      0.53       263
      Stroke       0.53      0.50      0.51       270

    accuracy                           0.52       533
   macro avg       0.52      0.52      0.52       533
weighted avg       0.52      0.52      0.52       533


Analysis of the Results

Although, the Random Forest model performed a little better than the other models, overall, all models have performed poorly on the predictions for the diagnosis of stroke with the accuracy close to 0.5. Additionally, the model performed equally poorly on both precision and recall and the overall f1-scores.

  1. Visualize the evaluation results of the model
In [ ]:
# Visualize the performance using the confusion matrix
maindata_confusion_matrix = confusion_matrix(y_main_test, predicted)

plt.figure(figsize=(12, 6))
plt.xlabel('Prediction')
plt.ylabel('Actual')
plt.title('Confusion Matrix for the prediction of the main dataset on the diagnosis of stroke')
seas.heatmap(maindata_confusion_matrix, annot=True, fmt="d", xticklabels=["No Stroke", "Stroke"], yticklabels=["No Stroke", "Stroke"])
plt.show()

**IV. Insights and Conclusions**

¶

**1. Stress Levels vs Stroke Diagnosis**

¶

Based on the linear regression analysis, there is a very weak negative correlation between stress levels and stroke incidence in young adults. The R-squared value is very low, implying that stress levels do not significantly explain the variation in stroke incidence based on the given data.

**2. Alcohol Intake vs Stroke Diagnosis**

¶

The DecisionTree model performed poorly when given just the alcohol consumption levels to analyze. This demonstrates that alcohol consumption alone is not enough on its own to predict a stroke. As such, in order to improve an ML model's ability to predict the occurence of a stroke, more features must be considered and more training must be conducted. This is in alignment with the statistical analysis performed on the dataset, which was unable to find a statistical relationship between alcohol levels and occurence of a stroke.

**3. Smoking Status vs Stroke Diagnosis**

¶

The RandomForest model did not perform well with the prediction of stroke with the smoking status. Following further examination, the correlation coefficient of these two variables was close to 0, indicating that there is no linear relationship between them. Additionally, more features are needed to prevent underfitting as our model is too simple and cannot generalize the patterns and predict the diagnosis. As a result, the smoking status solely cannot predict the diagnosis of stroke.

**4. Average Glucose Level vs Stroke Diagnosis**

¶

Based on the logistic regression analysis, the model demonstrates moderate performance in predicting stroke occurrence based on blood glucose levels. The precision, recall, and F1-score metrics consistently hover around 0.52 for both True and False classes, suggesting the model's ability to distinguish between instances of stroke and no stroke is limited. The balanced performance metrics indicate that blood glucose levels alone may not be sufficiently predictive of stroke occurrence.

**5. Main Dataset Analysis Using KNN, Decision Tree, Logistic Regression and Random Forest**

¶

Overall, the KNN, Decision Tree, Logistic Regression and Random Forest classifier shows a balanced performance for predictions, with an accuracy slightly below 0.5. As such, we can conclude that a new approach may be needed to generate more accurate predictions. This could mean including more features, more precise features (such as considering the exact type of physical activity, rather than the scale of Low/Moderate/Medium Activity), or, conversely, feature pruning to determine the most important features to consider and discarding the rest.

**6. Conclusion and Future Directions**

¶

In conclusion, our overall machine learning models have performed poorly with low accuracy in the prediction of the diagnosis of stroke. The evaluation results of our models suggest that our approach of focusing on four significant individual features does not adequately reflect the complex relationships between various lifestyle factors that influence stroke risk in young adults. This poor performance also suggests two possibilities. Firstly, these factors may not be the main causes of increased stroke risk in young people, or another factor, such as family history of stroke, might be more significant. Secondly, we need a more comprehensive approach, considering more features, to better understand and predict stroke risk in young adults.

Although our study didn't provide definitive answers, it has raised important questions and highlighted areas needing further investigation. To address the shortcomings of our models and for further investigation into the correlation of lifestyle factors with the risk of stroke in young adults, we suggest introducing new features that represent the interactions of each feature. Interaction features can better capture the complexity of diverse lifestyles. Engineering more features as such will allow the machine learning models to learn the combined effects of these individual features, which will be more effective in predicting the diagnosis of stroke.

**V. References**

¶

Board on Children, Youth, and Families; Institute of Medicine; National Research Council. Improving the Health, Safety, and Well-Being of Young Adults: Workshop Summary. Washington (DC): National Academies Press (US); 2013 Sep 27. Available from: https://www.ncbi.nlm.nih.gov/books/NBK202207/ doi: 10.17226/18340

Cleveland Clinic. (n.d.). A1C. Retrieved June 15, 2024, from https://my.clevelandclinic.org/health/diagnostics/9731-a1c