-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path707_Proposal.Rmd
339 lines (267 loc) · 11.3 KB
/
707_Proposal.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
---
title: "Project Code"
author: "Costa Stavrianidis"
date: "2022-10-13"
output: html_document
---
```{r setup, include=FALSE}
library(tidyverse)
library(reshape2)
```
# Reading in the dataset
```{r, warning=FALSE}
survey <- read_csv("C:/Users/kelvi/Downloads/archive/2015.csv", show_col_types = FALSE)
```
# Cleaning the data for variables of interest
```{r}
survey1 <- survey %>% select(`_MICHD`, ADDEPEV2, MENTHLTH, EMTSUPRT, ADDOWN, `_RFHLTH`, PHYSHLTH,
`_HCVU651`, MEDCOST, `_RFHYPE5`, `_RFCHOL`, CVDSTRK3, DIABETE3, SEX,
`_EDUCAG`, INCOME2, `_BMI5`, `_SMOKER3`, AVEDRNK2, `_TOTINDA`, DRADVISE,
RDUCHART, SCNTMEL1, `_AGE65YR`, `_FRTLT1`, `_VEGLT1`) %>%
rename(Gen_Health = `_RFHLTH`, Phys_Health = PHYSHLTH, Ment_Health = MENTHLTH, Med_Cost = MEDCOST,
Stroke = CVDSTRK3, Diabetes = DIABETE3, Sex = SEX, Income = INCOME2, Avg_Drinks = AVEDRNK2,
Sodium_Intake = DRADVISE, Aspirin = RDUCHART, Meal_Money = SCNTMEL1, Health_Cov = `_HCVU651`,
BP = `_RFHYPE5`, Chol = `_RFCHOL`, Heart_Disease = `_MICHD`, Age_65 = `_AGE65YR`,
BMI = `_BMI5`, Education = `_EDUCAG`, Smoking = `_SMOKER3`, Fruit = `_FRTLT1`,
Veggies = `_VEGLT1`, Exercise = `_TOTINDA`, Depression = ADDEPEV2, Support = EMTSUPRT,
Down = ADDOWN) %>%
mutate(Heart_Disease = recode(Heart_Disease, '2' = 0),
Depression = recode(Depression, '2' = 0, '7' = as.numeric(NA), '9' = as.numeric(NA)),
Ment_Health = recode(Ment_Health, '88' = 0, '77' = as.numeric(NA), '99' = as.numeric(NA)),
Support = recode(Support, '5' = 1, '4' = 2, '2' = 4, '1' = 5, '7' = as.numeric(NA),
'9' = as.numeric(NA)),
Down = recode(Down, '88' = 0, '77' = as.numeric(NA), '99' = as.numeric(NA)),
Gen_Health = recode(Gen_Health, '2' = 0, '3' = as.numeric(NA)),
Phys_Health = recode(Phys_Health, '88' = 0, '77' = as.numeric(NA), '99' = as.numeric(NA)),
Health_Cov = recode(Health_Cov, '2' = 0, '9' = as.numeric(NA)),
Med_Cost = recode(Med_Cost, '2' = 0, '7' = as.numeric(NA), '9' = as.numeric(NA)),
BP = recode(BP, '1' = 0, '2' = 1, '9' = as.numeric(NA)),
Chol = recode(Chol, '1' = 0, '2' = 1, '9' = as.numeric(NA)),
Stroke = recode(Stroke, '1' = 0, '2' = 1, '7' = as.numeric(NA), '9' = as.numeric(NA)),
Diabetes = recode(Diabetes, '2' = 1, '3' = 1, '4' = 2, '1' = 3, '7' = as.numeric(NA),
'9' = as.numeric(NA)),
Education = recode(Education, '9' = as.numeric(NA)),
Income = recode(Income, '77' = as.numeric(NA), '99' = as.numeric(NA)),
BMI = BMI/100,
Smoking = recode(Smoking, '4' = 1, '3' = 2, '2' = 3, '1' = 4, '9' = as.numeric(NA)),
Avg_Drinks = recode(Avg_Drinks, '77' = as.numeric(NA), '99' = as.numeric(NA)),
Exercise = recode(Exercise, '2' = 0, '9' = as.numeric(NA)),
Sodium_Intake = recode(Sodium_Intake, '2' = 0, '7' = as.numeric(NA), '9' = as.numeric(NA)),
Aspirin = recode(Aspirin, '2' = 0, '7' = as.numeric(NA), '9' = as.numeric(NA)),
Meal_Money = recode(Meal_Money, '5' = 1, '4' = 2, '2' = 4, '1' = 5, '7' = as.numeric(NA),
'8' = as.numeric(NA), '9' = as.numeric(NA)),
Age_65 = recode(Age_65, '3' = as.numeric(NA)),
Fruit = recode(Fruit, '2' = 0, '9' = as.numeric(NA)),
Veggies = recode(Veggies, '2' = 0, '9' = as.numeric(NA)))
glimpse(survey1)
```
# Variable descriptions
Heart_Disease: Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI)
0 - did not report
1 - reported
NA - not asked or missing
Depression: (Ever told) you that you have a depressive disorder, including depression, major depression, dysthymia, or minor depression?
0 - no
1 - yes
NA - don't know, refused
Ment_Health: Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?
0:30 - # of days
NA - don't know, refused
Support: how often do you get the social and emotional support you need?
1 - never
2 - rarely
3 - sometimes
4 - usually
5 - always
NA - don't know, refused, missing
Down: Over the last 2 weeks, how many days have you felt down, depressed or hopeless?
0:14 - # of days
NA - don't know, refused, missing
Gen_Health: Adults with good or better general health
0 - fair or poor health
1 - good or better health
NA - don't know, not sure or refused/missing
Phys_Health: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?
0:30 - # days
NA - don't know, refused, missing
Health_Cov: Respondents aged 18-64 who have any form of health care coverage
0 - do not have
1 - have
NA - don't know, refused, missing
Med_Cost: Was there a time in the past 12 months when you needed to see a doctor but could not because of cost?
0 - no
1 - yes
NA - don't know, refused, missing
BP: Adults who have been told they have high blood pressure by a doctor, nurse, or other health professional
0 - no
1 - yes
NA - don't know, refused, missing
Chol: Adults who have had their cholesterol checked and have been told by a doctor, nurse, or other health professional that it was high
0 - no
1 - yes
NA - don't know, refused, missing
Stroke: (Ever told) you had a stroke.
0 - no
1 - yes
NA - don't know, refused
Diabetes: (Ever told) you have diabetes (If "Yes" and respondent is female, ask "Was this only when you were pregnant?". If Respondent says pre-diabetes or borderline diabetes, use response code 4.)
1 - no or yes, but female told only during pregnancy
2 - no, pre-diabetes or borderline
3 - yes
NA - don't know, refused, missing
Sex: Indicate sex of respondent.
1 - male
2 - female
Education: Level of education completed
1 - did not graduate highschool
2 - graduated highschool
3 - attended college or technical school
4 - graduated from college or technical school
NA - don't know, refused, missing
Income: Is your annual household income from all sources: (If respondent refuses at any income level, code "Refused.")
1 - <$10,000
2 - $10,000:$14,999
3 - $15,000:$19,999
4 - $20,000:$24,999
5 - $25,000:$34,999
6 - $35,000:$49,999
7 - $50,000:$74,999
8 - >=$75,000
NA - don't know, refused, missing
BMI: Body Mass Index (BMI)
1:99.99 - bmi
NA - don't know, refused, missing
Smoking: Four-level smoker status: Everyday smoker, Someday smoker, Former smoker, Non-smoker
1 - never smoked
2 - former smoker
3 - now smokes some days
4 - now smokes every day
NA - don't know, refused, missing
Avg_Drinks: One drink is equivalent to a 12-ounce beer, a 5-ounce glass of wine, or a drink with one shot of liquor. During the past 30 days, on the days when you drank, about how many drinks did you drink on the average? (A 40 ounce beer would count as 3 drinks, or a cocktail drink with 2 shots would count as 2 drinks.)
1:76 - # of drinks
NA - don't know, refused, missing
Exercise: Adults who reported doing physical activity or exercise during the past 30 days other than their regular job
0 - did not have physical activity or exercise
1 - had physical activity or exercise
NA - don't know, refused, missing
Sodium_Intake: Has a doctor or other health professional ever advised you to reduce sodium or salt intake?
0 - no
1 - yes
NA - don't know, refused, missing
Aspirin: Do you take aspirin to reduce the chance of a heart attack?
0 - no
1 - yes
NA - don't know, refused, missing
Meal_Money: How often in the past 12 months would you say you were worried or stressed about having enough money to buy nutritious meals?
1 - never
2 - rarely
3 - sometimes
4 - usually
5 - always
NA - don't know, not applicable, refused, missing
Age_65: Two-level age category
1 - age 18:64
2 - age 65 or older
NA - don't know, refused, missing
Fruit: Consume Fruit 1 or more times per day
0 - less than one time per day
1 - one or more times per day
NA - don't know, refused, missing
Veggies: Consume Vegetables 1 or more times per day
0 - less than one time per day
1 - one or more times per day
NA - don't know, refused, missing
# Exploratory data analysis
```{r}
colSums(is.na(survey1))/length(survey1$Heart_Disease)
# Many features have a large amount of missing values
survey1 <- survey1 %>% select(-c(Support, Down, Sodium_Intake, Aspirin, Meal_Money))
```
```{r}
# creating correlation matrix
corr_mat1 <- round(cor(survey1, use = "pairwise.complete.obs"),2)
# reduce the size of correlation matrix
melted_corr_mat1 <- melt(corr_mat1)
head(melted_corr_mat1)
# plotting the correlation heatmap
library(ggplot2)
covarian<- ggplot(data = melted_corr_mat1, aes(x=Var1, y=Var2,
fill=value)) +
geom_tile() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
covarian
ggheatmap <- ggplot(melted_corr_mat1, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+ # minimal theme
theme(axis.text.x = element_text(angle = 90, vjust = 1,
size = 9, hjust = 1))+
coord_fixed() + ggtitle("Figure 1: Heatmap of Covariance Matrix") + xlab("Variables") + ylab("Variables")
ggheatmap
```
Explore dataset
```{r}
head(survey1,10)
```
```{r}
row.names(survey1)
```
```{r}
names(survey1)
```
```{r}
apply(survey1, 2, mean, na.rm = TRUE)
```
Note: rows = 1, columns = 2
```{r}
apply(survey1, 2, var, na.rm = TRUE)
```
2. It is important to standardize the variables to have mean zero and standard deviation one before performing PCA.
```{r}
pr.out=prcomp(survey1, scale=TRUE, na.action = na.fail)
```
prcomp() function centers the variables to have mean zero. By using the option scale=TRUE, we scale the variables to have standard deviation one.
```{r}
names(pr.out)
```
```{r}
pr.out$center
```
```{r}
pr.out$scale^2
```
3. The rotation matrix provides the principal component loadings; each col- umn of pr.out$rotation contains the corresponding principal component loading vector.
```{r}
pr.out$rotation
```
Why are there 4 principal components?
4. The 50 × 4 matrix x has as its columns the principal component score vectors. That is, the kth column is the kth principal component score vector.
```{r}
dim(pr.out$x)
```
```{r}
pr.out$x
```
4. Plot the first 2 principal components.
```{r}
biplot(pr.out, scale=0)
```
score 0 is to ensure that the arrows are scaled to represent the loadings.
5. The prcomp() function also outputs the standard deviation of each prin- cipal component. For instance, on the survey1 data set, we can access these standard deviations as follows:
```{r}
pr.var=pr.out$sdev^2
pr.var
```
Proportion of variance explained.
```{r}
pve=pr.var/sum(pr.var)
```
# Plots
```{r}
plot(pve, xlab="Principal Component", ylab="Proportion of Variance Explained ", ylim=c(0,1),type='b')
```
```{r}
plot(cumsum(pve), xlab="Principal Component ", ylab=" Cumulative Proportion of Variance Explained ", ylim=c(0,1), type='b')
```
```