-
Notifications
You must be signed in to change notification settings - Fork 9
/
Copy pathmodels-are-data.Rmd
331 lines (243 loc) · 10.5 KB
/
models-are-data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
---
title: 'Models are data too'
---
```{r, include=FALSE}
library(tidyverse)
library(lmerTest)
library(apastats)
library(broom)
library(pander)
```
# Models are data {#models-are-data-too}
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/WPc-VEqBPHI?rel=0&showinfo=0" frameborder="0" allowfullscreen></iframe>
You might remember the episode of the Simpsons where Homer designs a car for
'the average man'. It doesn't end well. Traditional statistics packages are a
bit like Homer's car. They try to work for everyone, but in the process become
bloated and difficult to use.
This is particularly true of the _output_ of software like SPSS, which by
default produces multiple pages of 'results' for even relatively simple
statistical models. However, the problem is not just that SPSS is incredibly
verbose.
The real issue is that SPSS views the results of a model as the _end_ of a
process, rather than the beginning. The model SPSS has is something like:
1. Collect data
2. Choose analysis from GUI
3. Select relevant figures from pages of output and publish.
This is a problem because in real life it just doesn't work that way. In reality
you will want to do things like:
- Run the same model for different outcomes
- Re-run similar models as part of a sensitivity analysis
- Compare different models and produce summaries of results from multiple
models
All of this requires an _iterative process_, in which you may want to compare
and visualise the results of multiple models. In a traditional GUI, this quickly
becomes overwhelming.
However, if we treat modelling as a process which _both consumes and produces
data_, R provides many helpful tools.
This is an important insight: in R, the results of analyses are not the end
point — instead _model results are themselves data_, to be processed,
visualised, and compared.
### Storing models in variables {-}
This may seem obvious (and we have seen many examples in the sections above),
but because R variables can contain anything, we can use them to store the
results of our models.
This is important, because it means we can keep track of different versions of
the models we run, and compare them.
### Extracting results from models {- #extract-results-from-models}
One of the nice things about R is that the `summary()` function will almost
always provide a concise output of whatever model you send it, showing the key
features of an model you have run.
However, this text output isn't suitable for publication, and can even be too
verbose for communicating with colleagues. Often, when communicating with
others, you want to focus in on the important details from analyses and to do
this you need to extract results from your models.
Thankfully, there is almost always a method to extract results to a
[`dataframe`](#datasets-dataframes). For example, if we run a linear model:
```{r}
model.fit <- lm(mpg ~ wt + disp, data=mtcars)
summary(model.fit)
```
We can extract the parameter table from this model by saving the `summary()` of
it, and then using the `$` operator to access the `coefficients` table (actually
a matrix), which is stored within the summary object.
```{r}
model.fit.summary <- summary(model.fit)
model.fit.summary$coefficients
```
### 'Poking around' with `$` and `@` {-}
It's a useful trick to learn how to 'poke around' inside R objects using the `$`
and `@` operators (if you want the gory details
[see this guide](http://adv-r.had.co.nz/OO-essentials.html)).
In the video below, I use RStudio's autocomplete feature to find results buried
within a `lm` object:
<iframe src="https://player.vimeo.com/video/225529842" width="862" height="892" frameborder="0"></iframe>
For example, we could write the follwing to extract a table of coefficients,
test statistics and _p_ values from an `lm()` object (this is shown in the
video:
```{r}
model.fit.summary <- summary(model.fit)
model.fit.summary$coefficients %>%
as_data_frame()
```
### Save time: use a `broom` {- #broom}
The [`broom::` library](http://varianceexplained.org/r/broom-intro/) is worth
learning because it makes it really easy to turn model results into dataframes,
which is almost always what we want when working with data.
It takes a slightly different approach than simply poking around with \$ and @,
because it providing general methods to 'clean up' the output of many older R
functions.
For example, the `lm()` or `car::Anova` functions display results in the
console, but don't make it easy to extract results as a dataframe. `broom::`
provides a consistent way of extracting the key numbers from most R objects.
Let's say we have a regression model:
```{r}
(model.1 <- lm(mpg ~ factor(cyl) + wt + disp, data=mtcars))
```
We can extract model fit statistics --- that is, attributes of the model as a
whole --- with `glance()`. This produces a dataframe:
```{r}
glance(model.1)
```
If we want to extract information about the model coefficients we can use
`tidy`:
```{r}
tidy(model.1, conf.int = T) %>%
pander
```
Which can then be plotted easily (adding the `conf.int=T` includes 95%
confidence intervals for each parameter, which we can pass to `ggplot`):
```{r}
tidy(model.1, conf.int = T) %>%
ggplot(aes(term, estimate, ymin=conf.low, ymax=conf.high)) +
geom_pointrange() +
geom_hline(yintercept = 0)
```
Finally, we can use the `augment` function to get information on individual rows
in the modelled data: namely the fitted and residual values, plus common
diagnostic metrics like Cooks distances:
```{r}
augment(model.1) %>%
head() %>%
pander(split.tables=Inf)
```
Again these can be plotted:
```{r}
augment(model.1) %>%
ggplot(aes(x=.fitted, y=.resid)) +
geom_point() +
geom_smooth()
```
Because `broom` always returns a dataframe with a consistent set of column names
we can also combine model results into tables for comparison. In this plot we
see what happens to the regression coefficients in model 1 when we add `disp`,
`carb` and `drat` in model 2. We plot the coefficients side by side for ease of
comparison, and can see that the estimates for cyl1 and wt both shrink slightly
with the addition of these variables:
```{r}
# run a new model with more predictors
(model.2 <- lm(mpg ~ factor(cyl) + wt + disp + carb + drat, data=mtcars))
# make a single dataframe from both models
# addin a new `model` column with mutate to
# identify which coefficient came from which model
combined.results <- bind_rows(
tidy(model.1, conf.int = T) %>% mutate(model="1"),
tidy(model.2, conf.int = T) %>% mutate(model="2"))
```
```{r}
combined.results %>%
# remove the intercept to make plot scale more sane
filter(term != "(Intercept)") %>%
ggplot(aes(term, estimate, ymin=conf.low, ymax=conf.high, color=model)) +
geom_pointrange(position=position_dodge(width=.1)) +
geom_hline(yintercept = 0)
```
## 'Processing' results {- #process-model-results}
XXX TODO e.g.:
- Calculate VPC/ICC from an lmer models using
`model %>% summary %>% as_data_frame()$varcor`
## Printing tables {- #output-tables}
XXX TODO
- Pander and pandoc
- Dealing with rounding and string formatting issues
- Missing values/unequal length columns
- Point out that arbitrarily complex tables often not worth the candle, longer
easier than wider etc.
## APA formatting for free {- #apa-output}
A neat trick to avoid
[fat finger errors](https://en.wikipedia.org/wiki/Fat-finger_error) is to use
functions to automatically display results in APA format. Unfortunately, there
isn't a single package which works with all types of model, but it's not too
hard switch between them.
### Chi^2^ {-}
For basic stats the `apa::` package is simple to use. Below we use the
`apa::chisq_apa()` function to properly format the results of our chi^2^ test
([see the full chi^2^ example]#crosstabs)):
```{r, include=F}
lego.duplo.df <- readRDS("data/lego.RDS")
lego.table <- with(lego.duplo.df, table(age, prefers))
```
```{r}
lego.test <- chisq.test(lego.table)
lego.test
```
And we can format in APA like so:
```{r}
apa::apa(lego.test, print_n=T)
```
or using `apastats::` we also get Cramer's V, a measure of effect size:
```{r}
apastats::describe.chi(lego.table, addN=T)
```
#### Inserting results into your text {#inline-apa-format}
If you are using RMarkdown, you can drop formatted results into your text
without copying and pasting. Just type the following and the chi^2^ test result
is automatically inserted inline in your text:
![Example of inline call to R functions within the text. This is shown as an image, because it would otherwise be hidden in this output (because the function is evaluated when we knit the document)](media/inline-r-example.png)
[Age (4 vs 6 years) was significantly associated with preference for duplo v.s.
lego, `r apastats::describe.chi(lego.table, addN=T)`]{.apa-example}
### T-test {-}
```{r}
# run the t test
cars.test <- t.test(wt~am,data=mtcars, var.equal=T)
cars.test
```
And then we can format as APA
```{r}
apa::apa(cars.test)
```
[American cars were significantly heavier than foreign cars, mean
difference=`r round((cars.test$estimate[1]-cars.test$estimate[2])*1000)`lbs;
`r apa::apa(cars.test)`]{.apa-example}
### Anova {-}
```{r}
mpg.anova <- car::Anova(lm(mpg~am*cyl, data=mtcars))
# extract and format main effect
apastats::describe.Anova(mpg.anova, term="am")
# and the interaction
apastats::describe.Anova(mpg.anova, term="am:cyl")
```
[There was no interaction between location of manufacture and number of
cylinders, `r apastats::describe.Anova(mpg.anova, term="am:cyl")`, but there was
a main effect of location of manufacture,
`r apastats::describe.Anova(mpg.anova, term="am:cyl")`, such that US-made cars
had significantly higher fuel consumption than European or Japanese brands (see
[Figure X or Table X])]{.apa-example}
<!--
TODO add formatting of effect size estimates here
-->
### Multilevel models {-}
If you have loaded the `lmerTest` package `apastats` can output either
coefficients for single parameters, or F tests:
```{r}
sleep.model <- lmer(Reaction~factor(Days)+(1|Subject), data=lme4::sleepstudy)
#a single coefficient (this is a contrast from the reference category)
apastats::describe.glm(sleep.model, term="factor(Days)1")
# or describe the F test for the overall effect of Days
apastats::describe.lmtaov(anova(sleep.model), term='factor(Days)')
```
```{r, echo=F, include=F}
rtchanges <- apastats::describe.lmtaov(anova(sleep.model), term='factor(Days)')
```
[There were significant differences in reaction times across the 10 days of the
study, `r rtchanges` such that reaction latencies tended to increase in duration
(see [Figure X]).]{.apa-example}