% Basic Statistical Concepts
% R Bootcamp HTML Slides
% Jared Knowles

Introduction

We will review the following statistical concepts here through the lens of R:

Describing Data

- Data are themselves abstractions of real world concepts we care about

Spread of the Data

library(ggplot2)  # provides qplot() and the diamonds and mpg data sets
qplot(factor(cyl), data = mtcars) + labs(x = "Cylinders", title = "Car models by Cylinder Count")

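The bar chart above simply counts how many car models fall into each cylinder class. As a minimal sketch, the same counts can be checked numerically with base R's `table()`. The variable names `cyl_counts` and `cyl_props` are only illustrative and do not come from the slides.

```r
# mtcars ships with base R (datasets package), so no extra packages are needed
cyl_counts <- table(mtcars$cyl)        # number of car models per cylinder count
cyl_props <- prop.table(cyl_counts)    # share of models in each cylinder group

print(cyl_counts)
print(round(cyl_props, 2))
```

Each bar's height corresponds to one entry of `cyl_counts`.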

Let's try another

data(diamonds)
qplot(factor(cut), data = diamonds) + labs(x = "Cut", title = "Diamonds by Cut Quality")


How about this one?

qplot(factor(color), data = diamonds) + labs(x = "Color", title = "Diamonds by Color and Clarity") + 
    facet_wrap(~clarity, nrow = 2)


Still more

qplot(carat, price, data = diamonds, color = color) + geom_smooth(aes(group = 1))


Graphical Depictions of Data

Levels of Measurement

Quiz 1

- Car color is what level of measurement?

Levels of measurement matter
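One way to see why the level of measurement matters is to look at how R stores the data. The sketch below uses made-up vectors (`fuel` and `shirt_size` are hypothetical and not part of the slides): an unordered factor for a nominal variable and an ordered factor for an ordinal one.

```r
# Nominal: categories with no inherent order
fuel <- factor(c("diesel", "petrol", "petrol", "electric", "petrol"))

# Ordinal: categories with a meaningful order, declared through the levels
shirt_size <- factor(c("small", "large", "medium", "small", "large"),
                     levels = c("small", "medium", "large"), ordered = TRUE)

is.ordered(fuel)         # FALSE
is.ordered(shirt_size)   # TRUE

# Counting categories works at every level of measurement
table(fuel)

# Order-based operations are defined only for the ordered factor
shirt_size[1] < shirt_size[2]   # TRUE, because "small" comes before "large"
max(shirt_size)                 # large

# Asking for the "largest" nominal category is an error, caught here
nominal_max <- tryCatch(max(fuel), error = function(e) "not meaningful")
nominal_max                     # "not meaningful"
```

R refuses the operation that the level of measurement does not support, which is a useful safeguard when variables are encoded correctly.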

Describing Data with Numbers

Level of measurement   Statistics
Nominal                mode, chi-squared
Ordinal                median, percentile, plus above
Interval               mean, standard deviation, correlation, ANOVA, plus above
Ratio                  geometric mean, harmonic mean, coefficient of variation, logarithms, plus above
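As a rough sketch of how several of the statistics in the table can be computed in base R, the example below uses `mtcars`. Treating `cyl` as a nominal grouping and `mpg` as a measurement with a true zero are assumptions made for illustration, and all variable names are invented here.

```r
# Nominal data: mode and a chi-squared goodness-of-fit test
cyl_counts <- table(mtcars$cyl)
cyl_mode <- names(cyl_counts)[which.max(cyl_counts)]   # most frequent category
cyl_test <- chisq.test(cyl_counts)   # null hypothesis: all categories equally likely

# Median and percentiles only need the values to be ordered
x <- mtcars$mpg
mpg_median <- median(x)
mpg_quartiles <- quantile(x, probs = c(0.25, 0.75))

# Mean, standard deviation, and correlation need meaningful differences
arith_mean <- mean(x)
mpg_sd <- sd(x)
mpg_wt_cor <- cor(mtcars$mpg, mtcars$wt)

# Statistics from the last row of the table need positive values with a true zero
geo_mean <- exp(mean(log(x)))   # geometric mean
harm_mean <- 1 / mean(1 / x)    # harmonic mean
coef_var <- sd(x) / mean(x)     # coefficient of variation

print(c(harmonic = harm_mean, geometric = geo_mean, arithmetic = arith_mean))
```

For positive data that are not all equal, the harmonic mean is smaller than the geometric mean, which in turn is smaller than the arithmetic mean.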

Let's talk about these statistics

Measures of Central Tendency

# Density of highway mileage with the median, mean, and mode marked
qplot(hwy, data = mpg, geom = "density") +
    geom_vline(xintercept = median(mpg$hwy), color = "blue", size = 1.1) +
    geom_vline(xintercept = mean(mpg$hwy), color = "gold", size = 1.1) +
    # 26 is the most frequent value of hwy, i.e. the mode
    geom_vline(xintercept = 26, color = "orange", size = 1.1) +
    # annotate() draws each label once rather than once per row of mpg
    annotate("text", x = median(mpg$hwy) + 1.5, y = 0.08, label = "Median", size = 4.5) +
    annotate("text", x = mean(mpg$hwy) - 1.5, y = 0.06, label = "Mean", size = 4.5) +
    annotate("text", x = 26 + 1.5, y = 0.05, label = "Mode", size = 4.5)


library(xtable)
print(xtable(table(mpg$hwy)), type = "html")
hwy   count
12     5
14     2
15    10
16     7
17    31
18    10
19    13
20    11
21     2
22     7
23     7
24    13
25    15
26    32
27    14
28     7
29    22
30     4
31     7
32     4
33     2
34     1
35     2
36     2
37     1
41     1
44     2
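The frequency table above is enough to recover all three measures of central tendency. The sketch below rebuilds the highway mileage values from the table's counts so that it runs without ggplot2. Base R has no built-in function for the statistical mode, so `stat_mode()` is a small helper written for this illustration.

```r
# Highway mileage values and their counts, copied from the frequency table
hwy_values <- c(12, 14:37, 41, 44)
hwy_counts <- c(5, 2, 10, 7, 31, 10, 13, 11, 2, 7, 7, 13, 15, 32, 14,
                7, 22, 4, 7, 4, 2, 1, 2, 2, 1, 1, 2)

# Expand the table back into one value per car model (234 in total)
hwy <- rep(hwy_values, times = hwy_counts)

# Hypothetical helper: most frequent value (the first one if there are ties)
stat_mode <- function(x) {
    counts <- table(x)
    as.numeric(names(counts)[which.max(counts)])
}

mean(hwy)        # about 23.44
median(hwy)      # 24
stat_mode(hwy)   # 26
```

The value 17 occurs 31 times, only one fewer than the mode of 26, so the distribution is close to having two peaks.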

Session Info

It is good practice to include the session info so that others can reproduce your results. For example, this document was produced with knitr version 0.8. Here is my session info:

print(sessionInfo(), locale = FALSE)
## R version 2.15.1 (2012-06-22)
## Platform: x86_64-pc-mingw32/x64 (64-bit)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] xtable_1.7-0    mgcv_1.7-21     ggplot2_0.9.2.1 knitr_0.8      
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.1-1   dichromat_1.2-4    digest_0.5.2      
##  [4] evaluate_0.4.2     formatR_0.6        grid_2.15.1       
##  [7] gtable_0.1.1       labeling_0.1       lattice_0.20-10   
## [10] MASS_7.3-21        Matrix_1.0-9       memoise_0.1       
## [13] munsell_0.4        nlme_3.1-104       plyr_1.7.1        
## [16] proto_0.3-9.2      RColorBrewer_1.0-5 reshape2_1.2.1    
## [19] scales_0.2.2       stringr_0.6.1      tools_2.15.1
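If you want to keep this record alongside your results, one possible approach is to write the printed session info to a text file. This is only a sketch: the use of `tempfile()` stands in for whatever path you would actually choose, and the variable names are illustrative.

```r
# Capture the printed session info as a character vector, one element per line
info_lines <- capture.output(print(sessionInfo(), locale = FALSE))

# Write it to a file; tempfile() is a placeholder for a real path
info_file <- tempfile(fileext = ".txt")
writeLines(info_lines, info_file)

# Read it back to confirm that the record was saved
saved_lines <- readLines(info_file)
length(saved_lines)
```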

Attribution and License

Public Domain Mark
This work (R Tutorial for Education, by Jared E. Knowles), in service of the Wisconsin Department of Public Instruction, is free of known copyright restrictions.