% Basic Statistical Concepts
% R Bootcamp HTML Slides
% Jared Knowles

# Introduction

We will review the following statistical concepts here through the lens of R:

• Statistical distributions
• Attributes of distributions
• Samples
• Standard deviations and standard errors
• Statistical models
• Univariate regression
• Multivariate regression
• Causality

# Describing Data

• At its most basic level, statistics is about summarizing and understanding data

# Spread of the Data

• Data spread describes how scattered a data set is
• One type of data, categorical data, describes groups
qplot(factor(cyl), data = mtcars) + labs(x = "cylinder", title = "Car models by Cylinder Count")


• What can we learn here?
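
The bar chart above is a picture of counts per group, so the same information can be read off as numbers. A minimal sketch using only base R and the built-in `mtcars` data (the variable name `cyl_counts` is chosen here for illustration):

```r
# Count how many car models fall into each cylinder group
cyl_counts <- table(mtcars$cyl)
cyl_counts

# Express the same counts as shares of all 32 models
cyl_shares <- prop.table(cyl_counts)
round(cyl_shares, 2)
```

Each bar in the chart corresponds to one entry of `cyl_counts`; the shares make it easier to compare groups when datasets differ in size.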

# Let's try another

• What can we learn from this chart type about the data?
data(diamonds)
qplot(factor(cut), data = diamonds) + labs(x = "Cut", title = "Diamonds by Cut Quality")


• What else might we want to learn?

qplot(factor(color), data = diamonds) + labs(x = "Color", title = "Diamonds by Color and Clarity") +
facet_wrap(~clarity, nrow = 2)


# Still more

• With diamonds we immediately want to look at price
qplot(carat, price, data = diamonds, color = color) + geom_smooth(aes(group = 1))


• What do you see?
• Outliers?
• Data modes?
• Clusters?

# Graphical Depictions of Data

• Graphical depictions are ways to show data visually
• Graphical displays are driven by the concept of dimensions
• One dimension–a single category
• Two dimensions–two categories

# Levels of Measurement

• Any given dimension may be measured at different levels of measurement
• Nominal: unordered categories of data
• Ordinal: ordered categories of data, relative size and degree of difference between categories is unknown
• Interval: ordered categories of data, fixed width, like discrete temperature scales
• Continuous (ratio): a measurement scale in a continuous space with a meaningful zero–physical measurements
• Proposed by Stanley Smith Stevens in the 1940s and 1950s
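
One way to see these levels in R is through the data types used to store them. The sketch below uses made-up values (all names and numbers are hypothetical, not from the slides): unordered factors for nominal data, ordered factors for ordinal data, and plain numeric vectors for interval and ratio data.

```r
# Hypothetical example values, chosen only to illustrate each level
fuel      <- factor(c("gas", "diesel", "gas", "electric"))     # nominal
size      <- factor(c("small", "large", "medium", "small"),
                    levels = c("small", "medium", "large"),
                    ordered = TRUE)                             # ordinal
temp_c    <- c(-5, 0, 12, 30)                                   # interval (Celsius)
weight_kg <- c(900, 1500, 1200, 2000)                           # ratio

is.ordered(fuel)   # FALSE: categories have no inherent order
is.ordered(size)   # TRUE: small < medium < large
levels(size)

# Interval data: differences are meaningful, ratios are not
temp_diff <- temp_c[4] - temp_c[3]

# Ratio data: a true zero makes ratios meaningful as well
weight_ratio <- weight_kg[4] / weight_kg[1]
```

R does not distinguish interval from ratio data by type; that distinction has to be tracked by the analyst.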

# Levels of measurement matter

• How you depict the data
• What you can calculate using the data
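
To illustrate the second point, the sketch below shows how R itself limits what can be calculated at each level. The vectors are hypothetical examples, not data from the slides.

```r
# Hypothetical example data
colors <- factor(c("red", "blue", "red", "green"))              # nominal
sizes  <- factor(c("small", "large", "medium", "small"),
                 levels = c("small", "medium", "large"),
                 ordered = TRUE)                                 # ordinal

# Nominal: counting works, so the most frequent category is available
color_counts <- table(colors)
color_mode   <- names(color_counts)[which.max(color_counts)]

# Ordinal: order comparisons and max/min are defined
small_lt_large <- sizes[1] < sizes[2]
largest_size   <- max(sizes)

# A mean is not defined for categories: R returns NA with a warning
color_mean <- suppressWarnings(mean(colors))
```

Here `color_mean` is `NA`, which is R's way of saying that averaging category labels is not a meaningful operation.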

# Describing Data with Numbers

• What types of measures can we use to describe different levels of measurement?

| Level of measurement | Statistics |
|----------------------|------------|
| Nominal | mode, chi-squared |
| Ordinal | median, percentile, plus above |
| Interval | mean, standard deviation, correlation, ANOVA, plus above |
| Continuous (ratio) | geometric mean, harmonic mean, coefficient of variation, logarithms, plus above |
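
Base R has no built-in functions for several of the ratio-level statistics listed above, but they are short to write by hand. A minimal sketch on a small made-up vector (the values and variable names are assumptions for illustration; all values must be positive for the logarithm and the reciprocal to make sense):

```r
# Hypothetical positive, ratio-scale measurements
x <- c(2, 4, 8)

# Geometric mean: exponentiate the mean of the logs
geometric_mean <- exp(mean(log(x)))

# Harmonic mean: reciprocal of the mean of the reciprocals
harmonic_mean <- 1 / mean(1 / x)

# Coefficient of variation: standard deviation relative to the mean
coef_variation <- sd(x) / mean(x)

c(arithmetic = mean(x), geometric = geometric_mean, harmonic = harmonic_mean)
```

For positive data the harmonic mean is never larger than the geometric mean, which in turn is never larger than the arithmetic mean.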

# Let's talk about these statistics

• STATISTIC: a single measure of some attribute of a sample (e.g. its arithmetic mean value). It is calculated by applying a function (statistical algorithm) to the values of the items comprising the sample, which are known together as a set of data. [Wikipedia](http://en.wikipedia.org/wiki/Statistic)
• These statistics can measure a number of features of a dataset, but we tend to think of them as measuring either central tendency, spread, or association
• We'll focus on these today.
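
As a rough illustration of the three families, the sketch below computes one statistic of each kind with base R on the built-in `mtcars` data. Note that `mtcars$mpg` is a column of miles-per-gallon values, not the `mpg` dataset from ggplot2 used elsewhere in these slides; the variable names are chosen here for illustration.

```r
# Central tendency: where the typical value lies
mpg_mean <- mean(mtcars$mpg)

# Spread: how far values tend to fall from the mean
mpg_sd <- sd(mtcars$mpg)

# Association: how fuel economy moves with vehicle weight
mpg_wt_cor <- cor(mtcars$mpg, mtcars$wt)

round(c(mean = mpg_mean, sd = mpg_sd, cor = mpg_wt_cor), 3)
```

The correlation is negative: heavier cars in this dataset tend to get fewer miles per gallon.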

# Measures of Central Tendency

• These are the three canonical measures of central tendency:
• Mean
• Median
• Mode
• How are these different? What properties do they have? Why does this matter?
qplot(hwy, data = mpg, geom = "density") +
  geom_vline(xintercept = median(mpg$hwy), color = I("blue"), size = I(1.1)) +
  geom_vline(xintercept = mean(mpg$hwy), color = I("gold"), size = I(1.1)) +
  geom_vline(xintercept = 26, color = I("orange"), size = I(1.1)) +
  geom_text(aes(x = median(mpg$hwy) + 1.5, y = 0.08, label = "Median"), size = I(4.5)) +
  geom_text(aes(x = mean(mpg$hwy) - 1.5, y = 0.06, label = "Mean"), size = I(4.5)) +
  geom_text(aes(x = 26 + 1.5, y = 0.05, label = "Mode"), size = I(4.5))


library(xtable)
print(xtable(table(mpg$hwy)), type = "html")

## <!-- html table generated in R 2.15.1 by xtable 1.7-0 package -->
## <!-- Wed Sep 26 16:57:47 2012 -->
## <TABLE border=1>
## <TR> <TH>  </TH> <TH> V1 </TH>  </TR>
##   <TR> <TD align="right"> 12 </TD> <TD align="right">   5 </TD> </TR>
##   <TR> <TD align="right"> 14 </TD> <TD align="right">   2 </TD> </TR>
##   <TR> <TD align="right"> 15 </TD> <TD align="right">  10 </TD> </TR>
##   <TR> <TD align="right"> 16 </TD> <TD align="right">   7 </TD> </TR>
##   <TR> <TD align="right"> 17 </TD> <TD align="right">  31 </TD> </TR>
##   <TR> <TD align="right"> 18 </TD> <TD align="right">  10 </TD> </TR>
##   <TR> <TD align="right"> 19 </TD> <TD align="right">  13 </TD> </TR>
##   <TR> <TD align="right"> 20 </TD> <TD align="right">  11 </TD> </TR>
##   <TR> <TD align="right"> 21 </TD> <TD align="right">   2 </TD> </TR>
##   <TR> <TD align="right"> 22 </TD> <TD align="right">   7 </TD> </TR>
##   <TR> <TD align="right"> 23 </TD> <TD align="right">   7 </TD> </TR>
##   <TR> <TD align="right"> 24 </TD> <TD align="right">  13 </TD> </TR>
##   <TR> <TD align="right"> 25 </TD> <TD align="right">  15 </TD> </TR>
##   <TR> <TD align="right"> 26 </TD> <TD align="right">  32 </TD> </TR>
##   <TR> <TD align="right"> 27 </TD> <TD align="right">  14 </TD> </TR>
##   <TR> <TD align="right"> 28 </TD> <TD align="right">   7 </TD> </TR>
##   <TR> <TD align="right"> 29 </TD> <TD align="right">  22 </TD> </TR>
##   <TR> <TD align="right"> 30 </TD> <TD align="right">   4 </TD> </TR>
##   <TR> <TD align="right"> 31 </TD> <TD align="right">   7 </TD> </TR>
##   <TR> <TD align="right"> 32 </TD> <TD align="right">   4 </TD> </TR>
##   <TR> <TD align="right"> 33 </TD> <TD align="right">   2 </TD> </TR>
##   <TR> <TD align="right"> 34 </TD> <TD align="right">   1 </TD> </TR>
##   <TR> <TD align="right"> 35 </TD> <TD align="right">   2 </TD> </TR>
##   <TR> <TD align="right"> 36 </TD> <TD align="right">   2 </TD> </TR>
##   <TR> <TD align="right"> 37 </TD> <TD align="right">   1 </TD> </TR>
##   <TR> <TD align="right"> 41 </TD> <TD align="right">   1 </TD> </TR>
##   <TR> <TD align="right"> 44 </TD> <TD align="right">   2 </TD> </TR>
##    </TABLE>
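
The three measures can be recovered from the frequency table above. The sketch below copies the values and counts from that table into two vectors (so it runs without ggplot2) and rebuilds the individual observations; the variable names are chosen here for illustration. Base R has no function for the statistical mode, so the most frequent value is looked up directly.

```r
# Values and counts copied from the frequency table of mpg$hwy above
hwy_values <- c(12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
                27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 41, 44)
hwy_counts <- c(5, 2, 10, 7, 31, 10, 13, 11, 2, 7, 7, 13, 15, 32,
                14, 7, 22, 4, 7, 4, 2, 1, 2, 2, 1, 1, 2)

# Expand the table back into one observation per car
hwy <- rep(hwy_values, times = hwy_counts)

hwy_mean   <- mean(hwy)                            # about 23.44
hwy_median <- median(hwy)                          # 24
hwy_mode   <- hwy_values[which.max(hwy_counts)]    # 26
```

The three measures disagree (about 23.4, 24, and 26) because the distribution is not symmetric. The table also shows a second peak at 17 with 31 cars, nearly as many as the 32 at the mode, which a single number cannot convey.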


# Session Info

It is good practice to include the session info; for example, this document was produced with knitr version 0.8. Here is my session info:

print(sessionInfo(), locale = FALSE)

## R version 2.15.1 (2012-06-22)
## Platform: x86_64-pc-mingw32/x64 (64-bit)
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] xtable_1.7-0    mgcv_1.7-21     ggplot2_0.9.2.1 knitr_0.8
##
## loaded via a namespace (and not attached):
##  [1] colorspace_1.1-1   dichromat_1.2-4    digest_0.5.2
##  [4] evaluate_0.4.2     formatR_0.6        grid_2.15.1
##  [7] gtable_0.1.1       labeling_0.1       lattice_0.20-10
## [10] MASS_7.3-21        Matrix_1.0-9       memoise_0.1
## [13] munsell_0.4        nlme_3.1-104       plyr_1.7.1
## [16] proto_0.3-9.2      RColorBrewer_1.0-5 reshape2_1.2.1
## [19] scales_0.2.2       stringr_0.6.1      tools_2.15.1