# Introduction

We will review the following statistical concepts here through the lens of R:

• Statistical distributions
• Attributes of distributions
• Samples
• Standard deviations and standard errors
• Statistical models
• Univariate Regression
• Multivariate Regression
• Causality

# Describing Data

• At its most basic level, statistics is about summarizing and understanding data

# Spread of the Data

• Data spread describes how scattered a data set is
• One type of data, categorical data, describes groups
library(ggplot2)  # provides qplot() and the diamonds and mpg data sets used in these slides
qplot(factor(cyl), data = mtcars) + labs(x = "cylinder", title = "Car models by Cylinder Count")


• What can we learn here?
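
One way to start answering that question is to look at the counts behind the bars. This is a minimal sketch using base R's `table()` on the same `mtcars` data; the names `cyl_counts` and `cyl_share` are only illustrative.

```r
# Count how many car models fall into each cylinder group
cyl_counts <- table(mtcars$cyl)
print(cyl_counts)

# Express the same counts as shares of all models
cyl_share <- prop.table(cyl_counts)
print(round(cyl_share, 2))
```

The bar chart and the table carry the same information; the table simply makes the exact group sizes easy to read.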

# Let's try another

• What can we learn from this chart type about the data?
data(diamonds)
qplot(factor(cut), data = diamonds) + labs(x = "Cut", title = "Diamonds by Cut Quality")


• What else might we want to learn?

qplot(factor(color), data = diamonds) + labs(x = "Color", title = "Diamonds by Color and Clarity") +
facet_wrap(~clarity, nrow = 2)


# Still more

• With diamonds we immediately want to look at price
qplot(carat, price, data = diamonds, color = color) + geom_smooth(aes(group = 1))


• What do you see?
• Outliers?
• Data modes?
• Clusters?

# Graphical Depictions of Data

• These are ways to show data with graphics
• Graphical displays are driven by the concept of dimensions
• One dimension–a single category
• Two dimensions–two categories

# Levels of Measurement

• Any given dimension may be measured at different levels of measurement
• Nominal: unordered categories of data
• Ordinal: ordered categories of data, relative size and degree of difference between categories is unknown
• Interval: ordered values with equal spacing between them but no meaningful zero, like temperature in Celsius or Fahrenheit
• Continuous (ratio): a measurement scale in a continuous space with a meaningful zero–physical measurements
• Proposed by Stanley Smith Stevens in the 1940s and 1950s
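
The four levels map loosely onto R data types. The sketch below uses made-up toy vectors (`fuel`, `cut_quality`, `temp_c`, `weight_kg` are all hypothetical) to show how an unordered factor, an ordered factor, and numeric vectors behave differently.

```r
# Nominal: unordered categories, stored as a plain factor
fuel <- factor(c("gas", "diesel", "gas", "electric"))

# Ordinal: ordered categories, stored as an ordered factor
cut_quality <- factor(c("Good", "Fair", "Ideal", "Good"),
                      levels = c("Fair", "Good", "Ideal"),
                      ordered = TRUE)

is.ordered(fuel)                  # unordered, so "<" is not meaningful
is.ordered(cut_quality)           # ordered, so comparisons are allowed
cut_quality[1] < cut_quality[3]   # compares "Good" with "Ideal"

# Interval: differences are meaningful, ratios are not.
# The ratio changes when the unit (and its arbitrary zero) changes.
temp_c <- c(10, 30)
temp_f <- temp_c * 9 / 5 + 32
ratio_c <- temp_c[2] / temp_c[1]
ratio_f <- temp_f[2] / temp_f[1]

# Ratio: a true zero makes ratios stable under a change of unit
weight_kg <- c(50, 100)
weight_lb <- weight_kg * 2.20462
ratio_kg <- weight_kg[2] / weight_kg[1]
ratio_lb <- weight_lb[2] / weight_lb[1]
```

Here `ratio_c` and `ratio_f` disagree, while `ratio_kg` and `ratio_lb` agree, which is the practical difference between interval and ratio scales.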

# Levels of measurement matter

• How you depict the data
• What you can calculate using the data

# Describing Data with Numbers

• What types of measures can we use to describe different levels of measurement?

| Level of measurement | Statistics |
|---|---|
| Nominal | mode, chi-squared |
| Ordinal | median, percentile, plus above |
| Interval | mean, standard deviation, correlation, ANOVA, plus above |
| Continuous (ratio) | geometric mean, harmonic mean, coefficient of variation, logarithms, plus above |
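
As a rough illustration of the table, the sketch below computes one or two statistics per level on made-up toy data. All variable names are hypothetical, and the ordinal median is taken through the integer level codes, which is a simple workaround that assumes an odd number of observations.

```r
# Nominal: the mode is the most frequent category
fuel <- factor(c("gas", "diesel", "gas", "electric"))
fuel_mode <- names(which.max(table(fuel)))

# Ordinal: median via the integer codes of an ordered factor (odd length assumed)
cut_quality <- factor(c("Good", "Fair", "Ideal", "Good", "Ideal"),
                      levels = c("Fair", "Good", "Ideal"),
                      ordered = TRUE)
cut_median <- levels(cut_quality)[median(as.integer(cut_quality))]

# Interval: mean and standard deviation
temp_c <- c(10, 20, 30)
temp_mean <- mean(temp_c)
temp_sd <- sd(temp_c)

# Ratio: geometric mean, harmonic mean, and coefficient of variation
weight_kg <- c(50, 100, 200)
geo_mean <- exp(mean(log(weight_kg)))
harm_mean <- 1 / mean(1 / weight_kg)
coef_var <- sd(weight_kg) / mean(weight_kg)
```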

# Let's talk about these statistics

• STATISTIC: a single measure of some attribute of a sample (e.g. its arithmetic mean value). It is calculated by applying a function (statistical algorithm) to the values of the items comprising the sample, which are known together as a set of data. [Wikipedia](http://en.wikipedia.org/wiki/Statistic)
• These statistics can measure a number of features of a dataset, but we tend to think of them as measuring either central tendency, spread, or association
• We'll focus on these today.
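
To make the three roles concrete, here is one statistic of each kind computed on the base R `mtcars` data. The choice of columns (`mpg`, `wt`) and the variable names are illustrative only.

```r
# Central tendency: where the typical value sits
mpg_mean <- mean(mtcars$mpg)

# Spread: how far values scatter around that center
mpg_sd <- sd(mtcars$mpg)

# Association: how two variables move together (Pearson correlation)
wt_mpg_cor <- cor(mtcars$wt, mtcars$mpg)

print(c(mean = mpg_mean, sd = mpg_sd, cor = wt_mpg_cor))
```

The correlation here is negative: in this data set, heavier cars tend to get fewer miles per gallon.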

# Measures of Central Tendency

• These are the three canonical measures of central tendency:
• Mean
• Median
• Mode
• How are these different? What properties do they have? Why does this matter?
# Density of highway mpg with the median (blue), mean (gold), and mode (orange, 26 mpg) marked
qplot(hwy, data = mpg, geom = "density") +
  geom_vline(xintercept = median(mpg$hwy), color = "blue", size = 1.1) +
  geom_vline(xintercept = mean(mpg$hwy), color = "gold", size = 1.1) +
  geom_vline(xintercept = 26, color = "orange", size = 1.1) +
  # annotate() draws each label once instead of once per data row
  annotate("text", x = median(mpg$hwy) + 1.5, y = 0.08, label = "Median", size = 4.5) +
  annotate("text", x = mean(mpg$hwy) - 1.5, y = 0.06, label = "Mean", size = 4.5) +
  annotate("text", x = 26 + 1.5, y = 0.05, label = "Mode", size = 4.5)
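
One property that separates the three measures is how they react to an extreme value. The sketch below uses a small made-up vector and a hypothetical helper, `stat_mode()`, because base R has no built-in function for the statistical mode (`mode()` returns the storage type instead).

```r
# Hypothetical helper: most frequent value; ties resolve to the smallest value
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

x <- c(2, 3, 3, 4, 5)
x_outlier <- c(x, 100)  # same data plus one extreme value

before <- c(mean = mean(x), median = median(x), mode = stat_mode(x))
after <- c(mean = mean(x_outlier), median = median(x_outlier),
           mode = stat_mode(x_outlier))

print(rbind(before, after))
```

The single outlier pulls the mean from 3.4 to 19.5, moves the median only from 3 to 3.5, and leaves the mode at 3. The same helper could be applied to `mpg$hwy` to check the hard-coded mode used in the plot.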


