Jared Knowles

# Overview

In this lesson we hope to learn:

• Aggregating data
• Organizing our data
• Manipulating vectors
• Dealing with missing data

# Again, read in our dataset

``````# Set working directory to the tutorial directory; in RStudio you can
# do this in the 'Tools' tab
setwd("~/GitHub/r_tutorial_ed")
# Note: if we don't assign data to 'df', R just prints the contents of
# the table``````

# Aggregation

• Sometimes we need to do some basic checking for the number of observations or types of observations in our dataset
• To do this quickly and easily - the `table` function is our friend
• Let's look at our observations by year and grade
``table(df$grade, df$year)``
``````
2000 2001 2002
3  200  100  200
4  100  200  100
5  200  100  200
6  100  200  100
7  200  100  200
8  100  200  100``````
• The first argument to `table` gives the rows, the second gives the columns
• Ugly, but effective

# Aggregation can be more complex

• Let's aggregate by race and year
``table(df$year, df$race)``
``````
A   B   H   I   W
2000  16 370  93   7 414
2001  16 370  93   7 414
2002  16 370  93   7 414``````
• Race is consistent across years, interesting
• What if we want to only look at 3rd graders that year?

# More complicated still

``````with(df[df$grade == 3, ], {
    table(year, race)
})``````
``````      race
year     A   B   H   I   W
2000   4  78  22   4  92
2001   1  44   8   2  45
2002   0  74  20   1 105``````
• `with` specifies a data object to work on, in this case all elements of `df` where `grade == 3`
• `table` is the same command as above, but since we specified the data object in the `with` statement, we don't need the `df$` in front of the variables of interest
``````df2 <- subset(df, grade == 3)
table(df2$year, df2$race)``````
``````
A   B   H   I   W
2000   4  78  22   4  92
2001   1  44   8   2  45
2002   0  74  20   1 105``````
``rm(df2)``

# Quick exercise

• Can you find the number of black students in each grade in each year?
• hint: `with(df[df$___==___,]...)`
• How many in year 2002, grade 6?
• 48
• How many in 2001, grade 7?
• 39

``with(df[df$race=="B",],{table(year,grade)})``
``````      grade
year    3  4  5  6  7  8
2000 78 48 87 39 74 44
2001 44 78 48 87 39 74
2002 74 44 78 48 87 39``````
• Quick question: how can we understand the three types of enclosures used in this function: `()`, `[]`, and `{}`?
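To make that question concrete, here is a toy illustration (using a stand-in vector rather than `df`) of what each enclosure does:

```r
# ( ) encloses the arguments of a function call
x <- c(4, 8, 15, 16, 23, 42)

# [ ] indexes (subsets) an object
x[x > 15]

# { } groups several expressions into a single block,
# as in the with() call above
{
  y <- mean(x)
  y
}
```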

# Tables cont.

• This is really powerful for looking at the descriptive dimensions of the data, we can ask questions like:
• How many students are at each proficiency level each year?
``table(df$year, df$proflvl)``
``````
       advanced basic below basic proficient
  2000       56   313         143        388
  2001      229   183          64        424
  2002      503    27           3        367``````
• How many students are at each proficiency level by race?
``table(df$race, df$proflvl)``
``````
    advanced basic below basic proficient
  A       19     7           3         19
  B      160   302         162        486
  H       54    76          33        116
  I        7     4           1          9
  W      548   134          11        549``````

# Proportional Tables

• What if we aren't interested in counts?
• R makes it really easy to calculate proportions
``prop.table(table(df$race, df$proflvl))``
``````
     advanced     basic below basic proficient
  A 0.0070370 0.0025926   0.0011111  0.0070370
  B 0.0592593 0.1118519   0.0600000  0.1800000
  H 0.0200000 0.0281481   0.0122222  0.0429630
  I 0.0025926 0.0014815   0.0003704  0.0033333
  W 0.2029630 0.0496296   0.0040741  0.2033333``````
• Hmmm, this is goofy. This tells us the proportion of each cell out of the total. Also, the digits are distracting. How can we fix this?

# Try number 2

``round(prop.table(table(df$race, df$proflvl), 1), digits = 3)``
``````
    advanced basic below basic proficient
  A    0.396 0.146       0.062      0.396
  B    0.144 0.272       0.146      0.438
  H    0.194 0.272       0.118      0.416
  I    0.333 0.190       0.048      0.429
  W    0.441 0.108       0.009      0.442``````
• The `1` tells R we want proportions row-wise; a `2` gives them column-wise
• `round` tells R to cut off some digits for us
• Proportions are just that, not in percentage terms (we need to multiply by 100 for this)
• Can you make this table express percentages instead of proportions? How might that code look?
• A few more problems arise - this pools all observations, including students across years
• To avoid these, we need to aggregate the data somehow
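One way to answer the percentage question from above is to multiply the proportion table by 100 before rounding; a sketch on a toy table (since `df` is loaded earlier in the tutorial):

```r
# a toy two-way table standing in for table(df$race, df$proflvl)
tab <- table(rep(c("A", "B"), times = c(4, 6)),
             rep(c("x", "y"), length.out = 10))
# multiply row-wise proportions by 100 to get percentages
round(100 * prop.table(tab, 1), 1)
```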

# Checking Understanding

• We have seen how to chain functions together
• We have also seen how to examine a dataframe by looking at the observations in it
• We are now going to move on to aggregating data so we can look at unique cases when we have more than one observation for each unit

# Aggregating Data

• One of the most common questions you need to answer is to compute aggregates of data
• R has an `aggregate` function that can be used and helps us avoid the clustering problems above
• This works great for simple aggregation like scale score by race, we just need a `formula` (think I want variable X by grouping factor Y) and the statistic we want to compute
``````# Reading Scores by Race
aggregate(readSS ~ race, FUN = mean, data = df)``````
``````  race readSS
1    A  508.7
2    B  460.2
3    H  473.2
4    I  485.2
5    W  533.2``````

# Aggregate (II)

• `aggregate` can take us a little further, we can use aggregate multiple variables at a time
``aggregate(cbind(readSS, mathSS) ~ race, data = df, mean)``
``````  race readSS mathSS
1    A  508.7  477.9
2    B  460.2  442.5
3    H  473.2  442.7
4    I  485.2  455.9
5    W  533.2  529.8``````
• We can add multiple grouping variables using the `formula` syntax
``````head(aggregate(cbind(readSS, mathSS) ~ race + grade, data = df, mean),
8)``````
``````  race grade readSS mathSS
1    A     3  397.8  454.8
2    B     3  409.8  371.6
3    H     3  417.7  364.2
4    I     3  407.6  449.3
5    W     3  481.1  450.7
6    A     4  456.0  438.2
7    B     4  426.9  408.1
8    H     4  418.8  404.6``````

# Crosstabs

• We can build a systematic cross-tab now
``````ag <- aggregate(readSS ~ race + grade, data = df, mean)
xtabs(readSS ~ ., data = ag)``````
``````    grade
race     3     4     5     6     7     8
A 397.8 456.0 479.1 539.5 600.4 605.3
B 409.8 426.9 447.6 470.9 492.3 523.5
H 417.7 418.8 481.2 489.1 500.3 534.2
I 407.6 531.1 547.6   0.0 405.5 518.0
W 481.1 498.5 517.1 546.6 565.2 596.1``````
• And prettier output
``ftable(xtabs(readSS ~ ., data = ag))``
``````     grade     3     4     5     6     7     8
race
A          397.8 456.0 479.1 539.5 600.4 605.3
B          409.8 426.9 447.6 470.9 492.3 523.5
H          417.7 418.8 481.2 489.1 500.3 534.2
I          407.6 531.1 547.6   0.0 405.5 518.0
W          481.1 498.5 517.1 546.6 565.2 596.1``````

• What is the mean reading score for 6th grade students with disabilities?

• 481.83

• How many points is this from non-disabled students?

• 29.877

``aggregate(cbind(readSS, mathSS) ~ disab + grade, data = df, mean)``
``````   disab grade readSS mathSS
1      0     3  449.9  418.3
2      1     3  421.1  376.3
3      0     4  464.0  454.2
4      1     4  438.2  425.1
5      0     5  484.9  470.2
6      1     5  475.1  431.0
7      0     6  511.7  507.9
8      1     6  481.8  476.9
9      0     7  532.0  532.0
10     1     7  516.1  474.3
11     0     8  567.6  567.7
12     1     8  518.8  534.1``````

# School Means

• Consider the case we want to turn our student level data into school level data
• Who hasn't had to do this?!?
• In `aggregate` we do:
``````z <- aggregate(readSS ~ dist, FUN = mean, data = df)
z``````
``````  dist readSS
1  205  496.5
2  402  500.5
3  495  491.6``````
• But I want more! I want to aggregate multiple variables. I want to do it across multiple groups. I want the output to be a dataframe I can work on.
• Thank you `plyr`

# Aggregate Isn't Enough

• `aggregate` is cool, but it isn't very flexible
• We can only use aggregate output as a table, which we have to convert to a data frame
• There is a better way; the `plyr` package
• `plyr` is a set of routines/logical structure for transforming, summarizing, reshaping, and reorganizing data objects of one type in R into another type (or the same type)
• We will focus here on summarizing and aggregating a data frame, but later in the bootcamp we'll apply functions to lists and turn lists into data frames as well
• This is cool!

# The Logic of plyr

• In R this is known as "split, apply, and combine"
• Why? First, we split the data into groups by some factor or logical operator
• Then we apply some function or another to that group (i.e. count the unique values of a variable, take the mean of a variable, etc.)
• Then we combine the data back together
• This has some advantages - unlike other methods, the data does not have to be ordered by our ID variable for this to work
• The disadvantage is that this method is computationally expensive, even in R, and requires copying our data frame using up RAM

• The `plyr` package has a number of utilities to help us split-apply-combine across data types for both input and output
• In R we generally avoid plain `for` loops for iterating over groups of students: they are slow and verbose compared to vectorized, grouped operations
• `plyr` to the rescue, while not as fast as a compiled language, it is pretty dang good!
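The split-apply-combine idea can be sketched by hand in base R on a toy data frame; this is roughly the bookkeeping `plyr` automates for us:

```r
# split-apply-combine by hand in base R
dat <- data.frame(grp = c("a", "a", "b", "b"),
                  val = c(1, 2, 3, 5))
pieces  <- split(dat, dat$grp)          # split into groups
applied <- lapply(pieces, function(d) { # apply a summary to each piece
  data.frame(grp = d$grp[1], mean_val = mean(d$val))
})
result <- do.call(rbind, applied)       # combine back together
result
```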

# The logic of plyr

• This shows how the dataframe is broken up into pieces and each piece then gets whatever functions, summaries, or transformations we apply to it

# How plyr works on dataframes

• And this shows the output `ddply` has before it combines it back for us when we do the call `ddply(df,.(sex,age),"nrow")`

# Using plyr

• `plyr` has a straightforward syntax
• All `plyr` functions are in the format XXply. The two X's specify what the input file we are applying a function to is, and then what way we would like it outputted.
• In `plyr` d = dataframe, l= list, m=matrix, and a=array. By far the most common usage is `ddply`
• From a dataframe, to a dataframe.
• We will see more of `plyr` in Tutorial 4 as well
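For instance, on a toy data frame the `dd` and `dl` variants differ only in the output type:

```r
library(plyr)
dat <- data.frame(sex = c("M", "F", "M", "F", "M"),
                  age = c(10, 10, 11, 11, 11))
ddply(dat, .(sex), "nrow")  # data frame in, data frame out
dlply(dat, .(sex), nrow)    # data frame in, list out
```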

# plyr in Action

``````library(plyr)
myag <- ddply(df, .(dist, grade), summarize,
    mean_read = mean(readSS, na.rm = T),
    mean_math = mean(mathSS, na.rm = T),
    sd_read = sd(readSS, na.rm = T),
    sd_math = sd(mathSS, na.rm = T),
    count_read = length(readSS),
    count_math = length(mathSS))``````
• This looks complex, but it only has a few components.
• The first argument is the dataframe we are working on, the next argument is the level of identification we want to aggregate to
• `summarize` tells `ddply` what we are doing to the data frame
• Then we make a list of new variable names, and how to calculate them on each of the subsets in our large data frame
• That's it!

# Results

``head(myag)``
``````  dist grade mean_read mean_math sd_read sd_math count_read
1  205     3     451.7     406.1   93.52   72.45        200
2  205     4     438.9     459.9   77.76   79.10        100
3  205     5     487.9     462.6   85.30   75.10        200
4  205     6     514.7     526.8   76.83   66.04        100
5  205     7     530.0     521.5   84.82   74.85        200
6  205     8     575.5     581.2   79.58   83.45        100
count_math
1        200
2        100
3        200
4        100
5        200
6        100``````

# More plyr

• This is great, we can quickly build a summary dataset from individual records
• A few advanced tricks. How do we build counts and percentages into our dataset?
``````myag <- ddply(df, .(dist, grade), summarize,
    mean_read = mean(readSS, na.rm = T),
    mean_math = mean(mathSS, na.rm = T),
    sd_read = sd(readSS, na.rm = T),
    sd_math = sd(mathSS, na.rm = T),
    count_read = length(readSS),
    count_math = length(mathSS),
    count_black = length(race[race == "B"]),
    per_black = length(race[race == "B"])/length(race))
summary(myag[, 7:10])``````
``````   count_read    count_math   count_black     per_black
Min.   :100   Min.   :100   Min.   :36.0   Min.   :0.360
1st Qu.:100   1st Qu.:100   1st Qu.:41.2   1st Qu.:0.386
Median :150   Median :150   Median :60.5   Median :0.412
Mean   :150   Mean   :150   Mean   :61.7   Mean   :0.411
3rd Qu.:200   3rd Qu.:200   3rd Qu.:82.0   3rd Qu.:0.439
Max.   :200   Max.   :200   Max.   :92.0   Max.   :0.480  ``````

# Note for SQL Junkies

• There is an alternate package to plyr called `data.table` which is really handy
• It allows SQL like querying of R data frames
• It is incredibly fast
• It will be incorporated into the next `plyr` version
• You can read up on it online

# Quick Exercises in ddply

• What if we want to compare how districts do on educating ELL students?
• What district ID has the highest mean score for 4th grade ELL students on reading? Math?

• 66 in reading, 105 in math
• How many students are in these classes?

• 12 and 7 respectively

``````myag2 <- ddply(df, .(dist, grade, ell), summarize,
    mean_read = mean(readSS, na.rm = T),
    mean_math = mean(mathSS, na.rm = T),
    sd_read = sd(readSS, na.rm = T),
    sd_math = sd(mathSS, na.rm = T),
    count_read = length(readSS),
    count_math = length(mathSS),
    count_black = length(race[race == "B"]),
    per_black = length(race[race == "B"])/length(race))
subset(myag2, grade == 4 & ell == 1)``````
``````   dist grade ell mean_read mean_math sd_read sd_math count_read
4   205     4   1     403.0     392.9   64.52   39.09         16
16  402     4   1     443.1     388.7   79.52   53.28         29
28  495     4   1     408.8     431.9   77.47   70.77         13
   count_math count_black per_black
4          16           2   0.12500
16         29           6   0.20690
28         13           1   0.07692``````

# Sorting

• A key way to explore data in tabular form is to sort data
• Sorting data in R can be dangerous as you can reorder the vectors of a dataframe
• We use the `order` function to sort data
``````df.badsort <- order(df$readSS, df$mathSS)
head(df.badsort)``````
``[1]  106 1026    2   56  122  118``
• Why is this wrong? What is R giving us?
• Row indices (which here match the rownames), not a reordered data frame...

# Correct Example

• To fix it, we need to tell R to reorder the dataframe using the rownames in the order we want
``````df.sort <- df[order(df$readSS, df$mathSS, df$attday), ]
head(df[, c(3, 23, 29, 30)])``````
``````   stuid attday readSS mathSS
1 149995    180  357.3  387.3
2  13495    180  263.9  302.6
3 106495    160  369.7  365.5
4  45205    168  346.6  344.5
5 142705    156  373.1  441.2
6  14995    157  436.8  463.4``````
``head(df.sort[, c(3, 23, 29, 30)])``
``````      stuid attday readSS mathSS
106  106705    160  251.5  277.0
1026  80995    176  263.2  377.8
2     13495    180  263.9  302.6
56   122402    180  264.3  271.7
122   79705    168  266.4  318.7
118   40495    173  266.9  275.0``````

# Let's clean it up a bit more

``head(df[with(df, order(-readSS, -attday)), c(3, 23, 29, 30)])``
``````      stuid attday readSS mathSS
1631 145205    137  833.2  828.4
1462 107705    180  773.3  746.6
2252 122902    180  744.0  621.6
2341  44902    175  741.7  676.3
1482 134705    180  739.2  705.4
1630  14495    162  738.9  758.2``````
• Here we find the high-performing students; note that the `-` denotes descending order (R's default is ascending)
• This is easy to correct

• Sorting works differently on some data types like matrices
``````M <- matrix(c(1, 2, 2, 2, 3, 6, 4, 5), 4, 2, byrow = FALSE, dimnames = list(NULL,
c("a", "b")))
M[order(M[, "a"], -M[, "b"]), ]``````
``````     a b
[1,] 1 3
[2,] 2 6
[3,] 2 5
[4,] 2 4``````

• Tables are familiar
``````mytab <- table(df$grade, df$year)
mytab[order(mytab[, 1]), ]``````
``````
2000 2001 2002
4  100  200  100
6  100  200  100
8  100  200  100
3  200  100  200
5  200  100  200
7  200  100  200``````
``mytab[order(mytab[, 2]), ]``
``````
2000 2001 2002
3  200  100  200
5  200  100  200
7  200  100  200
4  100  200  100
6  100  200  100
8  100  200  100``````

# Filtering Data

• Filtering data is an incredibly powerful feature and we have already seen it used to do some interesting things
• Filtering data in R is loaded with trouble though, because the filtering arguments must be very carefully specified
• Filtering is like a mini-sort, and we've done it already
• Always, always, always check your work
• And remember, this is the place the NAs do the most damage
• Let's look at some examples

# Basic Filtering a Column

``````# Gives all rows that meet this requirement
df[df$mathSS > 800, ]``````
``````           X school  stuid grade schid dist white black hisp indian
1631 1281061    852 145205     8    15  205     1     0    0      0
     asian econ female ell disab sch_fay dist_fay luck ability
1631     0    0      1   0     0       0        0    0   108.3
     measerr teachq year attday schoolscore district schoolhigh
1631   6.325  155.7 2001    137       227.7       19          0
     schoolavg schoollow readSS mathSS  proflvl race
1631         1         0  833.2  828.4 advanced    W``````
``````# Gives all values of grade that meet this requirement
df$grade[df$mathSS > 800]``````
``[1] 8``
• Before the brackets we specify what we want returned, and within the brackets we present the logical expression to evaluate
• Behind the scenes R does a logical test and gets the row numbers that match the logical expression
• It then combines them back with the object in front of the brackets to return the values
• This seems basic enough, let's filter on multiple dimensions
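A toy example makes those behind-the-scenes steps visible (the vector here is a stand-in for a column like `df$mathSS`):

```r
# a toy vector standing in for a test-score column
scores <- c(550, 820, 640, 870)
scores > 800          # the logical test R performs
which(scores > 800)   # the positions that match
scores[scores > 800]  # the values returned
```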

# Multiple filters

``df$grade[df$black == 1 & df$readSS > 650]``
`` [1] 8 7 8 6 6 7 8 7 8 8 8 4``
• The & operator tells R we want rows where both of these are true
• How would we tell R we wanted rows where either were true?
• What happens if we type `df$black=1` or `black==1`?
• Why won't this work?
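As a hint for the "either" question: `|` is R's OR operator. (And note that `df$black = 1` would silently overwrite the whole column, because a single `=` assigns while `==` tests.) A toy sketch on stand-in vectors:

```r
# stand-ins for df$grade and df$readSS
grade <- c(3, 4, 5, 6)
read  <- c(700, 400, 660, 300)
grade[read > 650 & grade == 4]  # both conditions must hold: no matches
grade[read > 650 | grade == 4]  # either condition may hold
```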

# Using filters to assign values

• We can also use filters to assign values as well
• This is how you recode variables and create new ones
• Let's create a variable `spread` indicating whether a district has high or low spread among its student scores
``````myag$spread <- NA  # create variable``````
``````high  low
  15    3 ``````
• How did we define spread in this block of code?
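The recoding lines did not survive in this copy of the slides; as a purely illustrative sketch (the cutoff of 75 and the use of `sd_read` are my assumptions, not necessarily the tutorial's actual rule), a variable like `spread` could be coded this way:

```r
# stand-in for myag$sd_read; the cutoff of 75 is illustrative only
sd_read <- c(60, 80, 90, 70)
spread <- rep(NA, length(sd_read))   # create variable
spread[sd_read <= 75] <- "low"       # recode by filtering
spread[sd_read > 75]  <- "high"
table(spread)
```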

# How does it work?

• The previous block of code is a useful way to learn how to recode variables
``````myag$spread <- NA  # create variable``````
• Create a new variable in `myag` called `schoolperf` for `mean_math` scores with the following coding scheme:
  • grade 3: `mean_math` > 425 is "hi"
  • grade 4: `mean_math` > 450 is "hi"
  • grade 5: `mean_math` > 475 is "hi"
  • grade 6: `mean_math` > 500 is "hi"
  • grade 7: `mean_math` > 525 is "hi"
  • grade 8: `mean_math` > 575 is "hi"
• All other values are coded as "lo"
• How many "hi" and "lo" observations do we have?
• By `dist`?

# Results

``````myag$schoolperf <- "lo"
myag$schoolperf[myag$grade == 3 & myag$mean_math > 425] <- "hi"
myag$schoolperf[myag$grade == 4 & myag$mean_math > 450] <- "hi"
myag$schoolperf[myag$grade == 5 & myag$mean_math > 475] <- "hi"
myag$schoolperf[myag$grade == 6 & myag$mean_math > 500] <- "hi"
myag$schoolperf[myag$grade == 7 & myag$mean_math > 525] <- "hi"
myag$schoolperf[myag$grade == 8 & myag$mean_math > 575] <- "hi"
myag$schoolperf <- as.factor(myag$schoolperf)
summary(myag$schoolperf)``````
``````hi lo
9  9 ``````
``table(myag$dist, myag$schoolperf)``
``````
hi lo
205  3  3
402  3  3
495  3  3``````

# Let's replace data

• For district 6 let's knock out the grade 3 scores by replacing them with missing data
``````myag$mean_read[myag$dist == 6 & myag$grade == 3] <- NA
head(myag[, 1:4], 2)``````
``````  dist grade mean_read mean_math
1  205     3     451.7     406.1
2  205     4     438.9     459.9``````
• Let's replace one data element with another
``````myag$mean_read[myag$dist == 6 & myag$grade == 3] <- myag$mean_read[myag$dist ==
    6 & myag$grade == 4]
head(myag[, 1:4], 2)``````
``````  dist grade mean_read mean_math
1  205     3     451.7     406.1
2  205     4     438.9     459.9``````
• Voila

# Why do NAs matter so much?

• Let's consider the case above but insert some NA values for all 3rd grade tests
``````myag$mean_read[myag$grade == 3] <- NA
head(myag[order(myag$grade), 1:4])``````
``````   dist grade mean_read mean_math
1   205     3        NA     406.1
7   402     3        NA     431.9
13  495     3        NA     405.5
2   205     4     438.9     459.9
8   402     4     474.9     432.8
14  495     4     447.8     469.1``````

# NAs II

• Now let's calculate a few statistics:
``mean(myag$mean_math)``
``[1] 490.7``
``mean(myag$mean_read)``
``[1] NA``
• Remember, NA values propagate: R treats an NA as a value that could be literally anything, so the `mean` of a vector containing NA is itself unknown
• We can override this though:
``mean(myag$mean_math, na.rm = T)``
``[1] 490.7``
``mean(myag$mean_read, na.rm = T)``
``[1] 507.5``

# Beyond the Mean

• But for other problems it is tricky
• What if we want to know the number of rows that have a `mean_read` of less than 500?
``length(myag$dist[myag$mean_read < 500])``
``[1] 10``
``head(myag$mean_read[myag$mean_read < 500])``
``[1]    NA 438.9 487.9    NA 474.9 472.5``
• And what if we want to add the standard deviation to these vectors?
``````badvar <- myag$mean_read + myag$sd_read
summary(badvar)``````
``````   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    517     566     592     587     614     655       3 ``````

# So we need to filter NAs explicitly

• Consider the case where two sets of variables have different missing elements
``````myag$sd_read[myag$count_read < 100 & myag$mean_read < 550] <- NA
length(myag$mean_read[myag$mean_read < 550])``````
``[1] 16``
``length(myag$mean_read[myag$mean_read < 550 & !is.na(myag$mean_read)])``
``[1] 13``
• What is `!is.na()` ?
• `is.na()` is a helpful function that returns TRUE when a value is missing
• `!` is the logical negation ("not") operator
• So we are asking R whether each value is not missing, and to give us only the non-missing values back
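A small self-contained example of the same pattern:

```r
x <- c(438.9, NA, 487.9, NA, 474.9)
is.na(x)        # TRUE where a value is missing
x[!is.na(x)]    # keep only the non-missing values
sum(!is.na(x))  # count the non-missing values
```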

# Merging Data

• It is unlikely all the data we will want resides in a single dataset and often we have to combine data from several sources
• R makes this easy, but that simplicity comes at a cost - it can be easy to make mistakes if you don't specify things carefully
• Let's merge attributes about a student's school with the student row data
• We might want to do that if we want to evaluate the performance of students in different school climates, and school climate was measured in part by the mean performance

# Merging Data II

• We have two data objects `df` which has multiple rows per student and `myag` which has multiple rows per school
• What are the variables that link these two together?
``names(myag)``
`````` [1] "dist"        "grade"       "mean_read"   "mean_math"
 [5] "sd_read"     "sd_math"     "count_read"  "count_math"
 [9] "count_black" "per_black"   "spread"      "schoolperf" ``````
``names(df[, c(2, 3, 4, 6)])``
``[1] "school" "stuid"  "grade"  "dist"  ``
• It looks like `dist` and `grade` are in common. Is this ok?
• Why might we want to consider re-aggregating with `year` as well?
• For this example we won't just yet

# Merge Options

• We have a few options with `merge` we want to consider with `?merge`
• In the simple case we let `merge` automagically combine the data
``````simple_merge <- merge(df, myag)
names(simple_merge)``````
`````` [1] "grade"       "dist"        "X"           "school"
 [5] "stuid"       "schid"       "white"       "black"
 [9] "hisp"        "indian"      "asian"       "econ"
[13] "female"      "ell"         "disab"       "sch_fay"
[17] "dist_fay"    "luck"        "ability"     "measerr"
[21] "teachq"      "year"        "attday"      "schoolscore"
[25] "district"    "schoolhigh"  "schoolavg"   "schoollow"
[29] "readSS"      "mathSS"      "proflvl"     "race"
[33] "mean_read"   "mean_math"   "sd_read"     "sd_math"
[37] "count_read"  "count_math"  "count_black" "per_black"
[41] "spread"      "schoolperf" ``````
• It looks like it did a good job

# Merge Options

• In complicated cases, merge has some important options we should review
• First is the simple sounding 'by' argument:
• `merge(df1, df2, by = c("id1", "id2"))`
• We can also specify `merge(df1, df2, by.x = c("id1", "id2"), by.y = c("id1_a", "id2_a"))`
• This allows us to have different names for our ID variables
• Now, what if we have two different sized objects and not all matches between them?
• `merge(df1, df2, all.x = TRUE, all.y = TRUE)`
• We can tell R whether we want to keep all of the `x` observations (df1), all the `y` observations (df2) or neither, or both
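A toy merge shows what `all.x = TRUE` does with non-matching rows:

```r
df1 <- data.frame(id = c(1, 2, 3), score = c(10, 20, 30))
df2 <- data.frame(id = c(1, 2), grp = c("a", "b"))
merge(df1, df2, by = "id")                # id 3 is dropped
merge(df1, df2, by = "id", all.x = TRUE)  # id 3 is kept, grp is NA
```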

# Reshaping Data

• Reshaping data is a slightly different issue than aggregating data
• Let's review the two data types: long and wide
``head(df[, 1:10], 3)``
``````    X school  stuid grade schid dist white black hisp indian
1  44      1 149995     3   105  495     0     1    0      0
2  53      1  13495     3    45  495     0     1    0      0
3 116      1 106495     3    45  495     0     1    0      0``````
• Now let's look at wide:
``head(widedf[, c(1, 28:40)], 3)``
``````   stuid readSS.2000 mathSS.2000 proflvl.2000 race.2000  X.2001
1 149995       357.3       387.3        basic         B  441000
2  13495       263.9       302.6  below basic         B  531000
3 106495       369.7       365.5        basic         B 1161000
school.2001 grade.2001 schid.2001 dist.2001 white.2001 black.2001
1           1          4        105       495          0          1
2           1          4         45       495          0          1
3           1          4         45       495          0          1
hisp.2001 indian.2001
1         0           0
2         0           0
3         0           0``````
• How did we reshape this data?

# Wide Data v. Long Data

• The great debate
• Most econometrics, panel, and time series datasets come wide and so these seem familiar
• R for most cases prefers long data, including for most graphing and analysis functions
• So we have to learn both

# The reshape Function

• `reshape` is the way to move from wide to long
• The data stays the same, but the shape of it changes
• The long data had dimensions: 2700, 32
• The wide data has dimensions: 1200, 91
• How do we get to these numbers? The 2700 student-year rows collapse to 1200 unique students, and each of the 30 time-varying variables (32 columns minus `stuid` and `year`) gets one column per year: 30 × 3 + `stuid` = 91
• The rows in the wide dataframe represent unique students

# Deconstructing reshape

``widedf <- reshape(df, timevar = "year", idvar = "stuid", direction = "wide")``
• `idvar` represents the unit we want to represent a single row, in this case each unique student gets a single row
• In this simple case `timevar` is the variable that differentiates between two rows with the same student ID
• Note that `timevar` needn't always represent time!
• `direction` tells R we are going to move to wide data
• As written all data will move, but using the `varying` argument we can tell R explicitly which items we want to move wide
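A toy version of the same call makes the mechanics easy to see:

```r
# a miniature long data frame with the same shape as df
long <- data.frame(stuid  = c(1, 1, 2, 2),
                   year   = c(2000, 2001, 2000, 2001),
                   readSS = c(357, 441, 264, 310))
wide <- reshape(long, timevar = "year", idvar = "stuid",
                direction = "wide")
wide  # one row per student, one readSS column per year
```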

# What about Wide to Long?

• We often need to do this to plot data in R
• Luckily the `reshape` function works well in both directions
``````longdf <- reshape(widedf, idvar = "stuid", timevar = "year", varying = names(widedf[,
2:91]), direction = "long", sep = ".")``````
• If our data is formatted nicely, R can do the guessing and identify the years for us by parsing the dataframe names

# Subsetting Data

• We have already seen a lot of subsetting examples above, which is what filtering is, but R provides some great shortcuts to this
• Let's look at the `subset` function to get only 4th grade scores
``````g4 <- subset(df, grade == 4)
dim(g4)``````
``[1] 400  32``
• This is equivalent to:
``g4_b <- df[df$grade == 4, ]``
• These two elements are the same:
``identical(g4, g4_b)``
``[1] TRUE``

# That's it

• Now you can filter, subset, sort, recode, and aggregate data!
• Let's look at a few exercises to test these skills
• Once these skills are mastered, we can begin to understand how to automate R to clean data with known errors, and to recode data in R so it is ready to be used for analysis
• Then we can really take off!

# Exercises

1. Say we are unhappy about attributing the school/grade mean score across years to student-year observations like we did in this lesson. Let's fix it by first aggregating our student data frame to a school/grade/year data frame, and second by merging that new data frame with our student level data.

2. Sort the student-level data frame on `attday` and `ability` in descending order.

3. Find the highest proportion of black students in any school/grade/year combination.

# Session Info

It is good to include the session info, e.g. this document is produced with knitr version 0.9.6. Here is my session info:

``print(sessionInfo(), locale = FALSE)``
``````R version 2.15.2 (2012-10-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets
[7] methods   base

other attached packages:
[1] sandwich_2.2-9  quantreg_4.94   SparseM_0.96    gridExtra_0.9.1
[5] mgcv_1.7-22     eeptools_0.1    mapproj_1.2-0   maps_2.3-0
[9] proto_0.3-10    plyr_1.8        stringr_0.6.2   ggplot2_0.9.3
[13] lmtest_0.9-30   zoo_1.7-9       knitr_0.9.6

loaded via a namespace (and not attached):
[1] colorspace_1.2-0   dichromat_1.2-4    digest_0.6.0
[4] evaluate_0.4.3     formatR_0.7        gtable_0.1.2
[7] labeling_0.1       lattice_0.20-10    MASS_7.3-22
[10] Matrix_1.0-10      munsell_0.4        nlme_3.1-106
[13] RColorBrewer_1.0-5 reshape2_1.2.2     scales_0.2.3
[16] tools_2.15.2      ``````