% Tutorial 3: Manipulating Data in R
% DPI R Bootcamp
% Jared Knowles

# Overview

In this lesson we hope to learn:

• Aggregating data
• Organizing our data
• Manipulating vectors
• Dealing with missing data

# Again, read in our dataset

# Set working directory to the tutorial directory; in RStudio you can
# do this from the 'Tools' menu
setwd("~/GitHub/r_tutorial_ed")
# Note: if we don't assign the data to 'df', R just prints the contents
# of the table


# Aggregation

• Sometimes we need to do some basic checking for the number of observations or types of observations in our dataset
• To do this quickly and easily - the table function is our friend
• Let's look at our observations by year and grade
table(df$grade, df$year)


    2000 2001 2002
  3  200  100  200
  4  100  200  100
  5  200  100  200
  6  100  200  100
  7  200  100  200
  8  100  200  100

• The first command gives the rows, the second gives the columns
• Ugly, but effective
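As a minimal sketch of how table() cross-tabulates two vectors (toy grade/year data, not the tutorial's df):

```r
# Toy example: table() cross-tabulates two vectors
grade <- c(3, 3, 4, 4, 4, 5)
year  <- c(2000, 2001, 2000, 2000, 2001, 2001)
tab <- table(grade, year)
tab                # rows come from the first argument, columns from the second
tab["4", "2000"]   # count of grade-4 observations in 2000
```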

# Aggregation can be more complex

• Let's aggregate by race and year
table(df$year, df$race)


         A   B   H   I   W
  2000  16 370  93   7 414
  2001  16 370  93   7 414
  2002  16 370  93   7 414

• Race is consistent across years, interesting
• What if we want to only look at 3rd graders that year?

# More complicated still

with(df[df$grade == 3, ], {
    table(year, race)
})

      race
year    A   B   H   I   W
  2000  4  78  22   4  92
  2001  1  44   8   2  45
  2002  0  74  20   1 105

• with specifies a data object to work on, in this case all elements of df where grade == 3
• table is the same command as above, but since we specified the data object in the with statement, we don't need the df$ in front of the variables of interest
• We can also achieve the same result with subset:
df2 <- subset(df, grade == 3)
table(df2$year, df2$race)


         A   B   H   I   W
  2000   4  78  22   4  92
  2001   1  44   8   2  45
  2002   0  74  20   1 105

rm(df2)


# Quick exercise

• Can you find the number of black students in each grade in each year?
• hint: with(df[df$___ == ___, ] ...)
• How many in year 2002, grade 6? 48
• How many in 2001, grade 7? 39

# Answer

with(df[df$race == "B", ], {
    table(year, grade)
})

      grade
year    3  4  5  6  7  8
  2000 78 48 87 39 74 44
  2001 44 78 48 87 39 74
  2002 74 44 78 48 87 39

• Quick question: how can we understand the three types of enclosures used in this function: (), [], and {}?
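A quick illustrative sketch of the three enclosures: () calls a function, [] indexes an object, and {} groups expressions:

```r
x <- c(10, 20, 30)   # () is the function call to c()
x[2]                 # [] indexes: the second element, 20
y <- {               # {} groups expressions; its value is the last expression
    a <- x[1] + x[3]
    a / 2
}
y                    # 20
```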

# Tables cont.

• This is really powerful for looking at the descriptive dimensions of the data, we can ask questions like:
• How many students are at each proficiency level each year?
table(df$year, df$proflvl)


       advanced basic below basic proficient
  2000       56   313         143        388
  2001      229   183          64        424
  2002      503    27           3        367

• How many students are at each proficiency level by race?
table(df$race, df$proflvl)


    advanced basic below basic proficient
  A       19     7           3         19
  B      160   302         162        486
  H       54    76          33        116
  I        7     4           1          9
  W      548   134          11        549


# Proportional Tables

• What if we aren't interested in counts?
• R makes it really easy to calculate proportions
prop.table(table(df$race, df$proflvl))


     advanced     basic below basic proficient
  A 0.0070370 0.0025926   0.0011111  0.0070370
  B 0.0592593 0.1118519   0.0600000  0.1800000
  H 0.0200000 0.0281481   0.0122222  0.0429630
  I 0.0025926 0.0014815   0.0003704  0.0033333
  W 0.2029630 0.0496296   0.0040741  0.2033333

• Hmmm, this is goofy. This tells us the proportion of each cell out of the total. Also, the digits are distracting. How can we fix this?

# Try number 2

round(prop.table(table(df$race, df$proflvl), 1), digits = 3)


    advanced basic below basic proficient
  A    0.396 0.146       0.062      0.396
  B    0.144 0.272       0.146      0.438
  H    0.194 0.272       0.118      0.416
  I    0.333 0.190       0.048      0.429
  W    0.441 0.108       0.009      0.442

• The 1 tells R we want proportions row-wise; a 2 computes them column-wise
• round tells R to cut off some digits for us
• Proportions are just that, not in percentage terms (we need to multiply by 100 for this)
• Can you make this table express percentages instead of proportions? How might that code look?
• A few more problems arise - this pools all observations, including students across years
• To avoid these, we need to aggregate the data somehow
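One way to answer the percentage question above, sketched on made-up race/proficiency vectors rather than df: multiply the row-wise proportions by 100 before rounding.

```r
# Toy data standing in for df's race and proflvl columns
race    <- c("A", "A", "B", "B", "B", "W")
proflvl <- c("basic", "advanced", "basic", "advanced", "advanced", "basic")
pcts <- round(prop.table(table(race, proflvl), 1) * 100, digits = 1)
pcts   # each row now sums to (approximately) 100
```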

# Checking Understanding

• We have seen how to chain functions together
• We have also seen how to examine a dataframe by looking at the observations in it
• We are now going to move on to aggregating data so we can look at unique cases when we have more than one observation for each unit

# Aggregating Data

• One of the most common tasks is computing aggregates of our data
• R has an aggregate function that helps us avoid the pooling problems above
• This works great for simple aggregation such as scale score by race; we just need a formula (think: I want variable X by grouping factor Y) and the statistic we want to compute
# Reading Scores by Race
aggregate(readSS ~ race, FUN = mean, data = df)

  race readSS
1    A  508.7
2    B  460.2
3    H  473.2
4    I  485.2
5    W  533.2


# Aggregate (II)

• aggregate can take us a little further: we can aggregate multiple variables at a time
aggregate(cbind(readSS, mathSS) ~ race, data = df, mean)

  race readSS mathSS
1    A  508.7  477.9
2    B  460.2  442.5
3    H  473.2  442.7
4    I  485.2  455.9
5    W  533.2  529.8

• We can add multiple grouping variables using the formula syntax
head(aggregate(cbind(readSS, mathSS) ~ race + grade, data = df, mean), 8)

  race grade readSS mathSS
1    A     3  397.8  454.8
2    B     3  409.8  371.6
3    H     3  417.7  364.2
4    I     3  407.6  449.3
5    W     3  481.1  450.7
6    A     4  456.0  438.2
7    B     4  426.9  408.1
8    H     4  418.8  404.6


# Crosstabs

• We can build a systematic cross-tab now
ag <- aggregate(readSS ~ race + grade, data = df, mean)
xtabs(readSS ~ ., data = ag)

    grade
race     3     4     5     6     7     8
A 397.8 456.0 479.1 539.5 600.4 605.3
B 409.8 426.9 447.6 470.9 492.3 523.5
H 417.7 418.8 481.2 489.1 500.3 534.2
I 407.6 531.1 547.6   0.0 405.5 518.0
W 481.1 498.5 517.1 546.6 565.2 596.1

• And prettier output
ftable(xtabs(readSS ~ ., data = ag))

     grade     3     4     5     6     7     8
race
A          397.8 456.0 479.1 539.5 600.4 605.3
B          409.8 426.9 447.6 470.9 492.3 523.5
H          417.7 418.8 481.2 489.1 500.3 534.2
I          407.6 531.1 547.6   0.0 405.5 518.0
W          481.1 498.5 517.1 546.6 565.2 596.1
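The xtabs/ftable pattern above can be sketched on a tiny made-up aggregated data frame: xtabs spreads a value column across the formula's dimensions, and ftable flattens the result.

```r
# Toy aggregated data (not the tutorial's ag object)
ag_toy <- data.frame(race   = c("A", "A", "B", "B"),
                     grade  = c(3, 4, 3, 4),
                     readSS = c(400, 450, 410, 430))
xt <- xtabs(readSS ~ race + grade, data = ag_toy)
xt
ftable(xt)   # flattens the cross-tab for compact printing
```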


• What is the mean reading score for 6th grade students with disabilities?

• 481.83
• How many points is this from non-disabled students?

• 29.877

aggregate(cbind(readSS, mathSS) ~ disab + grade, data = df, mean)

   disab grade readSS mathSS
1      0     3  449.9  418.3
2      1     3  421.1  376.3
3      0     4  464.0  454.2
4      1     4  438.2  425.1
5      0     5  484.9  470.2
6      1     5  475.1  431.0
7      0     6  511.7  507.9
8      1     6  481.8  476.9
9      0     7  532.0  532.0
10     1     7  516.1  474.3
11     0     8  567.6  567.7
12     1     8  518.8  534.1


# School Means

• Consider the case we want to turn our student level data into school level data
• Who hasn't had to do this?!?
• In aggregate we do:
z <- aggregate(readSS ~ dist, FUN = mean, data = df)
z

  dist readSS
1  205  496.5
2  402  500.5
3  495  491.6

• But I want more! I want to aggregate multiple variables. I want to do it across multiple groups. I want the output to be a dataframe I can work on.
• Thank you plyr

# Aggregate Isn't Enough

• aggregate is cool, but it isn't very flexible
• We can only use aggregate output as a table, which we have to convert to a data frame
• There is a better way; the plyr package
• plyr is a set of routines/logical structure for transforming, summarizing, reshaping, and reorganizing data objects of one type in R into another type (or the same type)
• We will focus here on summarizing and aggregating a data frame, but later in the bootcamp we'll apply functions to lists and turn lists into data frames as well
• This is cool!

# The Logic of plyr

• In R this is known as “split, apply, and combine”
• Why? First, we split the data into groups by some factor or logical operator
• Then we apply some function or another to that group (i.e. count the unique values of a variable, take the mean of a variable, etc.)
• Then we combine the data back together
• This has some advantages - unlike other methods, the data does not have to be ordered by our ID variable for this to work
• The disadvantage is that this method is computationally expensive, even in R, and requires copying our data frame using up RAM

• The plyr package has a number of utilities to help us split-apply-combine across data types for both input and output
• In R we can't just use for loops to iterate over groups of students, because in R for loops are slow, inefficient, and impractical
• plyr to the rescue, while not as fast as a compiled language, it is pretty dang good!
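The split-apply-combine logic above can be sketched in base R, without plyr, using split() and sapply() on toy data:

```r
# Split a vector of scores by group, apply mean to each piece,
# then combine the results back together
scores <- c(80, 90, 70, 60, 100)
group  <- c("a", "a", "b", "b", "b")
pieces <- split(scores, group)    # split into groups
means  <- sapply(pieces, mean)    # apply a function to each piece
means                             # combine: a named vector of group means
```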

# The logic of plyr

• This shows how the dataframe is broken up into pieces and each piece then gets whatever functions, summaries, or transformations we apply to it

# How plyr works on dataframes

• And this shows the intermediate output ddply produces, before it combines the pieces back together, when we make the call ddply(df, .(sex, age), "nrow")

# Using plyr

• plyr has a straightforward syntax
• All plyr functions are in the format XXply. The first letter specifies the input data type, and the second letter specifies the output data type
• In plyr d = dataframe, l = list, m = matrix, and a = array. By far the most common usage is ddply
• From a dataframe, to a dataframe
• We will see more of plyr in Tutorial 4 as well

# plyr in Action

library(plyr)
myag <- ddply(df, .(dist, grade), summarize,
              mean_read = mean(readSS, na.rm = TRUE),
              sd_read = sd(readSS, na.rm = TRUE),
              count_read = length(readSS),
              mean_math = mean(mathSS, na.rm = TRUE),
              sd_math = sd(mathSS, na.rm = TRUE),
              count_math = length(mathSS))

• This looks complex, but it only has a few components.
• The first argument is the dataframe we are working on, the next argument is the level of identification we want to aggregate to
• summarize tells ddply what we are doing to the data frame
• Then we make a list of new variable names, and how to calculate them on each of the subsets in our large data frame
• That's it!
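For comparison, here is a base-R sketch (on toy data, no plyr needed) of the kind of per-group summary that ddply with summarize computes:

```r
# Toy stand-in for df: compute per-district mean and count of math scores
d <- data.frame(dist   = c(1, 1, 2, 2, 2),
                mathSS = c(400, 420, 500, NA, 520))
out <- do.call(rbind, lapply(split(d, d$dist), function(p) {
    data.frame(dist       = p$dist[1],
               mean_math  = mean(p$mathSS, na.rm = TRUE),
               count_math = length(p$mathSS))
}))
out
```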

# Results

head(myag)


# More plyr

• This is great, we can quickly build a summary dataset from individual records
• A few advanced tricks. How do we build counts and percentages into our dataset?
myag <- ddply(df, .(dist, grade), summarize,
              mean_math = mean(mathSS, na.rm = TRUE),
              sd_math = sd(mathSS, na.rm = TRUE),
              count_math = length(mathSS),
              count_black = length(race[race == "B"]),
              per_black = length(race[race == "B"]) / length(mathSS))  # hypothetical name for the percentage column
summary(myag[, 7:10])


# Note for SQL Junkies

• There is an alternate package to plyr called data.table which is really handy
• It allows SQL like querying of R data frames
• It is incredibly fast
• It will be incorporated into the next plyr version
• You can read up on it online
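A tiny sketch of data.table's SQL-like syntax, assuming the data.table package is installed; its DT[i, j, by] form roughly mirrors SQL's WHERE / SELECT / GROUP BY:

```r
library(data.table)
dt <- data.table(grade = c(3, 3, 4), readSS = c(400, 420, 450))
dt[grade == 3, mean(readSS)]                    # WHERE + SELECT-style
dt[, .(mean_read = mean(readSS)), by = grade]   # GROUP BY-style
```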

# Quick Exercises in ddply

• What if we want to compare how districts do on educating ELL students?
• What district ID has the highest mean score for 4th grade ELL students on reading? Math?

• 66 in reading, 105 in math
• How many students are in these classes?

• 12 and 7 respectively

myag2 <- ddply(df, .(dist, grade, ell), summarize,
               mean_read = mean(readSS, na.rm = TRUE),  # hypothetical: needed for the reading answer
               mean_math = mean(mathSS, na.rm = TRUE),
               sd_math = sd(mathSS, na.rm = TRUE),
               count_math = length(mathSS),
               count_black = length(race[race == "B"]))

# Sorting

• A key way to explore data in tabular form is to sort data
• Sorting data in R can be dangerous: it is easy to reorder a single vector of a dataframe on its own, scrambling it relative to the other columns
• We use the order function to sort data
df.badsort <- order(df$readSS, df$mathSS)
head(df.badsort)

[1]  106 1026    2   56  122  118

• Why is this wrong? What is R giving us?
• Row indices (the rownames)…
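A minimal sketch of why order() alone "sorts wrong": it returns positions, not reordered values.

```r
x <- c(30, 10, 20)
order(x)      # 2 3 1: where the smallest, next-smallest, ... values sit
x[order(x)]   # 10 20 30: indexing by those positions does the actual sorting
```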

# Correct Example

• To fix it, we need to tell R to reorder the dataframe using the rownames in the order we want
df.sort <- df[order(df$readSS, df$mathSS, df$attday), ]
head(df[, c(3, 23, 29, 30)])

   stuid attday readSS mathSS
1 149995    180  357.3  387.3
2  13495    180  263.9  302.6
3 106495    160  369.7  365.5
4  45205    168  346.6  344.5
5 142705    156  373.1  441.2
6  14995    157  436.8  463.4

head(df.sort[, c(3, 23, 29, 30)])

      stuid attday readSS mathSS
106  106705    160  251.5  277.0
1026  80995    176  263.2  377.8
2     13495    180  263.9  302.6
56   122402    180  264.3  271.7
122   79705    168  266.4  318.7
118   40495    173  266.9  275.0

# Let's clean it up a bit more
head(df[with(df, order(-readSS, -attday)), c(3, 23, 29, 30)])

      stuid attday readSS mathSS
1631 145205    137  833.2  828.4
1462 107705    180  773.3  746.6
2252 122902    180  744.0  621.6
2341  44902    175  741.7  676.3
1482 134705    180  739.2  705.4
1630  14495    162  738.9  758.2

• Here we find the high performing students; note that the - denotes we want descending order (R's default is ascending order)
• This is easy to correct

# About sorting

• Sorting works differently on some data types like matrices
M <- matrix(c(1, 2, 2, 2, 3, 6, 4, 5), 4, 2, byrow = FALSE,
            dimnames = list(NULL, c("a", "b")))
M[order(M[, "a"], -M[, "b"]), ]

     a b
[1,] 1 3
[2,] 2 6
[3,] 2 5
[4,] 2 4

# About Sorting

• Tables are familiar
mytab <- table(df$grade, df$year)
mytab[order(mytab[, 1]), ]

    2000 2001 2002
  4  100  200  100
  6  100  200  100
  8  100  200  100
  3  200  100  200
  5  200  100  200
  7  200  100  200

mytab[order(mytab[, 2]), ]

    2000 2001 2002
  3  200  100  200
  5  200  100  200
  7  200  100  200
  4  100  200  100
  6  100  200  100
  8  100  200  100

# Filtering Data

• Filtering data is an incredibly powerful feature and we have already seen it used to do some interesting things
• Filtering data in R is loaded with trouble though, because the filtering arguments must be very carefully specified
• Filtering is like a mini-sort, and we've done it already
• Always, always, always check your work
• And remember, this is the place the NAs do the most damage
• Let's look at some examples

# Basic Filtering a Column

# Gives all rows that meet this requirement
df[df$readSS > 800, ]

           X school  stuid grade schid dist white black hisp indian
1631 1281061    852 145205     8    15  205     1     0    0      0
asian econ female ell disab sch_fay dist_fay luck ability
1631     0    0      1   0     0       0        0    0   108.3
measerr teachq year attday schoolscore district schoolhigh
1631   6.325  155.7 2001    137       227.7       19          0
schoolavg schoollow readSS mathSS  proflvl race
1631         1         0  833.2  828.4 advanced    W

df$grade[df$mathSS > 800]

[1] 8

# Gives all values of grade that meet this requirement

• Before the brackets we specify what we want returned, and within the brackets we present the logical expression to evaluate
• Behind the scenes R does a logical test and gets the row numbers that match the logical expression
• It then combines them back with the object in front of the brackets to return the values
• This seems basic enough, let's filter on multiple dimensions
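The mechanics can be sketched on a toy vector: the bracket expression is just a logical test, and R returns the elements where it is TRUE.

```r
readSS <- c(500, 810, 650, 900)   # toy stand-in for df$readSS
readSS > 800             # FALSE TRUE FALSE TRUE
which(readSS > 800)      # 2 4: the matching positions
readSS[readSS > 800]     # 810 900: the matching values
```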

# Multiple filters

df$grade[df$black == 1 & df$readSS > 650]

 [1] 8 7 8 6 6 7 8 7 8 8 8 4

• The & operator tells R we want rows where both of these are true
• How would we tell R we wanted rows where either were true?
• What happens if we type df$black = 1 or black == 1?
• Why won't this work?

# Using filters to assign values

• We can also use filters to assign values as well
• This is how you recode variables and create new ones
• Let's create a variable spread indicating whether a district has high or low spread among its student scores
myag$spread <- NA  # create variable
myag$spread[myag$sd_read < 75] <- "low"
myag$spread[myag$sd_read > 75] <- "high"
myag$spread <- as.factor(myag$spread)
summary(myag$spread)

• How did we define spread in this block of code?
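The same recode can be written more compactly with nested ifelse(); a sketch on a toy stand-in for myag:

```r
# Toy data frame in place of myag
toyag <- data.frame(sd_read = c(60, 80, 75, 90))
toyag$spread <- ifelse(toyag$sd_read < 75, "low",
                       ifelse(toyag$sd_read > 75, "high", NA))
toyag$spread <- as.factor(toyag$spread)
summary(toyag$spread)   # note sd_read == 75 stays NA, as in the two-filter recode
```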

# How does it work?

• The previous block of code is a useful way to learn how to recode variables
myag$spread <- NA  # create variable
myag$spread[myag$sd_read < 75] <- "low"
myag$spread[myag$sd_read > 75] <- "high"
myag$spread <- as.factor(myag$spread)

• Create a new variable in myag called schoolperf for mean_math scores with the following coding scheme:

Grade   Score Range   Code
    3   >425          "hi"
    4   >450          "hi"
    5   >475          "hi"
    6   >500          "hi"
    7   >525          "hi"
    8   >575          "hi"

• All other values are coded as "lo"
• How many "hi" and "lo" observations do we have? By dist?

# Results

myag$schoolperf <- "lo"
myag$schoolperf[myag$grade == 3 & myag$mean_math > 425] <- "hi"
myag$schoolperf[myag$grade == 4 & myag$mean_math > 450] <- "hi"
myag$schoolperf[myag$grade == 5 & myag$mean_math > 475] <- "hi"
myag$schoolperf[myag$grade == 6 & myag$mean_math > 500] <- "hi"
myag$schoolperf[myag$grade == 7 & myag$mean_math > 525] <- "hi"
myag$schoolperf[myag$grade == 8 & myag$mean_math > 575] <- "hi"
myag$schoolperf <- as.factor(myag$schoolperf)
summary(myag$schoolperf)
table(myag$dist, myag$schoolperf)

# Let's replace data

• For district 6 let's negate the grade 3 scores by replacing them with missing data
myag$mean_read[myag$dist == 6 & myag$grade == 3] <- NA

• Let's replace one data element with another
myag$mean_read[myag$dist == 6 & myag$grade == 3] <- myag$mean_read[myag$dist == 6 & myag$grade == 4]

• Voila

# Why do NAs matter so much?

• Let's consider the case above but insert some NA values for all 3rd grade tests
myag$mean_read[myag$grade == 3] <- NA
head(myag[order(myag$grade), 1:4])

# NAs II

• Now let's calculate a few statistics:
mean(myag$mean_math)
mean(myag$mean_read)

• Remember, NA values propagate, so R assumes an NA value could take literally any value; as such, it is impossible to know the mean of a vector containing NA
• We can override this though:
mean(myag$mean_math, na.rm = T)
mean(myag$mean_read, na.rm = T)

# Beyond the Mean

• But for other problems it is tricky
• What if we want to know the number of rows that have a mean_read of less than 500?
length(myag$dist[myag$mean_read < 500])
head(myag$mean_read[myag$mean_read < 500])

• And what if we want to add the standard deviation to these vectors?
badvar <- myag$mean_read + myag$sd_read
summary(badvar)

# So we need to filter NAs explicitly

• Consider the case where two sets of variables have different missing elements
myag$sd_read[myag$count_read < 100 & myag$mean_read < 550] <- NA
length(myag$mean_read[myag$mean_read < 550])
length(myag$mean_read[myag$mean_read < 550 & !is.na(myag$mean_read)])

• What is !is.na()?
• is.na() is a helpful function that returns TRUE if a value is missing
• ! is the negation operator
• We are asking R whether each value is not missing, and to only give us the non-missing values back

# Merging Data

• It is unlikely all the data we will want resides in a single dataset, and often we have to combine data from several sources
• R makes this easy, but that simplicity comes at a cost - it can be easy to make mistakes if you don't specify things carefully
• Let's merge attributes about a student's school with the student row data
• We might want to do that if we want to evaluate the performance of students in different school climates, and school climate was measured in part by the mean performance

# Merging Data II

• We have two data objects: df, which has multiple rows per student, and myag, which has multiple rows per school
• What are the variables that link these two together?
names(myag)
names(df[, c(2, 3, 4, 6)])

[1] "school" "stuid" "grade" "dist"

• It looks like dist and grade are in common. Is this ok?
• Why might we want to consider re-aggregating with year as well?
• For this example we won't just yet

# Merge Options

• We have a few options with merge we want to consider with ?merge
• In the simple case we let merge automagically combine the data
simple_merge <- merge(df, myag)
names(simple_merge)

• It looks like it did a good job

# Merge Options

• In complicated cases, merge has some important options we should review
• First is the simple sounding 'by' argument:
• simple_merge <- merge(df1, df2, by = c("id1", "id2"))
• We can also specify simple_merge <- merge(df1, df2, by.x = c("id1", "id2"), by.y = c("id1_a", "id2_a"))
• This allows us to have different names for our ID variables
• Now, what if we have two different sized objects and not all matches between them?
• notsosimple_merge <- merge(df1, df2, all.x = TRUE, all.y = TRUE)
• We can tell R whether we want to keep all of the x observations (df1), all the y observations (df2), neither, or both

# Reshaping Data

• Reshaping data is a slightly different issue than aggregating data
• Let's review the two data types: long and wide
head(df[, 1:10], 3)

    X school  stuid grade schid dist white black hisp indian
1  44      1 149995     3   105  495     0     1    0      0
2  53      1  13495     3    45  495     0     1    0      0
3 116      1 106495     3    45  495     0     1    0      0

• Now let's look at wide:
head(widedf[, c(1, 28:40)], 3)

   stuid readSS.2000 mathSS.2000 proflvl.2000 race.2000  X.2001
1 149995       357.3       387.3        basic         B  441000
2  13495       263.9       302.6  below basic         B  531000
3 106495       369.7       365.5        basic         B 1161000
  school.2001 grade.2001 schid.2001 dist.2001 white.2001 black.2001
1           1          4        105       495          0          1
2           1          4         45       495          0          1
3           1          4         45       495          0          1
  hisp.2001 indian.2001
1         0           0
2         0           0
3         0           0

• How did we reshape this data?

# Wide Data v. Long Data

• The great debate
• Most econometrics, panel, and time series datasets come wide, so these seem familiar
• R for most cases prefers long data, including for most graphing and analysis functions
• So we have to learn both

# The reshape Function

• reshape is the way to move from wide to long
• The data stays the same, but the shape of it changes
• The long data had dimensions: 2700, 32
• The wide data has dimensions: 1200, 91
• How do we get to these numbers?
• The rows in the wide dataframe represent unique students

# Deconstructing reshape

widedf <- reshape(df, timevar = "year", idvar = "stuid",
                  direction = "wide")

• idvar represents the unit we want to represent a single row; in this case each unique student gets a single row
• In this simple case timevar is the variable that differentiates between two rows with the same student ID
• Note that timevar needn't always represent time!
• direction tells R we are going to move to wide data
• As written all data will move, but using the varying argument we can tell R explicitly which items we want to move wide

# What about Wide to Long?

• We often need to do this to plot data in R
• Luckily the reshape function works well in both directions
longdf <- reshape(widedf, idvar = "stuid", timevar = "year",
                  varying = names(widedf[, 2:91]), direction = "long",
                  sep = ".")

• If our data is formatted nicely, R can do the guessing and identify the years for us by parsing the dataframe names

# Subsetting Data

• We have already seen a lot of subsetting examples above, which is what filtering is, but R provides some great shortcuts
• Let's look at the subset function to get only 4th grade scores
g4 <- subset(df, grade == 4)
dim(g4)

[1] 400  32

• This is equivalent to:
g4_b <- df[df$grade == 4, ]

• These two elements are the same:
identical(g4, g4_b)

[1] TRUE
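Looping back to reshaping for a moment, here is a fully self-contained toy sketch of reshape() moving wide data long, with the years parsed from the column names:

```r
# Toy wide data: two students, two years of reading scores
wd <- data.frame(stuid       = 1:2,
                 readSS.2000 = c(400, 410),
                 readSS.2001 = c(450, 460))
ld <- reshape(wd, idvar = "stuid", timevar = "year",
              varying = c("readSS.2000", "readSS.2001"),
              direction = "long", sep = ".")
ld   # one row per student-year
```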


# That's it

• Now you can filter, subset, sort, recode, and aggregate data!
• Let's look at a few exercises to test these skills
• Once these skills are mastered, we can begin to understand how to automate R to clean data with known errors, and to recode data in R so it is ready to be used for analysis
• Then we can really take off!

# Exercises

1. Say we are unhappy about attributing the school/grade mean score across years to student-year observations like we did in this lesson. Let's fix it by first aggregating our student data frame to a school/grade/year data frame, and second by merging that new data frame with our student level data.

2. Sort the student-level data frame on attday and ability in descending order.

3. Find the highest proportion of black students in any school/grade/year combination.

# Session Info

It is good to include the session info, e.g. this document is produced with knitr version 0.9.6. Here is my session info:

print(sessionInfo(), locale = FALSE)

R version 2.15.2 (2012-10-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)

attached base packages:
[1] splines   grid      stats     graphics  grDevices utils
[7] datasets  methods   base

other attached packages:
[1] R2SWF_0.4        snow_0.3-10      gbm_1.6-3.2
[4] reshape_0.8.4    caret_5.15-048   foreach_1.4.0
[7] cluster_1.14.3   reshape2_1.2.2   lme4_0.999999-0
[10] Matrix_1.0-10    lattice_0.20-10  xtable_1.7-0
[13] vcd_1.2-13       colorspace_1.2-0 MASS_7.3-22
[16] Hmisc_3.10-1     survival_2.37-2  sandwich_2.2-9
[19] quantreg_4.94    SparseM_0.96     gridExtra_0.9.1
[22] mgcv_1.7-22      eeptools_0.1     mapproj_1.2-0
[25] maps_2.3-0       proto_0.3-10     plyr_1.8
[28] stringr_0.6.2    ggplot2_0.9.3    lmtest_0.9-30
[31] zoo_1.7-9        knitr_0.9.6

loaded via a namespace (and not attached):
[1] codetools_0.2-8    compiler_2.15.2    dichromat_1.2-4
[4] digest_0.6.0       evaluate_0.4.3     formatR_0.7
[7] gtable_0.1.2       iterators_1.0.6    labeling_0.1
[10] markdown_0.5.3     munsell_0.4        nlme_3.1-106
[13] RColorBrewer_1.0-5 scales_0.2.3       stats4_2.15.2
[16] tools_2.15.2