--- title: "dplyr & ggplot2 pipeline with survey data" author: "Ryan Peek" date: "7/19/2017" output: html_document: # put on it's own line to add additional options toc: true # add a table of contents toc_depth: 2 # show only first two levels in TOC --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # **MORNING**: Data Pipeline for Ecology Data First load libraries ```{r libraries} suppressPackageStartupMessages(library(tidyverse)) # keep messages from appearing in html output or on your screen ``` ## Get The Data & Inspect Download the data using the url technique: ```{r getdata, eval=FALSE} # create a directory to put the data in dir.create("data") # make sure we have a data dir to download file to # download a copy locally download.file(url="https://mikoontz.github.io/data-carpentry-week/data/combined.csv", destfile = "data/combined.csv") # this downloads and names your file ``` > One extra note about `.Rmd` files, if you saved your `Rmd` in a "docs" folder, or in a folder of any name, you'll need to provide a `../` in front of your path, because when you run code in `Rmd` it defaults to the directory that the `Rmd` lives. `..` means go UP a directory. The base directory or root directory is wherever your .Rproj file lives. R scripts will always default to wherever the .Rproj file is, whereas .Rmd files default to wherever they actually are. ```{r readdata} # read in that data to your environment survey <- read.csv("data/combined.csv") ``` ### Summarize Data Now let's figure out what we have using `summary` and `glimpse` functions. ```{r inspect} summary(survey) glimpse(survey) # a way to view counts of categories in a variable table(survey$species_id) # this is a factor vector # test if table() works on character data types table(as.character(survey$species_id)) # works on character vectors # let's look at sex column for NA's and blanks table(survey$sex) ``` So it looks like many different species, but abundances (counts) are widely variable. Some have 1, others have as many as 2000. Also appears to have many NA's in the `hindfoot_length` and `weight` columns. And we have "blanks" in the `sex` column. ## Filter Data ### Remove NAs and Blanks Let's get rid of NA's from **`weight`** & **`hindfoot_length`** columns, and the mysterious blanks (`""`) from the sex column. Write that object to a new dataframe called "*surveys_complete*". ```{r filterNas} surveys_complete <- survey %>% filter(!is.na(weight), # remove NAs in weight !is.na(hindfoot_length), # remove NAs in hindfoot_length sex!="", # could also use !sex=="" species_id!="") # remove missing species_id ``` ### Filter Rare Species & `%in%` Use `dplyr` to create a separate dataframe of the species that have been observed more than 50 times. Then filter the `surveys_complete` based on those counts using the **`%in%`** command in `filter`. ```{r filterRareSpp} ## Extract the most common species_id species_counts <- surveys_complete %>% group_by(species_id) %>% # grouping here so the count/tally will be by this group tally() %>% filter(n >= 50) # filter to common species # quick histogram of species obs hist(species_counts$n) ## Only keep the most common species surveys_complete2 <- surveys_complete %>% filter(species_id %in% species_counts$species_id) ``` Some additional examples using `%in%`, a way to look for matches in one list vs. another. ```{r %in% example} animals<- c("racoon", "rabbit", "cat", "dog", "frog","penguin") friendlyanimals <- c("frog", "dog") animals %in% friendlyanimals # finds match with friendlyanimals only !animals %in% friendlyanimals # finds only animals that don't match friendly animals friendlyanimals %in% animals # will work either way, # usually use % going from "big table" to "lookup/little table" ``` #### Factors and Levels Quick note about weird factor things. ```{r convrtfactorlevels} # notice there are levels with no observations table(surveys_complete2$species_id) # factors retain levels even when no observations exist # how to get rid of "empty factor levels" surveys_complete2$species_id <- factor(surveys_complete2$species_id) # that should have dropped unused factors from this column and re-assigned to the dataframe table(surveys_complete2$species_id) ``` **Note** when you save this to `.csv` the data will not retain all these unused factors, so you won't need to convert or "drop levels" if you are exporting and reading a csv back in. ## Export/Save Data as CSV Export without rownames using `write.csv` ```{r exportcsv} write.csv(surveys_complete2, file = "data_output/surveys_complete.csv", row.names = FALSE) ``` # Visualize Data with `ggplot2` Need to read in our data and libraries first. ```{r loadcleandata} library(tidyverse) surveys_complete <- read.csv("data_output/surveys_complete.csv") ``` Now we have a clean dataset and we want to start looking at it. Let's make some visualizations and plot our data. ## Boxplot & Violin Plots - Create a boxplot/violinplot to visualize the distribution of the `weight` variable for each species. - Create a different box/violinplot to show the distribution of the `hindfoot_length` variable for each species. *Let's look at `weight`* ```{r boxplotWeight} surveys_complete %>% ggplot() + geom_boxplot(aes(x=species_id, y=weight), fill="skyblue") + theme_bw() ``` ```{r violplotWeight} surveys_complete %>% ggplot() + geom_violin(aes(x=species_id, y=weight), fill="skyblue") + theme_classic() ``` *Ok! Now let's look at `hindfoot_length`* ```{r boxHindfoot} surveys_complete %>% ggplot() + geom_boxplot(aes(x=species_id, y=hindfoot_length), fill="salmon") + theme_bw() ``` ```{r violHindfoot} surveys_complete %>% ggplot() + geom_violin(aes(x=species, y=hindfoot_length), fill="salmon") + theme_bw() ``` ## Boxplots with Points and `geom_jitter` Boxplots are great for showing a lot of information about the spread of a variable, but some information from the raw data is lost. - Add a point geometry to your box/violinplot to visualze the raw `hindfoot_length data` ```{r boxptHindfoot} surveys_complete %>% ggplot(aes(x=species_id, y=hindfoot_length)) + geom_boxplot() + geom_point(color="forestgreen") + theme_bw() ``` Unfortunately, adding the raw data makes our figure “overplotted”. There are too many points overlapping and information is lost this way, too. - Make a new boxplot of `hindfoot_length` for each species, but this time plot a jittered version of the raw data as a geometry on top of your boxplot ```{r boxptHindfootJitt} surveys_complete %>% ggplot(aes(x=species_id, y=hindfoot_length)) + geom_boxplot() + geom_jitter() ``` Color by plot id: ```{r boxptHindfootJitt2} surveys_complete %>% ggplot(aes(x=species_id, y=hindfoot_length, color=plot_id)) + geom_jitter(alpha=0.5) + geom_boxplot() ``` Whoa, plotted the `plot_id` as a numeric range instead of independent observations or locations. How do we fix? ```{r boxptHindfootJitt3} surveys_complete %>% ggplot(aes(x=species_id, y=hindfoot_length)) + # change to factor for plotting only geom_jitter(aes(color=as.factor(plot_id)),alpha=0.5) + geom_boxplot() + # add a better title and axis label xlab("Species ID") + ylab("Hindfoot Length (mm)")+ ggtitle("A Crazy Plot") ``` ## Changing Labels & Legend Titles? How to change legend title? ```{r boxptHindfootJitt4} surveys_complete %>% ggplot(aes(x=species_id, y=hindfoot_length)) + # change to factor for plotting only geom_jitter(aes(color=as.factor(plot_id)),alpha=0.5) + geom_boxplot() + # add a better title and axis label xlab("Species ID") + ylab("Hindfoot Length (mm)")+ ggtitle("A Crazy Plot") + labs(color="Plot ID") # OR scale_color_discrete(name="Plot ID") ``` # **AFTERNOON**: Custom Functions What is a function? Something that takes something (data), and does something (a calculation) on that data. ```{r barebonesFx} # Bare bones function call hypotenuse <- function(a, b) { myHypotenuse <- sqrt(a^2 + b^2) return(myHypotenuse) } hypotenuse(4, 5) ``` Difference between `hypotenuse` and `myHypotenuse` is: - `hypotenuse` is the *name of the function* - `myHypotenuse` is the name of the *local* object that I've assigned within the function. ```{r Fx objects} hypotenuse <- function(a, b) { bookshelves <- sqrt(a^2 + b^2) return(bookshelves) } hypotenuse(4, 5) # Objects defined within the function stay within the function! # bookshelves # this won't work because it doesn't exist, # because I defined it within a function # you can also pass an operation directly to return() without any need # to define an intermediate variable hypotenuse <- function(a, b) { return(sqrt(a^2 + b^2)) # but probably better to be explicit as above } hypotenuse(43, 5) ``` In R, you don't need to use a return function, the function will *return* whatever the output of the calculation is: ```{r Fx_return} hypotenuse <- function(a, b) { sqrt(a^2 + b^2) # but probably better to be explicit as above } hypotenuse(43, 5) ``` ## Default Arguments to a Function You can name your arguments a bit more explicitly, which is helpful when sharing use the function. More information about what the pieces are doing. **Naming your arguments** ```{r namingFxArgs} hypotenuse(a=43, b=5) hypotenuse(b=5, a=43) ``` You can explicitly put a default value in the function. The user of that function can then explicitly add a value, or just run with the default. If you have a default value, you can still pass a different value for each time you call your function. ```{r fxdefault, eval=TRUE} hypotenuse <- function(a, b = 12) { theHypotenuse <- sqrt(a^2 + b^2) return(theHypotenuse) } hypotenuse(a = 5) # hypotenuse(b = 4) # throws an error because no default for "a" hypotenuse(a = 3, b = 4) ``` What if you just put `5` instead of `a = 5`? ```{r} hypotenuse(5) ``` Looks like that still works...so we *don't have to be* explicit, as long as a default value exists. ## Defensive programming How do we add errors or failsafe's to keep the function from being changed? ```{r ifstopFx} hypotenuse <- function(a, b) { if(a <= 0) { stop("Triangle side must have a length greater than 0") } theHypotenuse <- sqrt(a^2 + b^2) return(theHypotenuse) } hypotenuse(a = 2, b = 12) ``` ```{r ifstopError, eval=F, echo=T} hypotenuse(a = -4, b = 12) # will throws error: ``` `Error in hypotenuse(a = -4, b = 12) : Triangle side must have a length greater than 0` ## Temperature conversion Add to the code below to make a function that converts Fahrenheit to Kelvin. Remember that temperature in **K** is equal to the **F** temperature minus 32 * 5/9 and the whole thing plus 273.15. $K = (F - 32) * \left(\frac{5}{9}\right) + 273.15$ ```{r FtoX} F_to_X <- function(temp) { K <- ((temp - 32) * 5/9) + 273.15 return(K) } # Freezing point of water in Kelvin F_to_X(32) # this works across a whole fector as well: myTemps <- c(0, 32, 98.6, 212) F_to_X(myTemps) ``` ```{r KtoC} # A function that converts K to C # subtract 273.15 from K to get to C K_to_C <- function(temp){ C <- temp - 273.5 return(C) } K_to_C(100) ``` You can also nest functions and combine them at will: ```{r nestingFxs} K_to_C(F_to_X(98.6)) ``` ## Sourcing Functions We can save a function as it's own standalone script, and then use `source()`. This will essentially allow use to load the function for use in whatever script we may want. For example, let's say we saved a copy of these functions in a "`temperature_conversion.R`" file in our `scripts` folder. If we were starting in an entirely different script or starting a new session (but within same project), we could use the following lines of code to make those functions available: ```{r sourcing, eval=F} source("scripts/temperature_conversion.R") # should see the functions K_to_C and F_to_X in our Environment Tab under "Functions". # we can then apply those functions as needed: K_to_C(F_to_X(110)) # how hot was Davis a few weeks ago in Kelvin? ``` ## Create Gapminder Function to Plot Combining functions to make some plots! Write a new function that: - Takes two arguments, the `gapminder` data.frame and the name of a `country` - Plots the change in the country’s population over time - The return value from the function should be a ggplot object. It is often easier to modify existing code than to start from scratch. Feel free to start with the `plotPopGrowth` function code. *Note: I've added some error checking here as an example...you wouldn't need that piece for this to works* ```{r plotPopGrowthFx} library(tidyverse) library(gapminder) plotPopGrowth <- function(countryToPlot, theData = gapminder) { # this filters data to a specific country oneCountry <- theData %>% filter(country == countryToPlot) # can add some error checking if you want if(!nrow(oneCountry)>0) { # checks if there are values for country stop("No Data for this Country, please try again") } # this runs the plot ggplot(data=oneCountry) + geom_line(aes(x=year, y=pop, color=country)) } ``` Now can we use it? ```{r usePlotPopFx} # plotPopGrowth("Afghanistans") # gives error plotPopGrowth("Afghanistan") plotPopGrowth("United States") ``` ```{r} countries <- unique(gapminder$country) for(i in seq_along(countries)){ plotPopGrowth(countries[i]) ggsave(filename = paste0("figures/plot_",countries[i], ".png")) } ```