---
title: "dplyr & ggplot2 pipeline with survey data"
author: "Ryan Peek"
date: "7/19/2017"
output: 
  html_document: # put on it's own line to add additional options
    toc: true # add a table of contents
    toc_depth: 2 # show only first two levels in TOC
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# **MORNING**: Data Pipeline for Ecology Data

First load libraries

```{r libraries}
suppressPackageStartupMessages(library(tidyverse)) # keep messages from appearing in html output or on your screen
```

## Get The Data & Inspect

Download the data using the url technique:

```{r getdata, eval=FALSE}

# create a directory to put the data in
dir.create("data") # make sure we have a data dir to download file to

# download a copy locally
download.file(url="https://mikoontz.github.io/data-carpentry-week/data/combined.csv", destfile = "data/combined.csv") # this downloads and names your file

```

 > One extra note about `.Rmd` files, if you saved your `Rmd` in a "docs" folder, or in a folder of any name, you'll need to provide a `../` in front of your path, because when you run code in `Rmd` it defaults to the directory that the `Rmd` lives. `..` means go UP a directory. The base directory or root directory is wherever your .Rproj file lives. R scripts will always default to wherever the .Rproj file is, whereas .Rmd files default to wherever they actually are. 

```{r readdata}

# read in that data to your environment
survey <- read.csv("data/combined.csv")

```

### Summarize Data

Now let's figure out what we have using `summary` and `glimpse` functions.

```{r inspect}

summary(survey)
glimpse(survey)

# a way to view counts of categories in a variable
table(survey$species_id) # this is a factor vector

# test if table() works on character data types
table(as.character(survey$species_id)) # works on character vectors

# let's look at sex column for NA's and blanks
table(survey$sex)

```

So it looks like many different species, but abundances (counts) are widely variable. Some have 1, others have as many as 2000.

Also appears to have many NA's in the `hindfoot_length` and `weight` columns. And we have "blanks" in the `sex` column.

## Filter Data

### Remove NAs and Blanks

Let's get rid of NA's from **`weight`** & **`hindfoot_length`** columns, and the mysterious blanks (`""`) from the sex column. Write that object to a new dataframe called "*surveys_complete*".

```{r filterNas}

surveys_complete <- survey %>% 
  filter(!is.na(weight), # remove NAs in weight
         !is.na(hindfoot_length), # remove NAs in hindfoot_length
         sex!="", # could also use !sex==""
         species_id!="") # remove missing species_id
```

### Filter Rare Species & `%in%`

Use `dplyr` to create a separate dataframe of the species that have been observed more than 50 times. Then filter the `surveys_complete` based on those counts using the **`%in%`** command in `filter`.

```{r filterRareSpp}

## Extract the most common species_id
species_counts <- surveys_complete %>%
  group_by(species_id) %>% # grouping here so the count/tally will be by this group
  tally() %>%
  filter(n >= 50) # filter to common species

# quick histogram of species obs
hist(species_counts$n)

## Only keep the most common species
surveys_complete2 <- surveys_complete %>%
  filter(species_id %in% species_counts$species_id)

```

Some additional examples using `%in%`, a way to look for matches in one list vs. another.

```{r %in% example}

animals<- c("racoon", "rabbit", "cat", "dog", "frog","penguin")

friendlyanimals <- c("frog", "dog")

animals %in% friendlyanimals # finds match with friendlyanimals only

!animals %in% friendlyanimals # finds only animals that don't match friendly animals

friendlyanimals %in% animals # will work either way,
# usually use % going from "big table" to "lookup/little table"

```

#### Factors and Levels

Quick note about weird factor things.

```{r convrtfactorlevels}

# notice there are levels with no observations
table(surveys_complete2$species_id) 

# factors retain levels even when no observations exist

# how to get rid of "empty factor levels"
surveys_complete2$species_id <- factor(surveys_complete2$species_id)

# that should have dropped unused factors from this column and re-assigned to the dataframe
table(surveys_complete2$species_id) 

```

**Note** when you save this to `.csv` the data will not retain all these unused factors, so you won't need to convert or "drop levels" if you are exporting and reading a csv back in. 

## Export/Save Data as CSV

Export without rownames using `write.csv`

```{r exportcsv}

write.csv(surveys_complete2, file = "data_output/surveys_complete.csv", row.names = FALSE)
```


# Visualize Data with `ggplot2`

Need to read in our data and libraries first.

```{r loadcleandata}
library(tidyverse)
surveys_complete <- read.csv("data_output/surveys_complete.csv")

```


Now we have a clean dataset and we want to start looking at it. Let's make some visualizations and plot our data.

## Boxplot & Violin Plots 

 - Create a boxplot/violinplot to visualize the distribution of the `weight` variable for each species.  
 - Create a different box/violinplot to show the distribution of the `hindfoot_length` variable for each species.
 
*Let's look at `weight`*
```{r boxplotWeight}

surveys_complete %>% 
  ggplot() + geom_boxplot(aes(x=species_id, y=weight), fill="skyblue") +
  theme_bw()

```

```{r violplotWeight}
surveys_complete %>% 
  ggplot() + geom_violin(aes(x=species_id, y=weight), fill="skyblue") +
  theme_classic()

```

*Ok! Now let's look at `hindfoot_length`*
```{r boxHindfoot}
surveys_complete %>% 
  ggplot() + geom_boxplot(aes(x=species_id, y=hindfoot_length), fill="salmon") +
  theme_bw()

```

```{r violHindfoot}
surveys_complete %>% 
  ggplot() + geom_violin(aes(x=species, y=hindfoot_length), fill="salmon") +
  theme_bw()

```

## Boxplots with Points and `geom_jitter`

Boxplots are great for showing a lot of information about the spread of a variable, but some information from the raw data is lost. 

 - Add a point geometry to your box/violinplot to visualze the raw `hindfoot_length data`
 
```{r boxptHindfoot}
surveys_complete %>% 
  ggplot(aes(x=species_id, y=hindfoot_length)) +
  geom_boxplot() +
  geom_point(color="forestgreen") +
  theme_bw()

```

Unfortunately, adding the raw data makes our figure “overplotted”. There are too many points overlapping and information is lost this way, too.

 - Make a new boxplot of `hindfoot_length` for each species, but this time plot a jittered version of the raw data as a geometry on top of your boxplot

```{r boxptHindfootJitt}
surveys_complete %>% 
  ggplot(aes(x=species_id, y=hindfoot_length)) +
  geom_boxplot() +
  geom_jitter()
  
```

Color by plot id:

```{r boxptHindfootJitt2}
surveys_complete %>% 
  ggplot(aes(x=species_id, y=hindfoot_length, color=plot_id)) +
  geom_jitter(alpha=0.5) +
  geom_boxplot()

```

Whoa, plotted the `plot_id` as a numeric range instead of independent observations or locations. How do we fix?

```{r boxptHindfootJitt3}
surveys_complete %>% 
  ggplot(aes(x=species_id, y=hindfoot_length)) + # change to factor for plotting only
  geom_jitter(aes(color=as.factor(plot_id)),alpha=0.5) +
  geom_boxplot() + 
  # add a better title and axis label
  xlab("Species ID") + ylab("Hindfoot Length (mm)")+
  ggtitle("A Crazy Plot")

```

## Changing Labels & Legend Titles?

How to change legend title?

```{r boxptHindfootJitt4}
surveys_complete %>% 
  ggplot(aes(x=species_id, y=hindfoot_length)) + # change to factor for plotting only
  geom_jitter(aes(color=as.factor(plot_id)),alpha=0.5) +
  geom_boxplot() + 
  # add a better title and axis label
  xlab("Species ID") + ylab("Hindfoot Length (mm)")+
  ggtitle("A Crazy Plot") +
  labs(color="Plot ID") # OR scale_color_discrete(name="Plot ID")


```

# **AFTERNOON**: Custom Functions

What is a function? Something that takes something (data), and does something (a calculation) on that data. 

```{r barebonesFx}
# Bare bones function call
hypotenuse <- function(a, b) {
  myHypotenuse <- sqrt(a^2 + b^2)
  return(myHypotenuse)
}

hypotenuse(4, 5)
```

Difference between `hypotenuse` and `myHypotenuse` is:
 - `hypotenuse` is the *name of the function*
 - `myHypotenuse` is the name of the *local* object that I've assigned within the function.
 
```{r Fx objects}

hypotenuse <- function(a, b) {
  bookshelves <- sqrt(a^2 + b^2)
  return(bookshelves)
}

hypotenuse(4, 5)

# Objects defined within the function stay within the function!
# bookshelves # this won't work because it doesn't exist, 
# because I defined it within a function

# you can also pass an operation directly to return() without any need
# to define an intermediate variable
hypotenuse <- function(a, b) {
  return(sqrt(a^2 + b^2)) # but probably better to be explicit as above
}

hypotenuse(43, 5)
```

In R, you don't need to use a return function, the function will *return* whatever the output of the calculation is:

```{r Fx_return}
hypotenuse <- function(a, b) {
  sqrt(a^2 + b^2) # but probably better to be explicit as above
}

hypotenuse(43, 5)
```

## Default Arguments to a Function

You can name your arguments a bit more explicitly, which is helpful when sharing use the function. More information about what the pieces are doing.

**Naming your arguments**
```{r namingFxArgs}
hypotenuse(a=43, b=5)
hypotenuse(b=5, a=43)
```

You can explicitly put a default value in the function. The user of that function can then explicitly add a value, or just run with the default.
If you have a default value, you can still pass a different value for each time you call your function.

```{r fxdefault, eval=TRUE}
hypotenuse <- function(a, b = 12) {
  theHypotenuse <- sqrt(a^2 + b^2)
  return(theHypotenuse)
}

hypotenuse(a = 5)
# hypotenuse(b = 4) # throws an error because no default for "a"
hypotenuse(a = 3, b = 4)

```

What if you just put `5` instead of `a = 5`?

```{r}
hypotenuse(5)
```
Looks like that still works...so we *don't have to be* explicit, as long as a default value exists.


## Defensive programming

How do we add errors or failsafe's to keep the function from being changed?

```{r ifstopFx}

hypotenuse <- function(a, b) {
  if(a <= 0) {
    stop("Triangle side must have a length greater than 0")
  }
  theHypotenuse <- sqrt(a^2 + b^2)
  return(theHypotenuse)
}

hypotenuse(a = 2, b = 12)
```

```{r ifstopError, eval=F, echo=T}
hypotenuse(a = -4, b = 12) # will throws error:
```

<span style="color:red">
`Error in hypotenuse(a = -4, b = 12) : 
  Triangle side must have a length greater than 0`
</span>


## Temperature conversion 

Add to the code below to make a function that converts Fahrenheit to Kelvin. Remember that temperature in **K** is equal to the **F** temperature minus 32 * 5/9 and the whole thing plus 273.15.

$K = (F - 32) * \left(\frac{5}{9}\right) + 273.15$

```{r FtoX}

F_to_X <- function(temp) {
  K <- ((temp - 32) * 5/9) + 273.15
  return(K)
}

# Freezing point of water in Kelvin
F_to_X(32)

# this works across a whole fector as well:
myTemps <- c(0, 32, 98.6, 212)

F_to_X(myTemps)
```

```{r KtoC}

# A function that converts K to C
# subtract 273.15 from K to get to C
K_to_C <- function(temp){
  C <- temp - 273.5
  return(C)
}

K_to_C(100)

```

You can also nest functions and combine them at will:

```{r nestingFxs}

K_to_C(F_to_X(98.6))

```

## Sourcing Functions

We can save a function as it's own standalone script, and then use `source()`. This will essentially allow use to load the function for use in whatever script we may want. For example, let's say we saved a copy of these functions in a "`temperature_conversion.R`" file in our `scripts` folder.

If we were starting in an entirely different script or starting a new session (but within same project), we could use the following lines of code to make those functions available:

```{r sourcing, eval=F}
source("scripts/temperature_conversion.R")
# should see the functions K_to_C and F_to_X in our Environment Tab under "Functions".

# we can then apply those functions as needed:

K_to_C(F_to_X(110)) # how hot was Davis a few weeks ago in Kelvin?

```

## Create Gapminder Function to Plot

Combining functions to make some plots!

Write a new function that: 
 - Takes two arguments, the `gapminder` data.frame and the name of a `country`
 - Plots the change in the country’s population over time
 - The return value from the function should be a ggplot object. 

It is often easier to modify existing code than to start from scratch. Feel free to start with the `plotPopGrowth` function code.


*Note: I've added some error checking here as an example...you wouldn't need that piece for this to works*
```{r plotPopGrowthFx}
library(tidyverse)
library(gapminder)

plotPopGrowth <- function(countryToPlot, theData = gapminder) {
  # this filters data to a specific country
  oneCountry <- theData %>%
    filter(country == countryToPlot)
  
  # can add some error checking if you want
  if(!nrow(oneCountry)>0) { # checks if there are values for country
    stop("No Data for this Country, please try again")
  }

  # this runs the plot
  ggplot(data=oneCountry) + 
           geom_line(aes(x=year, y=pop, color=country))
}
```

Now can we use it?
```{r usePlotPopFx}

# plotPopGrowth("Afghanistans") # gives error
plotPopGrowth("Afghanistan") 
plotPopGrowth("United States")
```

```{r}
countries <- unique(gapminder$country)
for(i in seq_along(countries)){
  plotPopGrowth(countries[i])
  ggsave(filename = paste0("figures/plot_",countries[i], ".png"))
}
```