---
title: "Web Scraping"
author: "JJB + Course"
date: "07/16/2018"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Pipe Operator

## Example: Piping Operator

```{r}
# install.packages("magrittr")
library("magrittr")

4 %>%     # Take the number four and, then
  sqrt()  # find the square root

# Same as
# sqrt(4)

c(7, 42, 1, 25) %>% # Combine four elements and, then
  log() %>%         # take the natural log and, then
  round(2) %>%      # round to the second decimal and, then
  diff()            # take the difference between consecutive elements

# Same as
# diff(round(log(c(7,42,1,25)), 2))
```

# HTTP Requests

## Example: Downloading a File

```{r}
#install.packages("readxl")

# URL of file to retrieve
url = "http://www.dmi.illinois.edu/stuenr/class/enrfa17.xls"
# http://www.dmi.illinois.edu/stuenr/class/enrfa16.xls

# Save this file as…
destfile = "enrfa17.xls"

# Download the file
download.file(url = url, destfile = destfile)

# Read the file into R
enrollfa17 = readxl::read_excel(destfile, skip = 4)

# Adding skip = 4 to read statement allows for the
# proper headers to be set for the data as it skips
# the initial four rows that contain no information.

head(enrollfa17)

View(enrollfa17)
```

### Exercise: Understanding HTML

1. Identify all tags and what can be extracted
2. Determine what tags have attributes and what their properties are

```html
Title of Page

First order heading (large)

Paragraph for text with a link!

Top Beverages

  1. Tea
  2. Coffee
  3. Milkshakes

```

# Scraping Information

```{r}
#install.packages("rvest")
library("rvest")

sample_html = ' Title of Page

First order heading (large)

Paragraph for text with a link!

Top Beverages

  1. Tea
  2. Coffee
  3. Milkshakes

'

my_webpage = read_html(sample_html)

# Or...
# my_webpage = read_html("http://domain.com/path/to/sample_html.html")

my_webpage
```

## Example: Extract Node or Nodes

```{r}
my_webpage = read_html(sample_html)

# All matching nodes
my_webpage %>%
  html_nodes("li")

# Only the first matching node
my_webpage %>%
  html_node("li")

# Text inside every matching node
my_webpage %>%
  html_nodes("li") %>%
  html_text()
```

## Example: Retrieve Attributes

```{r}
# Retrieve specific attribute
my_webpage %>%
  html_nodes("a") %>%
  html_attr("href")

# Retrieve all attributes
my_webpage %>%
  html_nodes("a") %>%
  html_attrs()
```

```{r}
pbs_url = "https://www.pbs.org/newshour/"

## Note the selector we found is:
# .card-sm__title span , .playlist__title, .card-md__title span, .home-hero__title a

pbs_webpage = read_html(pbs_url)

pbs_webpage %>%
  html_nodes(".card-sm__title span , .playlist__title, .card-md__title span, .home-hero__title a") %>%
  html_text()
```

```{r}
gnews = read_html("https://news.google.com")

gnews

gnews %>%
  html_nodes(".kWyHVd .ME7ew") %>%
  html_text()
```

## Exercises:

Find the top listed stars of [The Thomas Crown Affair](https://www.imdb.com/title/tt0155267/)

```{r}
library("rvest")

# 1. Load in the HTML page into R
imdb_page = read_html("http://www.imdb.com/title/tt2294629")

# Thomas Crown Affair
# read_html("http://www.imdb.com/title/tt0155267/")

# Frozen movie
# read_html("http://www.imdb.com/title/tt2294629")

# The selector for actor names and character names in movie
# .cast_list

## Selector using Web Developer Tools
# #titleCast > table

actor_table = imdb_page %>%
  html_node(".cast_list") %>%
  html_table(header = TRUE)

# actor_table[, c(-1, 3)]

str(actor_table)

# Drop the first and third columns, keeping actor and character names
actor_table = actor_table[, c(-1, -3)]

colnames(actor_table) = c("Actor Names", "Character Names")

actor_table

# Strip all whitespace from the character names
actor_table[, 2] = gsub("[[:space:]]", "", actor_table[, 2])

actor_table

# 2. Determine the selectorgadget values
# ...

# 3. Extract the contents
# ...
```

Extract the weather information in the hourly table for Champaign, IL from the Weather Underground page read in the chunk below.

Hint: You will need to use `html_table(x)`

```{r}
## Picked the selector of:
# #history-observation-table

wu_webpage = read_html("https://www.wunderground.com/history/daily/us/il/champaign/KCMI/date/2018-7-17")

wu_webpage %>%
  html_node("#history-observation-table :nth-child(1)")
```

**Note:** This is an example of a page that uses JavaScript, so the table may not be present in the HTML that `read_html()` retrieves.

# Advanced Web Scraping

## Example: Grabbed HTML Output

```html
```

```html
```

To generalize, we're aiming to find an attribute on the HTML tag that appears on every element we want. If we can find such an attribute, then we can construct a CSS selector of the form `tag[attribute=value]`.

```{r}
# Read in the Movie
imdb_movie = read_html("https://www.imdb.com/title/tt0155267/")

# Create a CSS selector based on two or more HTML attributes.
imdb_movie %>%
  html_nodes("td[itemprop=\"actor\"] span[itemprop=\"name\"]") %>%
  html_text()
```

```{r}
?as.POSIXct

x = c("July 5, 2018 05:08 AM")

# %p (AM/PM) is only used on input together with the 12-hour %I, not %H
as.POSIXct(x, format = "%B %e, %Y %I:%M %p")
```
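
When parsing 12-hour timestamps such as the one above, the `%p` (AM/PM) marker is only honored on input when it is paired with `%I`; with `%H` it is ignored. Below is a minimal sketch of the difference. The two timestamps are made up for illustration, and `tz = "UTC"` is an assumption added so the result does not depend on the local time zone. It also assumes an English locale, since `%B` and `%p` are locale-dependent.

```{r}
# Hypothetical timestamps for illustration; not taken from any scraped page
stamps = c("July 5, 2018 05:08 AM", "July 5, 2018 05:08 PM")

# %I reads the hour on a 12-hour clock and %p reads the AM/PM marker.
# tz = "UTC" is an assumption so the result is the same on every machine.
parsed = as.POSIXct(stamps, format = "%B %e, %Y %I:%M %p", tz = "UTC")

# Display the parsed hours on a 24-hour clock to confirm AM and PM differ
hours_24 = format(parsed, "%H:%M")
hours_24
```

If the AM and PM rows come back with the same hour, the format string is most likely using `%H` where `%I` is needed.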