Introduction

The aim of this project is to build machine learning models to predict the forest cover type (the predominant kind of tree cover) based on cartographic variables. I will be using a dataset from a Kaggle Competition, which sourced its data from the UC Irvine Machine Learning Repository. For this project, I will employ multiple machine learning techniques to determine the best model for this multi-class classification problem. The four models I will be using are KNN, Multinomial Regression, Random Forest, and Boosted Trees.

Loading Packages

library(tidyverse)
library(corrplot)
library(tidymodels)
library(ISLR)
library(ISLR2)
library(ggthemes)
library(yardstick)
library(kknn)
library(xgboost)
library(ranger)
library(naniar)
library(dplyr)

Motives and Reasons

Growing up on the East Coast, one thing that always caught my attention was the vast number of trees lining the highway on the way home from school. I have also always enjoyed driving through forests during the day, and particularly at night, as it gives me the feeling of an adventurous journey. When I went to California, I noticed that while there were plenty of trees, many of them were pine trees. Intrigued by this difference, I chose this project in order to understand what conditions fit each cover type and how cartographic variables can be used to predict it.

Data Description and Importance

The forest cover type in this dataset was determined for 30 x 30 meter cells, based on data from the US Forest Service (USFS) Region 2 Resource Information System. The independent variables were derived from data obtained from the US Geological Survey and the USFS. The data is in its raw form (not scaled) and contains a number of binary columns for qualitative variables such as wilderness area and soil type, so the models will involve many dummy variables, though no pivoting is needed since this encoding has already been done. The study area comprises four wilderness areas located in the Roosevelt National Forest of northern Colorado, an area with minimal human-caused disturbance. The rationale is that the observed cover types are a result of ecological processes rather than forest management practices or other human factors.

Since forests are crucial for absorbing carbon dioxide and greenhouse gas emissions and preserving our earth, this project plays a critical role in ecological understanding, forest resource management, and environmental conservation efforts. The exploratory data analysis will help us understand the distributions of different forest cover types and can help us understand the different factors that cause a certain forest type in a particular place. Furthermore, the model ultimately selected can be used to predict forest cover types in different areas using local environmental and geographical data, an ability that makes this a valuable tool for land use and forest planning.

Project Roadmap

Now that we understand the background of the data, as well as the importance and reasoning behind it, I will lay out the plan for this model-building project. First, we will perform some basic exploratory data analysis: we will check the dimensions of our data and the variables provided, then examine the relationships between predictors to see if we need to handle any correlated ones. Our goal is to use the predictors in this dataset to accurately predict the cover type. After the initial exploratory data analysis, we will split our dataset into training and testing sets. Since this dataset is from a Kaggle competition, I will only be using the provided training set, as the competition's testing set does not include outcomes, preventing us from evaluating supervised models on it. Next, I will create a recipe specifying which variables to use, and set up folds for the 10-fold cross-validation we will implement. I will then train four models: multinomial regression, KNN, random forest, and boosted trees. After comparing their cross-validation accuracies, we will fit the best-performing model to our testing set and evaluate its effectiveness.

Exploratory Data Analysis

Read data

First, we will load the data by reading a CSV. Since the dataset is from a Kaggle competition, it is relatively clean, and there isn't much cleaning for us to do. There are also no missing values, so it is a tidy dataset. From its dimensions, we see that there are 15,120 observations and 56 columns (an Id column, the outcome, and 54 predictor columns), making this a relatively large dataset with a large number of predictors. However, remember that most of those columns are dummy variables, so it won't be as complicated as it might seem.

data <- read.csv("./train.csv")

Check for missingness

vis_miss(data)

dim(data)
## [1] 15120    56

Describing the Outcome and Predictors

After understanding the basic structure of the dataset, I will list out the different outcomes that we are trying to predict, followed by the predictors that I will be using in the dataset so that we can better understand which predictors we are actually using. The following are the 7 different cover type outcomes we will be trying to predict.

  1. Spruce/Fir: Spruce/Fir forests
  2. Lodgepole Pine: Lodgepole Pine forests
  3. Ponderosa Pine: Ponderosa Pine forests
  4. Cottonwood/Willow: Cottonwood/Willow forests
  5. Aspen: Aspen forests
  6. Douglas-fir: Douglas-fir forests
  7. Krummholz: Krummholz forests

The following are the predictors I will be using:

  • Elevation: Elevation in meters
  • Aspect: Aspect in degrees azimuth
  • Slope: Slope in degrees
  • Horizontal Distance To Hydrology: Horizontal distance to nearest surface water features
  • Vertical Distance To Hydrology: Vertical distance to nearest surface water features
  • Horizontal Distance To Roadways: Horizontal distance to nearest roadway
  • Hillshade 9am (0 to 255 index): Hillshade index at 9am, summer solstice
  • Hillshade Noon (0 to 255 index): Hillshade index at noon, summer solstice
  • Hillshade 3pm (0 to 255 index): Hillshade index at 3pm, summer solstice
  • Horizontal Distance To Fire Points: Horizontal distance to nearest wildfire ignition points
  • Wilderness Area (4 binary columns, 0 = absence or 1 = presence): Wilderness area designation
  • Soil Type (40 binary columns, 0 = absence or 1 = presence): Soil Type designation

To expand on the two categorical variables behind the dummy columns, I have listed each of the wilderness areas as well as the soil types below:

Wilderness Areas:

  1. Rawah Wilderness Area
  2. Neota Wilderness Area
  3. Comanche Peak Wilderness Area
  4. Cache la Poudre Wilderness Area

Soil Types:

  1. Cathedral family: Rock outcrop complex, extremely stony.
  2. Vanet - Ratake families complex: Very stony.
  3. Haploborolis - Rock outcrop complex: Rubbly.
  4. Ratake family - Rock outcrop complex: Rubbly.
  5. Vanet family - Rock outcrop complex complex: Rubbly.
  6. Vanet - Wetmore families - Rock outcrop complex: Stony.
  7. Gothic family.
  8. Supervisor - Limber families complex.
  9. Troutville family: Very stony.
  10. Bullwark - Catamount families - Rock outcrop complex: Rubbly.
  11. Bullwark - Catamount families - Rock land complex: Rubbly.
  12. Legault family - Rock land complex: Stony.
  13. Catamount family - Rock land - Bullwark family complex: Rubbly.
  14. Pachic Argiborolis - Aquolis complex.
  15. Unspecified in the USFS Soil and ELU Survey.
  16. Cryaquolis - Cryoborolis complex.
  17. Gateview family - Cryaquolis complex.
  18. Rogert family: Very stony.
  19. Typic Cryaquolis - Borohemists complex.
  20. Typic Cryaquepts - Typic Cryaquolls complex.
  21. Typic Cryaquolls - Leighcan family, till substratum complex.
  22. Leighcan family, till substratum, extremely bouldery.
  23. Leighcan family, till substratum - Typic Cryaquolls complex.
  24. Leighcan family, extremely stony.
  25. Leighcan family, warm, extremely stony.
  26. Granile - Catamount families complex: Very stony.
  27. Leighcan family, warm - Rock outcrop complex: Extremely stony.
  28. Leighcan family - Rock outcrop complex: Extremely stony.
  29. Como - Legault families complex: Extremely stony.
  30. Como family - Rock land - Legault family complex: Extremely stony.
  31. Leighcan - Catamount families complex: Extremely stony.
  32. Catamount family - Rock outcrop - Leighcan family complex: Extremely stony.
  33. Leighcan - Catamount families - Rock outcrop complex: Extremely stony.
  34. Cryorthents - Rock land complex: Extremely stony.
  35. Cryumbrepts - Rock outcrop - Cryaquepts complex.
  36. Bross family - Rock land - Cryumbrepts complex: Extremely stony.
  37. Rock outcrop - Cryumbrepts - Cryorthents complex: Extremely stony.
  38. Leighcan - Moran families - Cryaquolls complex: Extremely stony.
  39. Moran family - Cryorthents - Leighcan family complex: Extremely stony.
  40. Moran family - Cryorthents - Rock land complex: Extremely stony.

Visual EDA

Now, before we go straight to fitting different models, let's first explore the data and try to unravel the story it is telling. For me, one of the most interesting parts of this project was the exploratory data analysis, since this section gives more context to the dataset and the problem we are trying to tackle.

Distribution of types:

Let’s first check the distribution of the outcome variable to see whether any cover type is heavily over- or under-represented. In the following code, I converted the cover types to their actual names (the dataset encodes them as numbers) and created a bar plot of the distribution. The plot shows that the outcome classes are evenly distributed across the dataset, i.e., the classes are balanced.

data_type <- data %>%
  mutate(Cover_Type = case_when(
    Cover_Type == 1 ~ "Spruce/Fir",
    Cover_Type == 2 ~ "Lodgepole Pine",
    Cover_Type == 3 ~ "Ponderosa Pine",
    Cover_Type == 4 ~ "Cottonwood/Willow",
    Cover_Type == 5 ~ "Aspen",
    Cover_Type == 6 ~ "Douglas-fir",
    Cover_Type == 7 ~ "Krummholz",
    TRUE ~ as.character(Cover_Type)  # This line handles any values that don't match 1-7
  ))

ggplot(data_type, aes(x = Cover_Type)) +
  geom_bar(aes(fill = Cover_Type)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Correlation matrix

Following the distribution of the outcome variable, let’s create a correlation heat map of the numeric predictors to get a sense of their relationships with one another. For this correlation matrix, I left out the two sets of dummy variables; otherwise there would be too many variables and the heat map would be hard to interpret.

selected_data <- data %>%
  select(Elevation, Aspect, Slope, Horizontal_Distance_To_Hydrology, 
         Vertical_Distance_To_Hydrology, Horizontal_Distance_To_Roadways, 
         Hillshade_9am, Hillshade_Noon, Hillshade_3pm, 
         Horizontal_Distance_To_Fire_Points)

selected_data %>%
  select(where(is.numeric)) %>%
  cor() %>%
  corrplot(type = 'lower', diag = FALSE, method = 'color', tl.cex = 0.6, tl.srt = 45)


I will highlight a few notable relationships observed in the heatmap. First, there is a strong negative correlation between Hillshade_9am and Hillshade_3pm, which is expected as the sun rises in the east and sets in the west throughout the day. Consequently, areas that receive strong sunlight in the morning will naturally experience less sunlight in the afternoon as the sun shifts position. Interestingly, Hillshade_Noon shows a strong positive correlation with Hillshade_3pm but not with Hillshade_9am. This could be attributed to the position of the sun at these times, where noontime sun is more indicative of the afternoon light conditions than the morning. Another compelling relationship is the strong positive correlation between Vertical Distance to Hydrology and Horizontal Distance to Hydrology, suggesting that as the vertical distance increases, the horizontal distance tends to increase as well. This relationship may be explained by the nature of mountainous terrain, where elevation changes are often accompanied by changes in proximity to water sources. Furthermore, Elevation exhibits a slight positive correlation with Horizontal Distance to Roadways, Fire Points, and Hydrology. This makes sense, as higher elevations may be more distant from these features, possibly due to the topography and reduced human intervention at higher altitudes.

Wilderness area and Soil type bar chart

Since we removed wilderness area and soil type from our correlation matrix, let’s look at the distributions of these categorical variables. Because the dataset was already pivoted into dummy columns, I had to melt those columns back into a single column each, and I then plotted a bar chart for each variable.

soil_data <- data %>%
  pivot_longer(
    cols = starts_with("Soil_Type"),
    names_to = "Soil_Type",
    values_to = "Presence"
  ) %>%
  filter(Presence == 1) %>%
  select(-Presence)

soil_data <- soil_data %>% select(Id,Soil_Type)


wilderness_data <- data %>%
  pivot_longer(
    cols = starts_with("Wilderness_Area"),
    names_to = "Wilderness_Area",
    values_to = "Presence"
  ) %>%
  filter(Presence == 1) %>%
  select(-Presence)

wilderness_data <- wilderness_data %>% select(Id, Wilderness_Area)

melted_data <- inner_join(soil_data, wilderness_data, by = 'Id')
melted_data %>%
  ggplot(aes(x = fct_infreq(Soil_Type))) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8), # Adjust text angle and size
        plot.title = element_text(hjust = 0.5),
        plot.margin = margin(5, 5, 5, 5)) + # Adjust margin if needed
  labs(title = "Count of Each Soil Type",
       x = "Type",
       y = "Count")

melted_data %>%
  ggplot(aes(x = fct_infreq(Wilderness_Area)))  +
  geom_bar() +
  theme(plot.title = element_text(hjust = 0.5))

wilderness_sums <- data %>%
  summarise(across(starts_with("Wilderness_Area"), \(x) sum(x, na.rm = TRUE)))
wilderness_sums
##   Wilderness_Area1 Wilderness_Area2 Wilderness_Area3 Wilderness_Area4
## 1             3597              499             6349             4675
soil_sums <- data %>%
  summarise(across(starts_with("Soil_Type"), \(x) sum(x, na.rm = TRUE)))
soil_sums
##   Soil_Type1 Soil_Type2 Soil_Type3 Soil_Type4 Soil_Type5 Soil_Type6 Soil_Type7
## 1        355        623        962        843        165        650          0
##   Soil_Type8 Soil_Type9 Soil_Type10 Soil_Type11 Soil_Type12 Soil_Type13
## 1          1         10        2142         406         227         476
##   Soil_Type14 Soil_Type15 Soil_Type16 Soil_Type17 Soil_Type18 Soil_Type19
## 1         169           0         114         612          60          46
##   Soil_Type20 Soil_Type21 Soil_Type22 Soil_Type23 Soil_Type24 Soil_Type25
## 1         139          16         345         757         257           1
##   Soil_Type26 Soil_Type27 Soil_Type28 Soil_Type29 Soil_Type30 Soil_Type31
## 1          54          15           9        1291         725         332
##   Soil_Type32 Soil_Type33 Soil_Type34 Soil_Type35 Soil_Type36 Soil_Type37
## 1         690         616          22         102          10          34
##   Soil_Type38 Soil_Type39 Soil_Type40
## 1         728         657         459

From the two tables I created, we finally see some discrepancies in the data. In terms of soil type, we observed a large number of samples for soil types 10, 29, 3, 4, and 23, whereas there were fewer samples from soil types 8, 25, 28, 9, and 36. Looking at the wilderness area distribution, there are more samples from wilderness areas 3 and 4, compared to fewer samples from Area 2. It’s interesting to note that the number of samples from wilderness area 2 is significantly less than those from the other three areas. This suggests that collecting samples from wilderness area 2 might be more challenging, while it seems relatively easier to gather samples from wilderness areas 3 and 4. Furthermore, it’s possible that soil types 10, 29, 3, 4, and 23 originate predominantly from wilderness areas 3 and 4, whereas soil types 8, 25, 28, 9, and 36 are likely found in wilderness area 2. The abundance of certain soil types could be due to the higher availability of samples from these particular wilderness areas.

Additionally, there’s something interesting to note in terms of coding logistics. I spent a lot of time melting the wilderness and soil type dummy columns back into single columns to make the bar charts. However, I could simply have summed each dummy column and plotted those counts directly, which would have been much easier; a sketch of that simpler route is shown below. It demonstrates that, with the same analytical goal in mind, different coding paths can be taken.
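For illustration, here is a minimal sketch of the simpler approach, reusing the wilderness_sums table computed above (the soil type columns would work the same way):

# Plot the column sums directly instead of melting all 15,120 rows.
wilderness_sums %>%
  pivot_longer(everything(),
               names_to = "Wilderness_Area",
               values_to = "Count") %>%
  ggplot(aes(x = fct_reorder(Wilderness_Area, -Count), y = Count)) +
  geom_col() +
  labs(title = "Count of Each Wilderness Area", x = "Wilderness Area", y = "Count")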

Elevation vs Cover_type

ggplot(data, aes(factor(Cover_Type), Elevation)) +
  geom_boxplot() + 
  geom_jitter(alpha = 0.05) +
  xlab("Cover Type")

While exploring the data, I wondered whether certain tree types might be better suited to specific elevations, so I plotted cover type against elevation. In line with my expectations, the average elevation for each cover type was distinctly different. While cover types 1 and 7 seem to prefer elevations greater than about 3,000 meters, cover types 3 and 4 appear to favor elevations under 2,500 meters. Moreover, cover type 4 has the least variation in elevation, while cover types 1 and 6 exhibit the greatest. An interesting point to note is that cover type 7 has the most “outliers”, while cover type 4 has none.

Distance from Water vs Cover type

ggplot(data, aes(factor(Cover_Type), Horizontal_Distance_To_Hydrology)) +
  geom_boxplot() + 
  geom_jitter(alpha = 0.05) +
  xlab("Cover Type")

ggplot(data, aes(factor(Cover_Type), Vertical_Distance_To_Hydrology)) +
  geom_boxplot() + 
  geom_jitter(alpha = 0.05) +
  xlab("Cover Type")

After looking at the elevation versus the cover type, it seemed that different cover types had varying elevation preferences. With this observation in mind, I wanted to examine the different cover types’ distances to water. I graphed both the vertical and horizontal distances to water against the cover type. From the two charts shown above, there doesn’t appear to be a significant difference in either vertical or horizontal distance to water among different cover types. This was surprising to me, as there was a notable difference in elevation, and there is a positive correlation between elevation and horizontal distance to water.

Setting up the models

Now that we have explored our data, let’s begin by setting up the model to fit different predictive models to our dataset. First, we will split our dataset into a training set and a testing set. Next, we will create a recipe and establish cross-validation within our models.

Split Data

The first step in building our model is to split our data into training and testing sets. The goal is to use the larger portion of our data to train the model, maximizing its accuracy, and then apply the model to the testing data to check that accuracy on unseen observations. For this project, I allocated 75% of the data to training and 25% to testing. I also stratified the split so that each cover type appears in equal proportion in both sets, which is important for preserving class balance and generalization across models. Furthermore, I set a seed so that this project can be reproduced by anyone who uses the same seed.

set.seed(123)  # For reproducibility
data$Cover_Type <- as.factor(data$Cover_Type)
data_split <- initial_split(data, prop = 0.75, strata = Cover_Type)
train_data <- training(data_split)
test_data <- testing(data_split)

dim(train_data)
## [1] 11340    56
dim(test_data)
## [1] 3780   56

After splitting our data, we can see that there are 11,340 samples in our training set and 3,780 in our testing set. Since our dataset is large, 11,340 samples is more than sufficient for training.

Creating Recipe

We will now build the recipe that all of our models will share. The recipe specifies the response variable, the predictor variables, and any preprocessing conditions. You can think of the recipe as the skeleton of a human: the shape and structure of each skeleton are similar, yet height, width, and mass vary from person to person, just as predictions vary across the different models that share the same recipe. The different predictor variables are akin to the different parts of the skeleton: the arm, the leg, the torso, and so on. Although our EDA made the dataset look predictor-heavy, most of the columns are dummy variables; if we don't double count them, there are really only 12 underlying variables. Because the categorical variables have already been expanded into dummy columns, our recipe simply includes every column, which makes the recipe creation very easy. However, it's important to acknowledge that this approach increases model complexity: by including a large number of predictors, we risk overfitting, particularly for models like multinomial regression.

recipe <- recipe(Cover_Type ~ ., data = train_data)  # all other columns (including Id) are treated as predictors

K-Folds Cross Validation

To estimate each model's performance without touching the testing set, we will use 10-fold cross-validation, stratified by cover type so that each fold preserves the class balance.

cv_folds <- vfold_cv(train_data, v = 10, strata = Cover_Type)

##save(cv_folds, file = 'cv_folds.Rda')
# cv_folds <- load('./cv_folds.Rda')

Model Building

This is now the most important part of the project: building our models. We will fit four models to our data to predict the outcome variable, all using the same recipe. This task was quite time-consuming, since tuning the parameters for random forest and boosted trees required a lot of computational time. I will show all of the model building in this file, since my computer was powerful enough to run every model. The four models I will be using are KNN, multinomial regression, random forest, and boosted trees. While multinomial regression was relatively easy to fit and run, the other three required tuning work to find the best parameters.

Most model-building processes are similar. The general workflow is as follows (a compact code sketch follows the list):

  1. We first set up the model by indicating which model we are using and set its engine as well as its mode. For this project, the mode will be classification since we are trying to predict the cover type.
  2. Then, we set up the workflow by attaching each model to the shared recipe. Steps 3-5 apply only to the models that require tuning (random forest, boosted trees, and KNN); for the simpler multinomial regression we can skip straight to step 6.
  3. After we finish setting up our workflow, we now need to set up the tuning grid with different parameters that we want to tune, as well as the ranges for how many levels of tuning we want to try for each parameter.
  4. Now, we can tune each model with the specific hyperparameters of choice.
  5. After tuning, we should select the most accurate model from the tuning grid and finalize the workflow with those specific tuning parameters.
  6. Once the workflow is determined, we want to fit our model with the workflow to our training dataset for comparison with other models.
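
To make these steps concrete, here is a minimal, hypothetical sketch of the full sequence using a single decision tree (the rpart engine), which is not one of our four models and serves purely as an illustration; it reuses the recipe and cv_folds defined earlier and assumes the rpart package is installed.

tree_model <- decision_tree(cost_complexity = tune()) %>%   # step 1: model, engine, mode
  set_engine("rpart") %>%
  set_mode("classification")

tree_wflow <- workflow() %>%                                # step 2: workflow = model + recipe
  add_model(tree_model) %>%
  add_recipe(recipe)

tree_grid <- grid_regular(cost_complexity(range = c(-3, -1)), levels = 3)  # step 3: tuning grid

tree_tune <- tune_grid(tree_wflow, resamples = cv_folds, grid = tree_grid) # step 4: tune

tree_final_wflow <- finalize_workflow(tree_wflow,
                                      select_best(tree_tune, metric = "accuracy"))  # step 5: finalize

tree_fit <- fit(tree_final_wflow, train_data)               # step 6: fit on training data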

Performance Metric

To evaluate the performance of each of our models, the two metrics reported by the collect_metrics() function are accuracy and roc_auc. The one we will use for comparison is accuracy, because it is the more straightforward measure of correctness for our multiclass classification problem: the number of correct predictions divided by the total number of predictions. An important caveat is that ROC AUC is the better choice when classes are imbalanced, but our EDA showed that the cover types are evenly distributed, so accuracy is appropriate here.
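
As a quick illustration of this computation, here is a minimal sketch on made-up vectors of true and predicted classes (the values are hypothetical, purely for demonstration):

# Accuracy = number of correct predictions / total predictions.
lv <- c("1", "2", "3", "5", "7")                         # hypothetical class levels
truth <- factor(c("1", "2", "2", "7", "5"), levels = lv)
pred  <- factor(c("1", "2", "3", "7", "5"), levels = lv)
mean(pred == truth)  # 4 of 5 correct = 0.8

# yardstick's accuracy() computes the same quantity from a data frame:
tibble(truth = truth, pred = pred) %>%
  accuracy(truth = truth, estimate = pred)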

KNN

I chose KNN as one of the four models because of its simplicity and effectiveness in classification tasks. Essentially, KNN classifies a point by taking a majority vote among the classes of its k closest neighbors. Since its performance is sensitive to the number of neighbors, I will tune k using a grid search to find the optimal balance between bias and variance.

knn_model <- nearest_neighbor(neighbors = tune()) %>% 
  set_engine("kknn") %>% 
  set_mode("classification")

knn_wflow <- workflow() %>% 
  add_model(knn_model) %>% 
  add_recipe(recipe)

# neighbors_grid <- grid_regular(neighbors(range = c(1, 20)), levels = 10)

# knn_fit <- tune_grid(object = knn_wflow,
#                          resamples = cv_folds,
#                          grid = neighbors_grid)
# 
# write_rds(knn_fit, file = 'knn_tune.rds')

knn_fit <- read_rds('./knn_tune.rds')


best_neighbors <- select_by_one_std_err(knn_fit, desc(neighbors), metric = "accuracy")

final_wf <- finalize_workflow(knn_wflow, best_neighbors)

final_fit <- fit(final_wf, train_data)

knn_accuracy <- best_neighbors[4]  # column 4 holds the mean CV accuracy

Multinomial Regression

We learned how to use logistic regression in class; however, since our response variable has seven classes, a multinomial regression model is better suited. Multinomial regression generalizes logistic regression to multiple outcome classes while remaining easy to interpret. Since it is a relatively straightforward model, there is no need to fine-tune any parameters; we will use the default settings as the baseline.

multinom_model <- multinom_reg() %>%
  set_engine("nnet") %>%
  set_mode("classification")

workflow <- workflow() %>%
  add_recipe(recipe) %>%
  add_model(multinom_model)


# reg_fit <- fit_resamples(
#   workflow,
#   resamples = cv_folds
# )
# 
#  write_rds(reg_fit, file = 'reg_fit.rds')
 
 reg_fit <- read_rds('./reg_fit.rds')

reg_metrics <- collect_metrics(reg_fit)
reg_accuracy <- reg_metrics[1, 3]  # row 1 = accuracy; column 3 = its mean

Random Forest

Random Forest is an excellent model for handling a large number of predictors, as is the case with our dataset. Moreover, it is quite resilient to overfitting, making it well-suited for complex datasets like ours. One drawback, however, is that this model requires extensive tuning; here we tune the number of predictors sampled at each split (mtry), the number of trees, and the minimum node size (min_n). The tuning process, as with KNN, will use a grid search over these parameters. It's worth noting that Random Forest is often among the best-performing models, so it would not be surprising if it ultimately emerges on top.

rf_model <- rand_forest(
    mtry = tune(),
    trees = tune(),
    min_n = tune()
  ) %>% 
  set_engine("ranger") %>% 
  set_mode("classification")

rf_workflow <- workflow() %>% 
  add_model(rf_model) %>% 
  add_recipe(recipe)

# rf_grid <- grid_regular(
#   mtry(range = c(1, 8)),
#   trees(range = c(100, 500)),
#   min_n(range = c(5, 15)),
#   levels = 3
# )


# rf_tune_results <- tune_grid(
#   rf_workflow,
#   resamples = cv_folds,
#   grid = rf_grid
# )
# 
# write_rds(rf_tune_results, file = 'rf_tune.rds')

rf_tune_results <- read_rds('./rf_tune.rds')


rf_accuracy <- show_best(rf_tune_results, n = 1, metric = "accuracy")[6]  # column 6 = mean CV accuracy

best_hps <- select_best(rf_tune_results, metric = "accuracy")

final_rf_workflow <- finalize_workflow(rf_workflow, best_hps)

final_rf_fit <- fit(final_rf_workflow, train_data)

Boosted trees

Boosted trees are known for high accuracy: trees are added iteratively, each one correcting the errors of those before it, but this requires intensive tuning, similar to random forest. For boosted trees we will tune mtry, the number of trees, and the learning rate. Boosted trees also perform quite well as a machine learning model, so if this model ultimately wins, it would not be surprising.

boosted_trees_model<- boost_tree(mtry = tune(), 
                           trees = tune(), 
                           learn_rate = tune()) %>%
  set_engine("xgboost") %>% 
  set_mode("classification")

workflow <- workflow() %>% 
  add_model(boosted_trees_model) %>% 
  add_recipe(recipe)

# bt_grid <- grid_regular(mtry(range = c(1, 8)),
#                         trees(range = c(100, 500)),
#                         learn_rate(range = c(-10, -1)),
#                         levels = 3)
# 
# bt_tune_results <- tune_grid(
#   workflow,
#   resamples = cv_folds,
#   grid = bt_grid
# )

# write_rds(bt_tune_results, file = 'bt_tune.rds')

bt_tune_results <- read_rds('./bt_tune.rds')
bt_accuracy <- show_best(bt_tune_results, n = 1, metric = "accuracy")[6]  # column 6 = mean CV accuracy


best_model <- select_best(bt_tune_results, metric = "accuracy")

final_model <- finalize_workflow(workflow, best_model)

bt_final_fit <- fit(final_model, train_data)

Results of our models

Model Autoplots

One of the most useful functions in tidymodels is autoplot(). It displays how the performance metric changes as each hyperparameter is tuned. I will present the autoplots of the tuning process for the KNN, random forest, and boosted tree models to showcase how their performance changes with respect to the parameters.

autoplot(bt_tune_results) 

autoplot(rf_tune_results) 

autoplot(knn_fit)

From the autoplots of the boosted tree and random forest, we can see that there isn't much difference between the numbers of trees shown in the legend. Thus, we will use the show_best function to find the exact best configuration for both Random Forest and Boosted Trees. For KNN, we can tell that, based on accuracy, the best number of neighbors is one.

Accuracy Comparison

accuracy_1 <- c(knn_accuracy$mean, reg_accuracy$mean, rf_accuracy$mean, bt_accuracy$mean)
accuracy_1
## [1] 0.7866843 0.6885362 0.8510582 0.8658730
mod_names <- c("KNN",
            "Multimodel Regression",
            "Random Forest",
            "Boosted Trees")

comp_results <- tibble(Model = mod_names,
                             Accuracy = accuracy_1)

comp_results
## # A tibble: 4 × 2
##   Model                  Accuracy
##   <chr>                     <dbl>
## 1 KNN                       0.787
## 2 Multinomial Regression    0.689
## 3 Random Forest             0.851
## 4 Boosted Trees             0.866
p1 <- ggplot(comp_results, aes(x = Model, y = Accuracy)) +
      geom_bar(stat = "identity", fill = "steelblue") +
      coord_flip() +
      theme_minimal() +
      labs(title = "Model Accuracy Comparison", x = "Model", y = "Accuracy")
p1 

p2 <- ggplot(comp_results, aes(x = Model, y = Accuracy, group = 1)) +
      geom_line(color = "darkorange", linewidth = 1) +
      geom_point(color = "darkorange", size = 3) +
      theme_minimal() +
      labs(title = "Model Accuracy Trend", x = "Model", y = "Accuracy")
p2

To compare the accuracy of the four models, I created a tibble with the mean cross-validation accuracy of each and displayed it in two charts: a bar plot and a line plot. The accuracy varies quite a bit, from 0.689 for multinomial regression up to 0.866 for boosted trees. The boosted tree appears to be the best model, and I will use it, along with the runner-up random forest, to predict the testing data and see how they perform.

Results from the best model

We delve deeper into the Random Forest and Boosted Tree models, as they had the highest training accuracies. Looking at the tables below, we see that both models may take time to run due to the number of trees selected (500). Since we have already chosen the best models, there's no need to alter our workflows. Our task now is to examine the best Random Forest and Boosted Tree configurations and use them to predict and analyze our testing data.

The best model for Boosted Tree is… Boosted Tree number 27. The boosted tree model #27 appears to have performed the best out of all of our Boosted Tree models. Furthermore, this model is the best we have tried so far, even considering the other three models we have tested.

show_best(bt_tune_results, n = 1,metric = "accuracy")
## # A tibble: 1 × 9
##    mtry trees learn_rate .metric  .estimator  mean     n std_err .config        
##   <int> <int>      <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>          
## 1     8   500        0.1 accuracy multiclass 0.866    10 0.00332 Preprocessor1_…

Now that we have identified our best model, it is time to test this model against the testing dataset and evaluate our prediction accuracy.

final_bt_predictions <- augment(bt_final_fit,test_data)
bt_accuracy <- final_bt_predictions %>%
    accuracy(truth = Cover_Type, estimate = .pred_class)
bt_accuracy
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.864

With a final accuracy of 0.864 on our testing dataset, our boosted tree model has performed well, essentially matching its cross-validation accuracy of 0.866, which suggests it generalizes cleanly to unseen data. An accuracy over 80% is generally considered fair and acceptable. While this is commendable, there's still room for improvement.
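
To see where that remaining error lives, here is a short sketch (reusing final_bt_predictions from above) that tabulates a confusion matrix showing which cover types get mistaken for one another:

# Confusion matrix of the boosted tree's test-set predictions.
final_bt_predictions %>%
  conf_mat(truth = Cover_Type, estimate = .pred_class) %>%
  autoplot(type = "heatmap")  # cells off the diagonal are misclassifications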

The best model for Random Forest is… Random Forest model number 09 is our best performer in this category, ranking second overall among the models we tested.

show_best(rf_tune_results, n = 1,metric = "accuracy")
## # A tibble: 1 × 9
##    mtry trees min_n .metric  .estimator  mean     n std_err .config             
##   <int> <int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1     8   500     5 accuracy multiclass 0.851    10 0.00385 Preprocessor1_Model…

Now that we have identified our Random Forest model, it’s time to test this model against the testing dataset and evaluate our prediction accuracy.

final_rf_predictions <- augment(final_rf_fit, test_data)
rf_accuracy <- final_rf_predictions %>%
    accuracy(truth = Cover_Type, estimate = .pred_class)
rf_accuracy
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.852

Our Random Forest model achieved a final accuracy of 0.8516 on the testing dataset, performing well, though slightly below the boosted tree, which aligns with the boosted tree's higher cross-validation accuracy. Like the boosted tree, the random forest's testing accuracy is essentially in line with its cross-validation estimate (0.851). While over 80% accuracy is fair and acceptable, there is still room for improvement in our model.

Conclusion

In this project, we meticulously demonstrated a complete machine learning procedure using our cover type dataset and various machine learning models. Our objective was to utilize cartographic variables to predict the predominant tree cover in a given area.

Through comprehensive exploratory data analysis, we gained valuable insights into the dataset. The even distribution of cover types and the significant correlations among variables such as elevation, hydrology, and wilderness areas highlighted the intricate balance in forest ecosystems. Our analysis uncovered interesting patterns, like the preference of certain cover types for specific elevations and their relationships with water sources.

In the model fitting section, we employed four models (KNN, Multinomial Regression, Random Forest, and Boosted Trees), tuning and evaluating them with cross-validation on the training data. The Boosted Tree model stood out, achieving an accuracy of 86.4% on the testing set, with Random Forest close behind at 85.2%; both clearly outperformed KNN and multinomial regression. A key takeaway from this project is the critical role of model tuning for complex models. Our experiences with parameter tuning illustrated the delicate balance between variance and bias. It was fascinating to observe how different models yielded varied predictions within the same dataset.

However, there are definitely areas where this project can improve. For instance, we could explore additional models or more advanced ensembling techniques that might yield even more accurate predictions. The insights gained from this project could be invaluable for ecological conservation efforts, contributing to informed forest management and land use planning. “We are moving slowly into an era where big data is the starting point, not the end,” Pearl Zhu, author of ‘Digital Masters,’ insightfully remarks, capturing the essence of our evolving digital landscape. This quote resonates deeply with me as I conclude this project. It has revealed the transformative power of data science—a field critical in addressing unexplored questions. As we enter this era of data abundance, understanding and interpreting data becomes a crucial skill, key to unlocking solutions to the most complex and unanswered problems. Embracing this data revolution, I am excited to be part of a discipline at the forefront of discovery and innovation.