Fitting Exercise

Author

Cassia Roth

Published

April 11, 2024

This exercise practices model fitting using the tidymodels framework.

Here, we are going to fit models to the Mavoglurant pharmacokinetic dataset, which originates from the nlmixr2data package. First, let’s read in the .csv file and check its structure.

#Loading packages
library(tidyverse) #This includes ggplot2, tidyr, readr, dplyr, stringr, purrr, forcats
library(knitr)
library(here) # For file paths
library(kableExtra) # For tables
library(gt) # For tables
library(gtsummary) # For tables
library(renv) # For package management
library(patchwork) # For combining plots

#Reading in the csv file
mavo <- read_csv(here("fitting-exercise", "Mavoglurant_A2121_nmpk.csv"))

#Checking the packaging (displaying first and last few rows of data)
nrow(mavo)
[1] 2678
ncol(mavo)
[1] 17
#Showing the structure
str(mavo)
spc_tbl_ [2,678 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ID  : num [1:2678] 793 793 793 793 793 793 793 793 793 793 ...
 $ CMT : num [1:2678] 1 2 2 2 2 2 2 2 2 2 ...
 $ EVID: num [1:2678] 1 0 0 0 0 0 0 0 0 0 ...
 $ EVI2: num [1:2678] 1 0 0 0 0 0 0 0 0 0 ...
 $ MDV : num [1:2678] 1 0 0 0 0 0 0 0 0 0 ...
 $ DV  : num [1:2678] 0 491 605 556 310 237 147 101 72.4 52.6 ...
 $ LNDV: num [1:2678] 0 6.2 6.41 6.32 5.74 ...
 $ AMT : num [1:2678] 25 0 0 0 0 0 0 0 0 0 ...
 $ TIME: num [1:2678] 0 0.2 0.25 0.367 0.533 0.7 1.2 2.2 3.2 4.2 ...
 $ DOSE: num [1:2678] 25 25 25 25 25 25 25 25 25 25 ...
 $ OCC : num [1:2678] 1 1 1 1 1 1 1 1 1 1 ...
 $ RATE: num [1:2678] 75 0 0 0 0 0 0 0 0 0 ...
 $ AGE : num [1:2678] 42 42 42 42 42 42 42 42 42 42 ...
 $ SEX : num [1:2678] 1 1 1 1 1 1 1 1 1 1 ...
 $ RACE: num [1:2678] 2 2 2 2 2 2 2 2 2 2 ...
 $ WT  : num [1:2678] 94.3 94.3 94.3 94.3 94.3 94.3 94.3 94.3 94.3 94.3 ...
 $ HT  : num [1:2678] 1.77 1.77 1.77 1.77 1.77 ...
 - attr(*, "spec")=
  .. cols(
  ..   ID = col_double(),
  ..   CMT = col_double(),
  ..   EVID = col_double(),
  ..   EVI2 = col_double(),
  ..   MDV = col_double(),
  ..   DV = col_double(),
  ..   LNDV = col_double(),
  ..   AMT = col_double(),
  ..   TIME = col_double(),
  ..   DOSE = col_double(),
  ..   OCC = col_double(),
  ..   RATE = col_double(),
  ..   AGE = col_double(),
  ..   SEX = col_double(),
  ..   RACE = col_double(),
  ..   WT = col_double(),
  ..   HT = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

Data cleaning

Next, let’s get a quick visual idea of the data by plotting the outcome variable DV as a function of time, stratified by DOSE and using ID as a grouping factor.

# Plotting the outcome variable (DV) as a function of time, stratified by DOSE and using ID as a grouping factor. ChatGPT helped me with this code.
mavo %>%
  ggplot(aes(x = TIME, y = DV, color = factor(DOSE))) +
  geom_line() +
  facet_wrap(~ID) +
  labs(x = "Time", y = "Outcome Variable (DV)", color = "DOSE") +
  theme_minimal()

Per Dr. Handel’s guidance, we can see that the formatting of the dataset still looks off because some individuals received the drug more than once (OCC = 1 and OCC = 2). We will now remove all entries with OCC = 2, keeping only the first dosing occasion.

# Filter data to include only the first dose and store it in mavo_filtered
mavo_filtered <- mavo %>%
  filter(OCC == 1)  # Filter observations where OCC is equal to 1

# Plotting the outcome variable (DV) as a function of time, stratified by DOSE and using ID as a grouping factor
mavo_filtered %>%
  ggplot(aes(x = TIME, y = DV, color = factor(DOSE))) +
  geom_line() +
  facet_wrap(~ID) +
  labs(x = "Time", y = "Outcome Variable (DV)", color = "DOSE") +
  theme_minimal()

Next, let’s remove the TIME=0 values for each individual and compute the sum of DV for each individual, stored in a new variable Y. We will also write code that keeps only the TIME=0 entries for each individual. Finally, we will join the two datasets so we have a single one that contains the TIME=0 entry for everyone, plus an extra column for total drug concentration, yielding a data frame of size 120 by 18.

# Remove TIME=0 values for each individual and compute the sum of DV for each individual
mavo_filteredDV <- mavo_filtered %>%
  filter(TIME != 0) %>%  # Remove TIME=0 values for each individual
  group_by(ID) %>%       # Group by ID
  summarize(Y = sum(DV)) # Compute the sum of DV for each individual

# Keep only the TIME=0 entries for each individual
mavo_timeZero <- mavo_filtered %>%
  filter(TIME == 0)

# Combine two datasets
mavo_combined <- left_join(mavo_timeZero, mavo_filteredDV, by = "ID")

Now, we will do more cleaning by only including the following variables: Y, DOSE, RATE, AGE, SEX, RACE, WT, HT. We also must convert RACE and SEX to factor variables.

# Selecting only the specified variables from mavo_combined
mavo_combined_subset <- mavo_combined %>%
  select(Y, DOSE, RATE, AGE, SEX, RACE, WT, HT) %>% # Select only the specified variables
  mutate(RACE = factor(RACE), # Convert RACE and SEX variables to factor variables
         SEX = factor(SEX))

# View the structure of the updated dataframe
str(mavo_combined_subset)
tibble [120 × 8] (S3: tbl_df/tbl/data.frame)
 $ Y   : num [1:120] 2691 2639 2150 1789 3126 ...
 $ DOSE: num [1:120] 25 25 25 25 25 25 25 25 25 25 ...
 $ RATE: num [1:120] 75 150 150 150 150 150 150 150 150 150 ...
 $ AGE : num [1:120] 42 24 31 46 41 27 23 20 23 28 ...
 $ SEX : Factor w/ 2 levels "1","2": 1 1 1 2 2 1 1 1 1 1 ...
 $ RACE: Factor w/ 4 levels "1","2","7","88": 2 2 1 1 2 2 1 4 2 1 ...
 $ WT  : num [1:120] 94.3 80.4 71.8 77.4 64.3 ...
 $ HT  : num [1:120] 1.77 1.76 1.81 1.65 1.56 ...

EDA

Now that we have cleaned the data, we can perform some exploratory data analysis. I will start by creating a summary table using tbl_summary() from the gtsummary package.

# Create gtsummary table
table1 <- 
  mavo_combined_subset %>%
  tbl_summary(include = c(Y, DOSE, RATE, AGE, SEX, RACE, WT, HT))
table1
Characteristic    N = 120¹
Y 2,349 (1,701, 3,050)
DOSE
    25 59 (49%)
    37.5 12 (10%)
    50 49 (41%)
RATE
    75 1 (0.8%)
    150 58 (48%)
    225 12 (10%)
    300 49 (41%)
AGE 31 (26, 40)
SEX
    1 104 (87%)
    2 16 (13%)
RACE
    1 74 (62%)
    2 36 (30%)
    7 2 (1.7%)
    88 8 (6.7%)
WT 82 (73, 90)
HT 1.77 (1.70, 1.81)
¹ Median (IQR); n (%)

Now, I will visualize the distribution of our continuous variables in density plots: Y, AGE, WT, and HT.

# Create density plots for continuous variables. CoPilot helped with this code.
plot1 <-
  mavo_combined_subset %>%
  select(Y, AGE, WT, HT) %>%
  pivot_longer(everything()) %>%
  ggplot(aes(x = value, fill = name)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~name, scales = "free") +
  theme_minimal()
plot1

#Saving figure
figure_file <- here("fitting-exercise", "densityplots.png")
ggsave(filename = figure_file, plot=plot1, bg="white")
Saving 7 x 5 in image

Now, let’s create boxplots of Y stratified by DOSE and by RATE.

# Create boxplots for DOSE and RATE. CoPilot assisted with this code.
plot2 <-
  mavo_combined_subset %>%
  ggplot(aes(x = factor(DOSE), y = Y)) +
  geom_boxplot() +
  labs(x = "Dose", y = "Y") +
  theme_minimal()
plot2

#Saving figure
figure_file <- here("fitting-exercise", "boxplot_dose.png")
ggsave(filename = figure_file, plot=plot2, bg="white")
Saving 7 x 5 in image
plot3 <-
  mavo_combined_subset %>%
  ggplot(aes(x = factor(RATE), y = Y)) +
  geom_boxplot() +
  labs(x = "Rate", y = "Y") +
  theme_minimal()
plot3

#Saving figure
figure_file <- here("fitting-exercise", "boxplot_rate.png")
ggsave(filename = figure_file, plot=plot3, bg="white")
Saving 7 x 5 in image

From these boxplots, we can see some outliers in the Y variable. In an actual project, we might investigate or remove them, but I will keep them for this exercise.
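
If we did want to flag the outliers, a minimal sketch using the 1.5 × IQR rule on Y might look like the following (the rule and the column name Y_outlier are my own choices for illustration, not part of the original analysis):

# Sketch: flag potential outliers in Y with the 1.5 * IQR rule (illustrative only)
q <- quantile(mavo_combined_subset$Y, c(0.25, 0.75))
iqr_y <- q[2] - q[1]
mavo_combined_subset %>%
  mutate(Y_outlier = Y < q[1] - 1.5 * iqr_y | Y > q[2] + 1.5 * iqr_y) %>%
  count(Y_outlier) # how many observations fall outside the fences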

Now let’s create some scatterplots to visualize relationships between Y and AGE, WT, and HT.

# Create scatterplots for Y vs. AGE, WT, and HT. CoPilot helped with this code.
plot4 <-
  mavo_combined_subset %>%
  ggplot(aes(x = AGE, y = Y)) +
  geom_point() +
  labs(x = "Age", y = "Y") +
  theme_minimal()
plot4

plot5 <-
  mavo_combined_subset %>%
  ggplot(aes(x = WT, y = Y)) +
  geom_point() +
  labs(x = "Weight", y = "Y") +
  theme_minimal()
plot5

plot6 <-
  mavo_combined_subset %>%
  ggplot(aes(x = HT, y = Y)) +
  geom_point() +
  labs(x = "Height", y = "Y") +
  theme_minimal()
plot6

# Laying out in grid
pairplot <- plot4 / plot5 / plot6
pairplot

#Saving figures
figure_file <- here("fitting-exercise", "scatterplot_age.png")
ggsave(filename = figure_file, plot=plot4, bg="white")
Saving 7 x 5 in image
figure_file <- here("fitting-exercise", "scatterplot_weight.png")
ggsave(filename = figure_file, plot=plot5, bg="white")
Saving 7 x 5 in image
figure_file <- here("fitting-exercise", "scatterplot_height.png")
ggsave(filename = figure_file, plot=plot6, bg="white")
Saving 7 x 5 in image
figure_file <- here("fitting-exercise", "pairplot.png")
ggsave(filename = figure_file, plot=pairplot, bg="white")
Saving 7 x 5 in image

Model fitting

Linear models

Now let’s move to model fitting using the tidymodels framework. First, let’s load the packages we need.

# Loading packages
library(tidymodels)
library(broom.mixed)
library(dotwhisker)

Now, let’s fit a linear model to the continuous outcome Y using the main predictor of interest, DOSE.

# ChatGPT and CoPilot helped with this code. I also used this website: https://www.tidymodels.org/start/models/ and this part of TMWR https://www.tmwr.org/performance (9.2 Regression Metrics) to guide me through the process.

# One option is create a linear model using the lm() function from base R
#mavo_lm <- lm(Y ~ DOSE, data = mavo_combined_subset)

# Other option using linear_reg() (default engine, OLS) from tidymodels
mavo_lm1 <- linear_reg() %>% set_engine("lm") %>% fit(Y ~ DOSE, data = mavo_combined_subset)

# Use broom::tidy() to tidy results
tidy(mavo_lm1)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    323.     199.        1.62 1.07e- 1
2 DOSE            58.2      5.19     11.2  2.69e-20
# Produce predictions for RMSE and R-squared
mavo_lm1_pred <- predict(mavo_lm1, new_data = mavo_combined_subset %>% select(-Y))
mavo_lm1_pred
# A tibble: 120 × 1
   .pred
   <dbl>
 1 1778.
 2 1778.
 3 1778.
 4 1778.
 5 1778.
 6 1778.
 7 1778.
 8 1778.
 9 1778.
10 1778.
# ℹ 110 more rows
# Predicted numeric outcome named .pred. Now we will match predicted values with corresponding observed outcome values
mavo_lm1_pred <- bind_cols(mavo_lm1_pred, mavo_combined_subset %>% select(Y))
mavo_lm1_pred
# A tibble: 120 × 2
   .pred     Y
   <dbl> <dbl>
 1 1778. 2691.
 2 1778. 2639.
 3 1778. 2150.
 4 1778. 1789.
 5 1778. 3126.
 6 1778. 2337.
 7 1778. 3007.
 8 1778. 2796.
 9 1778. 3866.
10 1778. 1762.
# ℹ 110 more rows
# Plot data before computing metrics
ggplot(mavo_lm1_pred, aes(x = Y, y = .pred)) +
  geom_abline(lty = 2) + # Add a dashed line to represent the 1:1 line
  geom_point(alpha = 0.5) +
  labs(x = "Observed", y = "Predicted") +
  coord_obs_pred() # Scale and size x- and y-axes uniformly

# Create metric set including RMSE and R-squared
metrics_lm1 <- metric_set(rmse, rsq, mae) # MAE for fun
metrics_lm1(mavo_lm1_pred, truth = Y, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard     666.   
2 rsq     standard       0.516
3 mae     standard     517.   

Here, we can see from the R-squared statistic that DOSE alone explains about 52% of the variation in the outcome, although this doesn’t necessarily tell us whether the model is a good fit for the data at hand. The RMSE, the square root of the average squared difference between observed and predicted values, should ideally be low (small distance between observed and predicted); here it is quite high, at about 666.
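
As a sanity check on what these metrics mean, RMSE and MAE can also be computed by hand from the prediction data frame; this minimal sketch, using the mavo_lm1_pred object created above, should reproduce the yardstick values.

# Sketch: recompute RMSE and MAE directly from the observed and predicted values
with(mavo_lm1_pred, sqrt(mean((Y - .pred)^2))) # RMSE: root mean squared error
with(mavo_lm1_pred, mean(abs(Y - .pred)))      # MAE: mean absolute error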

Next, let’s fit a linear model to the continuous outcome Y using all predictors. Then we will go through the same process and get RMSE and R-squared.

# I used the Tidymodels website: https://www.tidymodels.org/learn/statistics/tidy-analysis/ for this code

# Create a linear model using linear_reg() from tidymodels
mavo_lm2 <- linear_reg() %>% set_engine("lm") %>% fit(Y ~ DOSE + RATE + AGE + SEX + RACE + WT + HT, data = mavo_combined_subset)

# Use broom::tidy() to tidy results
output_mavo_lm2 <- tidy(mavo_lm2)

# Use dotwhisker::dwplot() to visualize the results
dwplot(mavo_lm2)

# Produce predictions for RMSE and R-squared
mavo_lm2_pred <- predict(mavo_lm2, new_data = mavo_combined_subset %>% select(-Y))
mavo_lm2_pred
# A tibble: 120 × 1
   .pred
   <dbl>
 1 2691.
 2 1919.
 3 1974.
 4 1644.
 5 2135.
 6 2023.
 7 1549.
 8 2177.
 9 2354.
10 1196.
# ℹ 110 more rows
# Predicted numeric outcome named .pred. Now we will match predicted values with corresponding observed outcome values
mavo_lm2_pred <- bind_cols(mavo_lm2_pred, mavo_combined_subset %>% select(Y))
mavo_lm2_pred
# A tibble: 120 × 2
   .pred     Y
   <dbl> <dbl>
 1 2691. 2691.
 2 1919. 2639.
 3 1974. 2150.
 4 1644. 1789.
 5 2135. 3126.
 6 2023. 2337.
 7 1549. 3007.
 8 2177. 2796.
 9 2354. 3866.
10 1196. 1762.
# ℹ 110 more rows
# Plot data before computing metrics
ggplot(mavo_lm2_pred, aes(x = Y, y = .pred)) +
  geom_abline(lty = 2) + # Add a dashed line to represent the 1:1 line
  geom_point(alpha = 0.5) +
  labs(x = "Observed", y = "Predicted") +
  coord_obs_pred() # Scale and size x- and y-axes uniformly

# Create metric set including RMSE and R-squared
metrics_lm2 <- metric_set(rmse, rsq, mae) # MAE for fun
metrics_lm2(mavo_lm2_pred, truth = Y, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard     583.   
2 rsq     standard       0.629
3 mae     standard     435.   

This model does somewhat better than the dose-only model. The R-squared statistic indicates that the predictors together explain about 63% of the variation in the outcome, although this doesn’t necessarily tell us whether the model is a good fit for the data at hand. The RMSE is still fairly high, at about 583 (down from 666 for the dose-only model), so the additional predictors only modestly improve the fit.
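
To compare the two linear models at a glance, one option (a small sketch, not part of the original output) is to stack the two metric tables and label each with the model it came from:

# Sketch: stack the metric tables for the two linear models for easier comparison
bind_rows(
  metrics_lm1(mavo_lm1_pred, truth = Y, estimate = .pred) %>% mutate(model = "DOSE only"),
  metrics_lm2(mavo_lm2_pred, truth = Y, estimate = .pred) %>% mutate(model = "All predictors")
) %>%
  select(model, .metric, .estimate)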

Logistic models

Now we will consider SEX as the outcome of interest and fit a logistic model to the categorical/binary outcome SEX using the main predictor of interest, DOSE. Then we will compute accuracy and ROC-AUC and print them.

# https://www.datacamp.com/courses/modeling-with-tidymodels-in-r I used videos from this online course to help me with this code, specifically the Classification module.

# Data resampling: split the data into training and testing sets, stratified by SEX
mavo_log1_split <- initial_split(mavo_combined_subset, prop = 0.75, strata = SEX)

# Create training and test datasets
mavo_log1_train <- mavo_log1_split %>% training()

mavo_log1_test <- mavo_log1_split %>% testing()

# Create a logistic model using logistic_reg() from parsnip
mavo_log1 <- logistic_reg() %>% set_engine("glm") %>% set_mode("classification") %>% #specify model
  fit(SEX ~ DOSE, data = mavo_log1_train) #Pass model object to fit(), specify model formula, provide training data

# Use broom::tidy() to tidy results
output_mavo_log1 <- tidy(mavo_log1)

# Predicting outcome categories
class_preds <- mavo_log1 %>% predict(new_data = mavo_log1_test, type = 'class') # New_data specifies dataset on which to predict new values; type class provides categorical predictions; standardized output from predict()
class_preds
# A tibble: 30 × 1
   .pred_class
   <fct>      
 1 1          
 2 1          
 3 1          
 4 1          
 5 1          
 6 1          
 7 1          
 8 1          
 9 1          
10 1          
# ℹ 20 more rows
# Estimating probabilities. Set type to prob to provide estimated probabilities for each outcome category
prob_preds <- mavo_log1 %>% predict(new_data = mavo_log1_test, type = 'prob')
prob_preds
# A tibble: 30 × 2
   .pred_1 .pred_2
     <dbl>   <dbl>
 1   0.884  0.116 
 2   0.884  0.116 
 3   0.884  0.116 
 4   0.884  0.116 
 5   0.930  0.0704
 6   0.930  0.0704
 7   0.930  0.0704
 8   0.930  0.0704
 9   0.930  0.0704
10   0.930  0.0704
# ℹ 20 more rows
# Combining results using model evaluation with yardstick package
mavo_log1_results <- mavo_log1_test %>% select(SEX) %>% bind_cols(class_preds, prob_preds) # Bind columns of two datasets together

# Confusion matrix with yardstick
conf_mat(mavo_log1_results, truth = SEX, estimate = .pred_class)
          Truth
Prediction  1  2
         1 26  4
         2  0  0
# Classification accuracy (generally not the best metric)
accuracy <- accuracy(mavo_log1_results, truth = SEX, estimate = .pred_class)
print(accuracy)
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.867
# ROC (ROC used to visualize performance across probability thresholds)
mavo_log1_results %>%
  roc_curve(truth = SEX, .pred_1)
# A tibble: 5 × 3
  .threshold specificity sensitivity
       <dbl>       <dbl>       <dbl>
1   -Inf             0         1    
2      0.814         0         1    
3      0.884         0.5       0.577
4      0.930         0.5       0.423
5    Inf             1         0    
# Plot ROC curve
mavo_log1_results %>%
  roc_curve(truth = SEX, .pred_1) %>%
  autoplot()

# Calculate ROC AUC
roc_auc <- roc_auc(mavo_log1_results, truth = SEX, .pred_1)
print(roc_auc)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary           0.5

The accuracy is 0.867, but the ROC-AUC is 0.5, meaning the model has no ability to discriminate between the two SEX categories; on a scale of A to F, that is an F. The relatively high accuracy simply reflects that the model predicts the majority class (“1”) for every observation, as the confusion matrix shows. Clearly, this is not a good model fit.
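
As an aside, accuracy and ROC-AUC can be computed together with a yardstick metric set rather than in separate calls; a small sketch, assuming the mavo_log1_results data frame from above:

# Sketch: compute accuracy and ROC-AUC in one call with a yardstick metric set
class_metrics <- metric_set(accuracy, roc_auc)
class_metrics(mavo_log1_results,
              truth = SEX,
              estimate = .pred_class,
              .pred_1) # class probability column for the first level ("1")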

Now we will fit a logistic model to the categorical/binary outcome SEX using all the other variables as predictors. Then we will compute accuracy and ROC-AUC and print them.

#For this code, I asked ChatGPT to take the previous code I had written with the help of Data Camp and change it to include all the predictor variables and rename it to mavo_log2 wherever it had been mavo_log1.

# Data resampling
mavo_log2_split <- initial_split(mavo_combined_subset, prop = 0.75, strata = SEX) 

# Create training and test datasets
mavo_log2_train <- mavo_log2_split %>% training()
mavo_log2_test <- mavo_log2_split %>% testing()

# Create a logistic model using logistic_reg() from parsnip
mavo_log2 <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification") %>% 
  fit(SEX ~ DOSE + AGE + RACE + RATE + WT + HT + Y, data = mavo_log2_train)
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Use broom::tidy() to tidy results
output_mavo_log2 <- tidy(mavo_log2)

# Predicting outcome categories
class_preds <- mavo_log2 %>% 
  predict(new_data = mavo_log2_test, type = 'class')

# Estimating probabilities
prob_preds <- mavo_log2 %>% 
  predict(new_data = mavo_log2_test, type = 'prob')

# Combining results using model evaluation with yardstick package
mavo_log2_results <- mavo_log2_test %>% 
  select(SEX) %>% 
  bind_cols(class_preds, prob_preds)

# Confusion matrix with yardstick
conf_mat(mavo_log2_results, truth = SEX, estimate = .pred_class)
          Truth
Prediction  1  2
         1 25  1
         2  1  3
# Classification accuracy
accuracy <- accuracy(mavo_log2_results, truth = SEX, estimate = .pred_class)
print(accuracy)
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.933
# ROC (ROC used to visualize performance across probability thresholds)
roc_curve_data <- mavo_log2_results %>%
  roc_curve(truth = SEX, .pred_1) 

# Plot ROC curve
autoplot(roc_curve_data)

# Calculate ROC AUC
roc_auc <- roc_auc(mavo_log2_results, truth = SEX, .pred_1)
print(roc_auc)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.856

Here, our accuracy is 0.933 and the ROC-AUC is 0.856, which corresponds to roughly a B classification performance on a scale of A to F. This is a better model fit than the previous one, but still not great. Note also the convergence warnings above: with this many predictors and only 90 training observations, some fitted probabilities are numerically 0 or 1, so the coefficient estimates should be interpreted with caution.
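
Beyond accuracy, the confusion matrix can be summarized into sensitivity, specificity, and predictive values with yardstick’s summary() method; a short sketch using the mavo_log2_results object from above:

# Sketch: derive sensitivity, specificity, and predictive values from the confusion matrix
conf_mat(mavo_log2_results, truth = SEX, estimate = .pred_class) %>%
  summary() %>%
  filter(.metric %in% c("sens", "spec", "ppv", "npv"))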

Exercise Part 2 (Module 10: Model Improvement)

Model Performance Assessment 1

Let’s set the random seed, following Dr. Handel’s approach. Then we will split the data into training and test sets, fit the two models (the first using only DOSE as a predictor and the second using all predictors), and use the observed and predicted values to compute the RMSE for each model.

# Fix the random numbers by setting the seed 
# This enables the analysis to be reproducible when random numbers are used 
rngseed = 1234
set.seed(rngseed)

# Keep only the variables needed for modeling: Y, DOSE, AGE, SEX, WT, HT (dropping RACE and RATE)
mavo_combined_subset <- mavo_combined_subset %>%
  select(Y, DOSE, AGE, SEX, WT, HT) # Select only the specified variables

# View the structure of the updated dataframe
str(mavo_combined_subset)
tibble [120 × 6] (S3: tbl_df/tbl/data.frame)
 $ Y   : num [1:120] 2691 2639 2150 1789 3126 ...
 $ DOSE: num [1:120] 25 25 25 25 25 25 25 25 25 25 ...
 $ AGE : num [1:120] 42 24 31 46 41 27 23 20 23 28 ...
 $ SEX : Factor w/ 2 levels "1","2": 1 1 1 2 2 1 1 1 1 1 ...
 $ WT  : num [1:120] 94.3 80.4 71.8 77.4 64.3 ...
 $ HT  : num [1:120] 1.77 1.76 1.81 1.65 1.56 ...
# Create training data
# Put 3/4 of the data into the training set 
data_split <- initial_split(mavo_combined_subset, prop = 3/4)

# Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)

Now we will fit two linear models to our continuous outcome of interest (Y) on the training data. The first model will only use DOSE as a predictor; the second model will use all predictors. For both models, the metric to optimize will be RMSE.

# Create a linear model using linear_reg() from tidymodels using dose as a predictor
mavo_lm3 <- linear_reg() %>% set_engine("lm") %>% set_mode("regression") %>% #specify model
  fit(Y ~ DOSE, data = train_data) #Pass model object to fit(), specify model formula, provide training data
mavo_lm3
parsnip model object


Call:
stats::lm(formula = Y ~ DOSE, data = data)

Coefficients:
(Intercept)         DOSE  
     535.45        53.42  
# Use broom::tidy() to tidy results
tidy(mavo_lm3)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    535.     244.        2.19 3.08e- 2
2 DOSE            53.4      6.29      8.50 4.41e-13
# Produce predictions for RMSE and R-squared
mavo_lm3_pred <- predict(mavo_lm3, new_data = train_data %>% select(-Y))
mavo_lm3_pred
# A tibble: 90 × 1
   .pred
   <dbl>
 1 3207.
 2 1871.
 3 2539.
 4 1871.
 5 3207.
 6 1871.
 7 1871.
 8 1871.
 9 1871.
10 2539.
# ℹ 80 more rows
# Predicted numeric outcome named .pred. Now we will match predicted values with corresponding observed outcome values
mavo_lm3_pred <- bind_cols(mavo_lm3_pred, train_data %>% select(Y))
mavo_lm3_pred
# A tibble: 90 × 2
   .pred     Y
   <dbl> <dbl>
 1 3207. 3004.
 2 1871. 1347.
 3 2539. 2772.
 4 1871. 2028.
 5 3207. 2353.
 6 1871.  826.
 7 1871. 3866.
 8 1871. 3126.
 9 1871. 1108.
10 2539. 2815.
# ℹ 80 more rows
# Create metric set including RMSE and R-squared
metrics_lm3 <- metric_set(rmse, rsq, mae) # MAE for fun
metrics_lm3(mavo_lm3_pred, truth = Y, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard     703.   
2 rsq     standard       0.451
3 mae     standard     546.   
# Create a linear model using linear_reg() from tidymodels using all predictors (Dr Handel's code somewhat)
mavo_lm4 <- linear_reg() %>% set_engine("lm") %>% fit(Y ~ ., data = train_data)
tidy(mavo_lm4)
# A tibble: 6 × 5
  term         estimate std.error statistic  p.value
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  4397.      2170.      2.03   4.59e- 2
2 DOSE           55.3        5.83    9.49   6.09e-15
3 AGE            -0.417      9.50   -0.0439 9.65e- 1
4 SEX2         -569.       285.     -1.99   4.95e- 2
5 WT            -22.6        7.65   -2.96   4.00e- 3
6 HT          -1130.      1358.     -0.832  4.08e- 1
# Compute the RMSE and R squared for the full model (all predictors) using Dr. Handel's code
metrics_lm4 <- mavo_lm4 %>% 
  predict(train_data) %>% 
  bind_cols(train_data) %>% 
  metrics(truth = Y, estimate = .pred)
metrics_lm4
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard     627.   
2 rsq     standard       0.562
3 mae     standard     486.   
# Run null-model (predict the mean outcome for each observation without predictor information). For the following code I used ChatGPT, and I also looked at my classmate Taylor Glass's code.
mavo_lm0 <- null_model(mode = "regression") %>% 
    set_engine("parsnip") %>%
    fit(Y ~ 1, data = train_data)

# Compute the RMSE and R squared for null model using Dr. Handel's Code
metrics_lm0 <- mavo_lm0 %>% 
  predict(train_data) %>% 
  bind_cols(train_data) %>% 
  metrics(truth = Y, estimate = .pred)
metrics_lm0
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        948.
2 rsq     standard         NA 
3 mae     standard        765.

For model 1 (mavo_lm3), the RMSE is 703; for model 2 (mavo_lm4), the RMSE is 627; and for the null model (mavo_lm0), the RMSE is 948.
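
To see those three RMSEs side by side, one option (a sketch reusing the objects created above; note that metrics_lm3 is a metric set function while metrics_lm4 and metrics_lm0 are already metric tibbles) is:

# Sketch: gather the training-set RMSEs of the three models into one table
bind_rows(
  metrics_lm3(mavo_lm3_pred, truth = Y, estimate = .pred) %>% mutate(model = "DOSE only"),
  metrics_lm4 %>% mutate(model = "All predictors"), # already a metrics tibble
  metrics_lm0 %>% mutate(model = "Null model")      # already a metrics tibble
) %>%
  filter(.metric == "rmse") %>%
  select(model, .estimate)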

Model Performance Assessment 2

Now we will perform a 10-fold cross-validation, fitting each of the two models to the data 10 times.

The first model is the simple linear model, only using DOSE as a predictor.

# Simple linear model
# Set the seed for reproducibility
set.seed(rngseed)

# Create 10-fold cross-validation
folds <- vfold_cv(train_data, v = 10)
folds
#  10-fold cross-validation 
# A tibble: 10 × 2
   splits         id    
   <list>         <chr> 
 1 <split [81/9]> Fold01
 2 <split [81/9]> Fold02
 3 <split [81/9]> Fold03
 4 <split [81/9]> Fold04
 5 <split [81/9]> Fold05
 6 <split [81/9]> Fold06
 7 <split [81/9]> Fold07
 8 <split [81/9]> Fold08
 9 <split [81/9]> Fold09
10 <split [81/9]> Fold10
# Set model specification
mavo_spec <- linear_reg() %>% set_engine("lm")

# Create and initialize workflow that bundles together model specification and formula. ChatGPT helped me here.
mavo_wf <- workflow() %>% 
  add_model(mavo_spec) %>% 
  add_formula(Y ~ DOSE)
# Resample the model
mavo_wf_rs1 <- fit_resamples(mavo_wf, resamples = folds)
mavo_wf_rs1
# Resampling results
# 10-fold cross-validation 
# A tibble: 10 × 4
   splits         id     .metrics         .notes          
   <list>         <chr>  <list>           <list>          
 1 <split [81/9]> Fold01 <tibble [2 × 4]> <tibble [0 × 3]>
 2 <split [81/9]> Fold02 <tibble [2 × 4]> <tibble [0 × 3]>
 3 <split [81/9]> Fold03 <tibble [2 × 4]> <tibble [0 × 3]>
 4 <split [81/9]> Fold04 <tibble [2 × 4]> <tibble [0 × 3]>
 5 <split [81/9]> Fold05 <tibble [2 × 4]> <tibble [0 × 3]>
 6 <split [81/9]> Fold06 <tibble [2 × 4]> <tibble [0 × 3]>
 7 <split [81/9]> Fold07 <tibble [2 × 4]> <tibble [0 × 3]>
 8 <split [81/9]> Fold08 <tibble [2 × 4]> <tibble [0 × 3]>
 9 <split [81/9]> Fold09 <tibble [2 × 4]> <tibble [0 × 3]>
10 <split [81/9]> Fold10 <tibble [2 × 4]> <tibble [0 × 3]>
# Get model metrics
collect_metrics(mavo_wf_rs1)
# A tibble: 2 × 6
  .metric .estimator    mean     n std_err .config             
  <chr>   <chr>        <dbl> <int>   <dbl> <chr>               
1 rmse    standard   691.       10 67.5    Preprocessor1_Model1
2 rsq     standard     0.512    10  0.0592 Preprocessor1_Model1

We can see that the cross-validated RMSE for the linear model using only DOSE as a predictor is 691, slightly lower than the 703 computed on the training data itself. Cross-validation also gives us a standard error (67.5), which tells us how much the estimate varies across folds, so it provides a more honest picture of performance than a single evaluation on the training data.
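
The standard error reported by collect_metrics() is derived from the spread of the per-fold estimates; those per-fold values can be inspected directly by turning off the summarizing, as in this small sketch:

# Sketch: look at the per-fold RMSE values behind the summarized CV estimate
collect_metrics(mavo_wf_rs1, summarize = FALSE) %>%
  filter(.metric == "rmse")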

Now let’s do the same thing for the complex linear model using all predictors.

# Set the seed for reproducibility
set.seed(rngseed)

# Create and initialize workflow that bundles together model specification and formula. ChatGPT helped me here.
mavo_wf2 <- workflow() %>% 
  add_model(mavo_spec) %>% 
  add_formula(Y ~ .)
# Resample the model
mavo_wf_rs2 <- fit_resamples(mavo_wf2, resamples = folds)
mavo_wf_rs2
# Resampling results
# 10-fold cross-validation 
# A tibble: 10 × 4
   splits         id     .metrics         .notes          
   <list>         <chr>  <list>           <list>          
 1 <split [81/9]> Fold01 <tibble [2 × 4]> <tibble [0 × 3]>
 2 <split [81/9]> Fold02 <tibble [2 × 4]> <tibble [0 × 3]>
 3 <split [81/9]> Fold03 <tibble [2 × 4]> <tibble [0 × 3]>
 4 <split [81/9]> Fold04 <tibble [2 × 4]> <tibble [0 × 3]>
 5 <split [81/9]> Fold05 <tibble [2 × 4]> <tibble [0 × 3]>
 6 <split [81/9]> Fold06 <tibble [2 × 4]> <tibble [0 × 3]>
 7 <split [81/9]> Fold07 <tibble [2 × 4]> <tibble [0 × 3]>
 8 <split [81/9]> Fold08 <tibble [2 × 4]> <tibble [0 × 3]>
 9 <split [81/9]> Fold09 <tibble [2 × 4]> <tibble [0 × 3]>
10 <split [81/9]> Fold10 <tibble [2 × 4]> <tibble [0 × 3]>
# Get model metrics
collect_metrics(mavo_wf_rs2)
# A tibble: 2 × 6
  .metric .estimator    mean     n std_err .config             
  <chr>   <chr>        <dbl> <int>   <dbl> <chr>               
1 rmse    standard   646.       10 64.8    Preprocessor1_Model1
2 rsq     standard     0.573    10  0.0686 Preprocessor1_Model1

For the complex linear model, we have a cross-validated RMSE of 646 with a standard error of 64.8. This is a better fit than the simple linear model under CV (691). The CV estimate is higher than the 627 obtained when evaluating the same model on the full training data, which is expected: metrics computed on the data the model was fit to are optimistically biased, while the CV estimate comes from held-out folds.

Finally, we will run the CV code again, this time choosing a different value for the random seed. Even if the individual RMSE values shift, the overall pattern between the two models should stay the same.

# Set new seed for reproducibility
set.seed(13579)

# Simple linear model
# Create and initialize workflow that bundles together model specification and formula. ChatGPT helped me here.
mavo_wf <- workflow() %>% 
  add_model(mavo_spec) %>% 
  add_formula(Y ~ DOSE)
# Resample the model
mavo_wf_rs3 <- fit_resamples(mavo_wf, resamples = folds)
mavo_wf_rs3
# Resampling results
# 10-fold cross-validation 
# A tibble: 10 × 4
   splits         id     .metrics         .notes          
   <list>         <chr>  <list>           <list>          
 1 <split [81/9]> Fold01 <tibble [2 × 4]> <tibble [0 × 3]>
 2 <split [81/9]> Fold02 <tibble [2 × 4]> <tibble [0 × 3]>
 3 <split [81/9]> Fold03 <tibble [2 × 4]> <tibble [0 × 3]>
 4 <split [81/9]> Fold04 <tibble [2 × 4]> <tibble [0 × 3]>
 5 <split [81/9]> Fold05 <tibble [2 × 4]> <tibble [0 × 3]>
 6 <split [81/9]> Fold06 <tibble [2 × 4]> <tibble [0 × 3]>
 7 <split [81/9]> Fold07 <tibble [2 × 4]> <tibble [0 × 3]>
 8 <split [81/9]> Fold08 <tibble [2 × 4]> <tibble [0 × 3]>
 9 <split [81/9]> Fold09 <tibble [2 × 4]> <tibble [0 × 3]>
10 <split [81/9]> Fold10 <tibble [2 × 4]> <tibble [0 × 3]>
# Get model metrics
collect_metrics(mavo_wf_rs3)
# A tibble: 2 × 6
  .metric .estimator    mean     n std_err .config             
  <chr>   <chr>        <dbl> <int>   <dbl> <chr>               
1 rmse    standard   691.       10 67.5    Preprocessor1_Model1
2 rsq     standard     0.512    10  0.0592 Preprocessor1_Model1
# Complex linear model
# Create and initialize workflow that bundles together model specification and formula. ChatGPT helped me here.
mavo_wf <- workflow() %>% 
  add_model(mavo_spec) %>% 
  add_formula(Y ~ .)
# Resample the model with a different seed
mavo_wf_rs4 <- fit_resamples(mavo_wf, resamples = folds)
mavo_wf_rs4
# Resampling results
# 10-fold cross-validation 
# A tibble: 10 × 4
   splits         id     .metrics         .notes          
   <list>         <chr>  <list>           <list>          
 1 <split [81/9]> Fold01 <tibble [2 × 4]> <tibble [0 × 3]>
 2 <split [81/9]> Fold02 <tibble [2 × 4]> <tibble [0 × 3]>
 3 <split [81/9]> Fold03 <tibble [2 × 4]> <tibble [0 × 3]>
 4 <split [81/9]> Fold04 <tibble [2 × 4]> <tibble [0 × 3]>
 5 <split [81/9]> Fold05 <tibble [2 × 4]> <tibble [0 × 3]>
 6 <split [81/9]> Fold06 <tibble [2 × 4]> <tibble [0 × 3]>
 7 <split [81/9]> Fold07 <tibble [2 × 4]> <tibble [0 × 3]>
 8 <split [81/9]> Fold08 <tibble [2 × 4]> <tibble [0 × 3]>
 9 <split [81/9]> Fold09 <tibble [2 × 4]> <tibble [0 × 3]>
10 <split [81/9]> Fold10 <tibble [2 × 4]> <tibble [0 × 3]>
# Get model metrics
collect_metrics(mavo_wf_rs4)
# A tibble: 2 × 6
  .metric .estimator    mean     n std_err .config             
  <chr>   <chr>        <dbl> <int>   <dbl> <chr>               
1 rmse    standard   646.       10 64.8    Preprocessor1_Model1
2 rsq     standard     0.573    10  0.0686 Preprocessor1_Model1

For the simple linear model, we get an RMSE of 691 with a standard error of 67.5, and for the complex linear model an RMSE of 646 with a standard error of 64.8, identical to the results under the previous seed. This is because the folds object was created earlier and is reused here, so changing the seed has no effect on the resampling itself. Either way, the multiple linear regression continues to outperform the simple linear regression.
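
If the goal is to see how a different seed changes the CV estimates, the folds themselves need to be recreated after setting the new seed; a hedged sketch of what that could look like (the names folds_new and wf_dose are mine, and the resulting numbers would differ from the tables above):

# Sketch: recreate the folds under the new seed so the resampling actually changes
set.seed(13579)
folds_new <- vfold_cv(train_data, v = 10)

wf_dose <- workflow() %>%   # dose-only workflow, as above
  add_model(mavo_spec) %>%
  add_formula(Y ~ DOSE)

fit_resamples(wf_dose, resamples = folds_new) %>%
  collect_metrics()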

Rachel Robertson contributed to this section of the fitting exercise

We will begin by adding the predictions made by the three models to a data frame that contains the observed values of Y. They will then be plotted together, along with a 45-degree reference line, to compare the predictions with the outcome (Y).

# Augment linear regressions into data frame
mavo_lm3_aug <- augment(mavo_lm3, new_data = train_data) %>%
  select("Y", ".pred", ".resid") # select only the Y column and predictions
names(mavo_lm3_aug)[names(mavo_lm3_aug) == '.pred'] <- 'lm3_pred' # change name of prediction variable
names(mavo_lm3_aug)[names(mavo_lm3_aug) == '.resid'] <- 'lm3_resid' # change name of residual variable

mavo_lm4_aug <- augment(mavo_lm4, new_data = train_data)%>%
  select("Y", ".pred", ".resid")
names(mavo_lm4_aug)[names(mavo_lm4_aug) == '.pred'] <- 'lm4_pred'
names(mavo_lm4_aug)[names(mavo_lm4_aug) == '.resid'] <- 'lm4_resid'

mavo_lm0_aug <- augment(mavo_lm0, new_data = train_data) %>%
  select("Y", ".pred", ".resid") # select y and prediction column
names(mavo_lm0_aug)[names(mavo_lm0_aug) == '.pred'] <- 'lm0_pred'
names(mavo_lm0_aug)[names(mavo_lm0_aug) == '.resid'] <- 'lm0_resid'

combined_pred <- mavo_lm3_aug %>%
  left_join(mavo_lm4_aug, by='Y') %>%  
  left_join(mavo_lm0_aug, by='Y') # left join all three augmented data frames by Y
str(combined_pred) # Check the structure of the combined data frame
tibble [90 × 7] (S3: tbl_df/tbl/data.frame)
 $ Y        : num [1:90] 3004 1347 2772 2028 2353 ...
 $ lm3_pred : num [1:90] 3207 1871 2539 1871 3207 ...
 $ lm3_resid: num [1:90] -202 -524 233 157 -853 ...
 $ lm4_pred : num [1:90] 3303 1953 2745 2081 2894 ...
 $ lm4_resid: num [1:90] -298.8 -605.9 26.8 -53.6 -540.8 ...
 $ lm0_pred : num [1:90] 2509 2509 2509 2509 2509 ...
 $ lm0_resid: num [1:90] 495 -1163 263 -482 -156 ...
# Plot the combined data frame
combinedplot <- ggplot(combined_pred, aes(x = Y)) + # add observed Y values
  geom_point(aes(y = lm3_pred, color = "Dose v. Y model")) + # add aes layer for lm3 predictions with a color for model name
  geom_point(aes(y = lm4_pred, color = "Full model")) + # add aes layer for lm4 predictions with a color for model name
  geom_point(aes(y = lm0_pred, color = "Null model")) + # add aes layer for lm0 predictions with a color for model name
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") + # create 45 degree line
  labs(x = "Outcome (Y)", y = "Model Predictions", title = "Observed Value of Y versus Model Predictions") + # add axis labels and title
  scale_color_manual(values = c("Dose v. Y model" = "red", "Full model" = "blue", "Null model" = "green")) + # add colors for each model name
  theme_minimal()
combinedplot

# Save figure
figure_file <- here("fitting-exercise", "combinedplot.png")
ggsave(filename = figure_file, plot=combinedplot, bg="white")

By plotting all three models together, we can see that the null model produces a single horizontal line and the model with just DOSE produces three horizontal lines. This is because only three dose levels (25, 37.5, and 50) were administered. Lastly, the full model (including all the predictors) produces a more scattered cloud of points, which best captures the trend in the Y variable. The lower predictions of the full model lie closer to the 45-degree line than the larger values, which suggests the model still might not be capturing some trend within the observed values.
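
The three horizontal bands from the dose-only model correspond to its three possible predictions, one per dose level; a quick sketch that makes this explicit (the dose values 25, 37.5, and 50 come from the summary table earlier):

# Sketch: the dose-only model can produce only three distinct predictions
predict(mavo_lm3, new_data = tibble(DOSE = c(25, 37.5, 50)))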

# Residual plot
residualplot <- ggplot(mavo_lm4_aug, aes(x = lm4_pred, y = lm4_resid)) +
  geom_point() +  # Scatter plot of residuals
  geom_hline(yintercept = 0, linetype = "dashed") +  # add a dashed horizontal line at y = 0
  labs(x = "Predicted Values", y = "Residuals", title = "Residual Plot") +
  coord_cartesian(ylim = c(-2500, 2500)) +  # set y-axis range to -2500, 2500
  theme_minimal()
residualplot

# Save figure
figure_file <- here("fitting-exercise", "residualplot.png")
ggsave(filename = figure_file, plot=residualplot, bg="white")

We can see from the residual plot of the full model that there is some pattern: the residuals become more spread out as the predicted values increase, and there are outliers at the larger predicted values.

We will now examine the model fit by bootstrapping. We will create 100 bootstrap resamples of the training data using the rsample package and then loop over the resamples, refitting the full model each time. The predictions will be stored in a matrix and then summarized (median and 95% confidence interval for each observation) and plotted against the observed Y values to assess the model fit.

# set seed
set.seed(rngseed)

# bootstrapping
lm4_boot <- bootstraps(train_data, times = 100)  # bootstrap the train data 100 times

boot_pred <- array(NA, dim = c(nrow(train_data), nrow(lm4_boot))) # matrix to store predictions: one column per bootstrap resample

# creating a loop over the 100 bootstrap resamples
for (i in 1:nrow(lm4_boot)) {
  
  bootstrap <- analysis(lm4_boot$splits[[i]]) # extract the splits from the bootstrap 
  model <- lm(Y ~ ., data = bootstrap) # fit the model for the bootstrap using full model formula and bootstrap sample data
  pred <- predict(model, newdata = train_data) # predict the new data from training data
  boot_pred[,i] <- pred # put the predictions into the array vector
}

dat_bs <- analysis(lm4_boot$splits[[i]]) # extract individual bootstrap samples
print(dat_bs)
# A tibble: 90 × 6
       Y  DOSE   AGE SEX      WT    HT
   <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
 1  826.    25    30 1     105.   1.88
 2 3594.    50    23 1      96.3  1.82
 3 3594.    50    23 1      96.3  1.82
 4 3609.    50    29 1      68.8  1.81
 5 3623.    50    32 1      92.2  1.80
 6 1949.    50    39 1      75.7  1.78
 7 2345.    50    31 1      85.9  1.72
 8  826.    25    30 1     105.   1.88
 9 2093.    25    39 2      58.2  1.62
10 1347.    25    41 1      81    1.75
# ℹ 80 more rows
# compute the mean and confidence intervals 
mean_ci <- function(x) { # helper returning the mean and the 2.5%, 50%, and 97.5% quantiles of a vector
  mean <- mean(x) #mean(x) applies mean function to unspecified vector
  ci <- quantile(x, c(0.025, 0.5, 0.975)) #applies 95% CI quantiles to vector
  return(c(mean, ci))
}

preds <- apply(boot_pred, 1, mean_ci) # apply the helper to each row (each observation across the bootstrap resamples)

preds_t <- t(preds) # transpose so rows are observations and columns are (mean, lower, median, upper)

# Placing the bootstrap data and predictions into a new data frame
plot_data <- data.frame( # new data frame containing the observed Y, point estimates, and bootstrap summaries
  observed = train_data$Y,  # Observed values
  point_estimates = mavo_lm4_aug$lm4_pred,  # Point estimates from the full model fit
  median = preds_t[, 3],  # Median prediction from bootstrap sampling
  lower_CI = preds_t[, 2],  # Lower 95% confidence limit from bootstrap sampling
  upper_CI = preds_t[, 4]   # Upper 95% confidence limit from bootstrap sampling
)

# plot the point estimates and bounds along with the observed Y
ggplot(plot_data, aes(x = observed, y = point_estimates)) +
  geom_point(color = "black") +  # add layer for point estimates
  geom_point(aes(y = median), color = "blue") +  # add layer for median
  geom_errorbar(aes(ymin = lower_CI, ymax = upper_CI), color = "red", width = 0.2) +  # add layer for confidence intervals
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +  # add 45 degree line
  labs(x = "Y Values Observed", y = "Mean Predicted Values", title = "Observed vs Predicted Values with Bootstrap Confidence Intervals") +  # Axes labels and title
  theme_minimal() 

The plot above shows the point estimates from the full model fit (black), the median predicted values from the bootstrap (blue), and the lower and upper bootstrap confidence limits (red), compared to a 45-degree line (black, dashed). The point estimates follow the dashed line fairly closely, showing a reasonably good fit; however, the bootstrap medians and confidence intervals are denser at lower values than at higher values.

This plot suggests that the point-estimate predictions from the model fit track the observed values better than the summarized predictions from bootstrapping. This may be because the summaries can be shifted by skewness, and the predictions themselves look skewed, becoming sparser at higher values. The confidence interval bars (red) show the upper and lower limits predicted by bootstrapping; they are denser towards the lower end of the y-axis, which might also reflect skewness or a trend that is not captured by the model. The predictions likewise get sparser at the high end of the y-axis, pointing to the same issue. This might be improved by adding more variables (columns) to the dataset or increasing the sample size (rows) used to fit the model.

Exercise Part 3: Back to Cassia

Now, we will use the fit of model 2 from the training data to make predictions for the test data. This will give us an indication of how the model generalizes to data that wasn’t used to construct the model.

# ChatGPT helped me with this code. Assuming mavo_lm4 is already trained and test_data is prepared
# Make predictions on training data
train_pred <- predict(mavo_lm4, new_data = train_data)
print(train_pred)
# A tibble: 90 × 1
   .pred
   <dbl>
 1 3303.
 2 1953.
 3 2745.
 4 2081.
 5 2894.
 6 1265.
 7 2429.
 8 1976.
 9 1561.
10 2549.
# ℹ 80 more rows
# Make predictions on test data
test_pred <- predict(mavo_lm4, new_data = test_data)
print(test_pred)
# A tibble: 30 × 1
   .pred
   <dbl>
 1 1871.
 2 2665.
 3 2357.
 4 2954.
 5 2101.
 6 3479.
 7 3399.
 8 3214.
 9 2857.
10 1826.
# ℹ 20 more rows
# Create dataframes for plotting
train_data_plot <- data.frame(
  Observed = train_data$Y,
  Predicted = train_pred$.pred
)

test_data_plot <- data.frame(
  Observed = test_data$Y,
  Predicted = test_pred$.pred
)

# Create the plot
finalplot <- ggplot() +
  geom_point(data = train_data_plot, aes(x = Observed, y = Predicted, shape = "train"), color = "blue", alpha = 0.5) +  # training data: observed vs predicted
  geom_point(data = test_data_plot, aes(x = Observed, y = Predicted, shape = "test"), color = "red", alpha = 0.5) +  # test data: observed vs predicted
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black") +  # Add identity line
  scale_shape_manual(values = c("train" = 1, "test" = 4)) +  # Assign a specific shape to each data set
  labs(x = "Observed", y = "Predicted", shape = "Data", title = "Predicted vs Observed") +
  theme_minimal()
finalplot

# Save figure
figure_file <- here("fitting-exercise", "finalplot.png")
ggsave(filename = figure_file, plot=finalplot, bg="white")
Saving 7 x 5 in image

As Dr. Handel wrote, the predicted-versus-observed points for the test data are mixed in with those for the training data, which suggests the model generalizes reasonably well and is not overfitting.
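
To put a number on that impression, the same RMSE metric can be computed separately on the training and test predictions; a small sketch using the objects created above (roughly comparable values would support the reading that the model is not overfitting):

# Sketch: compare RMSE on the training data versus the test data
rmse_train <- train_pred %>%
  bind_cols(train_data %>% select(Y)) %>%
  rmse(truth = Y, estimate = .pred)
rmse_test <- test_pred %>%
  bind_cols(test_data %>% select(Y)) %>%
  rmse(truth = Y, estimate = .pred)
bind_rows(train = rmse_train, test = rmse_test, .id = "data")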

Now let’s do some thinking about the overall model assessment:

1. All of our models perform better than the null model, based on the RMSEs computed above and the plots.

2. Model 1, which only uses DOSE as a predictor, improves on the null model because it has a lower RMSE. This makes sense, given that the null model only predicts the mean. However, Model 1 is not necessarily usable in a practical sense: much of the variation remains unexplained, so the results might be “significant” without being particularly meaningful.

3. Model 2, with all the predictors, further improves the results (based on the RMSE and on our visualizations of the predictions). The CV and bootstrap resampling further suggest that the model is reasonably robust, and the results make sense: we are capturing a clearer picture of which predictor variables influence the outcome. However, I wouldn’t consider this model usable for any real purpose, mainly because of how we constructed the data (the small sample size and the way we manipulated the variables). As Dr. Handel said in Week 8, creating our dependent variable Y by adding up all the DV values for each individual, instead of doing a proper time-series analysis, “is a pretty bad idea.”