Evaluation Results
Last iteration of these models: 2020-12-31
Change Summary
- New coverages (@Lucas for details)
- Predictions bounded to [0, Inf) (minimal improvement)
- Tax parcel data used as predictors
- At the moment, only the X00-level categories are used, with 400, 500, 700, and 800 rolled into a collective "other" class due to low representation (see the sketch after this list)
- This grouping is due to time constraints and is likely not optimal.
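A minimal sketch of those two preprocessing steps, assuming the modeling table is called `model_df`, the parcel class column is `tax_category`, and raw model output lives in `raw_pred` (all of these names are hypothetical):

```r
# Collapse the low-representation X00-level parcel classes into "other"
# (column name and level coding are assumptions).
lump <- model_df$tax_category %in% c("400", "500", "700", "800")
model_df$tax_category[lump] <- "other"

# Bound predictions to [0, Inf): negative biomass predictions are truncated at 0.
bounded_pred <- pmax(raw_pred, 0)
```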
| Metric | RF (ranger) | GBM (LightGBM) | SVM (kernlab) | Ensemble (model weighted) | Ensemble (RMSE weighted) |
|--------|-------------|----------------|---------------|---------------------------|--------------------------|
| RMSE   | 38.617      | 38.496         | 38.574        | 37.546                    | 37.798                   |
| MBE    | -1.313      | -0.878         | -5.357        | -2.230                    | -2.547                   |
| R2     | 0.761       | 0.761          | 0.768         | 0.774                     | 0.773                    |
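For reference, the three metrics in the table can be computed as below; the exact formulas used for this report (in particular the sign convention for MBE) are not spelled out here, so treat these definitions as assumptions:

```r
# Root mean squared error, in the same units as AGB (Mg/ha)
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Mean bias error; with this sign convention, negative values mean under-prediction on average
mbe <- function(obs, pred) mean(pred - obs)

# Coefficient of determination as 1 - SSE/SST
r2 <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
```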
AGB Distribution
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|-------|---------|--------|--------|---------|---------|
| 0.000 | 9.645   | 86.795 | 91.792 | 148.679 | 425.363 |
Bootstrapping Results
Across 1000 bootstrap iterations, our ensemble model had a mean RMSE of 37.97 \(\pm\) 0.355.
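A sketch of how that bootstrap summary could be produced, assuming a data frame `pred_values` holding observed `agb_mgha` and ensemble predictions in `ens_pred` (the prediction column name is an assumption), and taking the \(\pm\) value as the standard deviation across iterations:

```r
set.seed(1)  # arbitrary seed for the sketch

boot_rmse <- replicate(1000, {
  i <- sample(nrow(pred_values), replace = TRUE)  # resample rows with replacement
  sqrt(mean((pred_values$agb_mgha[i] - pred_values$ens_pred[i])^2))
})

mean(boot_rmse)  # reported mean RMSE
sd(boot_rmse)    # reported spread (assumed to be the +/- value)
```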
RMSE Distribution (figure)
Plot Errors (figure)
Validation Results
| RMSE     | Min    | Median | Max    |
|----------|--------|--------|--------|
| RF       | 34.981 | 39.895 | 45.285 |
| LGB      | 36.127 | 40.427 | 46.676 |
| SVM      | 35.101 | 39.619 | 45.833 |
| Ensemble | 35.316 | 39.255 | 44.952 |

| R2       | Min   | Median | Max   |
|----------|-------|--------|-------|
| RF       | 0.684 | 0.740  | 0.792 |
| LGB      | 0.661 | 0.732  | 0.791 |
| SVM      | 0.674 | 0.750  | 0.808 |
| Ensemble | 0.684 | 0.749  | 0.801 |
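One way to build the min/median/max summaries above, assuming the out-of-fold predictions sit in `pred_values` with one column per model (`rf_pred`, `lgb_pred`, `svm_pred`, `ens_pred`; the ensemble column name is an assumption) and that the spread is taken across the `fold_index` groups:

```r
library(dplyr)
library(tidyr)

pred_values %>%
  pivot_longer(c(rf_pred, lgb_pred, svm_pred, ens_pred),
               names_to = "model", values_to = "pred") %>%
  group_by(model, fold_index) %>%
  summarise(
    rmse = sqrt(mean((agb_mgha - pred)^2)),
    r2   = 1 - sum((agb_mgha - pred)^2) / sum((agb_mgha - mean(agb_mgha))^2),
    .groups = "drop"
  ) %>%
  group_by(model) %>%
  summarise(across(c(rmse, r2), list(min = min, median = median, max = max)))
```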
Ensembles
- RMSE-weighted model weights: lgb = 0.3288537, rf = 0.3307084, svm = 0.3404378
- Model-weighted ensemble: a linear model of observed AGB on the three component predictions, summarized below.
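How the RMSE-derived weights are computed is not stated here; one common choice is to weight each model by its inverse RMSE, normalize to sum to 1, and take a weighted average of the component predictions, roughly as in this sketch (the RMSE values fed in are placeholders):

```r
# Placeholder per-model RMSEs; substitute whichever RMSEs the weights are actually based on
cv_rmse <- c(lgb = 40.4, rf = 39.9, svm = 39.6)

# Inverse-RMSE weights, normalized to sum to 1
w <- (1 / cv_rmse) / sum(1 / cv_rmse)

# RMSE-weighted ensemble prediction from the three component predictions
ens_rmse_weighted <- w["rf"]  * pred_values$rf_pred +
                     w["lgb"] * pred_values$lgb_pred +
                     w["svm"] * pred_values$svm_pred
```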
Call:
lm(formula = agb_mgha ~ rf_pred * lgb_pred * svm_pred, data = pred_values)
Residuals:
Min 1Q Median 3Q Max
-140.574 -20.252 -0.105 12.575 211.587
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.703e-01 5.548e-01 -1.569 0.1167
rf_pred 3.020e-01 6.630e-02 4.556 5.25e-06 ***
lgb_pred -4.926e-03 6.565e-02 -0.075 0.9402
svm_pred 7.300e-01 5.500e-02 13.272 < 2e-16 ***
rf_pred:lgb_pred 7.808e-04 3.958e-04 1.973 0.0485 *
rf_pred:svm_pred -8.177e-04 4.952e-04 -1.651 0.0987 .
lgb_pred:svm_pred -1.133e-04 5.587e-04 -0.203 0.8393
rf_pred:lgb_pred:svm_pred 9.406e-07 1.291e-06 0.729 0.4662
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 39.31 on 23792 degrees of freedom
Multiple R-squared: 0.7489, Adjusted R-squared: 0.7489
F-statistic: 1.014e+04 on 7 and 23792 DF, p-value: < 2.2e-16
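The lm() summary above appears to be the model-weighted ensemble: a linear stack fit on the out-of-fold component predictions. A sketch of refitting and applying it, with the result bounded to [0, Inf) as in the change summary (the new-data values are hypothetical):

```r
# Refit the stacking regression shown in the summary above
stack_fit <- lm(agb_mgha ~ rf_pred * lgb_pred * svm_pred, data = pred_values)

# Apply it to new component predictions and bound the result to [0, Inf)
new_preds <- data.frame(rf_pred = 120, lgb_pred = 115, svm_pred = 118)  # hypothetical values
ens_model_weighted <- pmax(predict(stack_fit, newdata = new_preds), 0)
```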
\(n\) and \(p\)
- 1147 observations
- 65 predictors
- X, n, zmean, zmean_c, max, quad_mean, quad_mean_c, cv, cv_c, z_kurt, z_skew, L2, L3, L4, L_cv, L_skew, L_kurt, h10, h20, h30, h40, h50, h60, h70, h80, h90, h95, h99, hvol, cancov, rpc1, d10, d20, d30, d40, d50, d60, d70, d80, d90, stems, ca_max, ca_mean, ca_min, ca25, ca50, ca75, ca90, ca95, precip, tmin, tmax, twi, slope, aspect, elev, tax_category_100, tax_category_200, tax_category_300, tax_category_600, tax_category_900, tax_category_1000, fold_index, tax_category_2000, tax_category_other
Component Models
- Tuning used 5-fold CV
- Final hyperparameters:
  - RF (ranger): formula = agb_mgha ~ ., num.trees = 750, mtry = 20, min.node.size = 1, sample.fraction = 0.5, splitrule = "maxstat", replace = TRUE
  - GBM (LightGBM): learning_rate = 0.1, nrounds = 50, num_leaves = 5, max_depth = -1, extra_trees = TRUE, min_data_in_leaf = 10, bagging_fraction = 0.3, bagging_freq = 1, feature_fraction = 0.5, min_data_in_bin = 24, lambda_l1 = 0, lambda_l2 = 0.1, force_col_wise = TRUE
  - SVM (kernlab): x = agb_mgha ~ ., type = "nu-svr", kernel = "laplacedot", kpar = list(sigma = 0.001953125), C = 64, epsilon = 0.001953125, nu = 1
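For concreteness, a sketch of fitting the three component models with the hyperparameters listed above, assuming a training data frame `train_df` that contains `agb_mgha` plus the predictors, and calling the ranger, lightgbm, and kernlab interfaces directly (whatever tuning or wrapper framework was actually used is not shown here):

```r
library(ranger)
library(lightgbm)
library(kernlab)

# Random forest (ranger)
rf_fit <- ranger(
  agb_mgha ~ ., data = train_df,
  num.trees = 750, mtry = 20, min.node.size = 1,
  sample.fraction = 0.5, splitrule = "maxstat", replace = TRUE
)

# Gradient boosting (LightGBM); predictors go in as a numeric matrix
x_mat <- as.matrix(train_df[, setdiff(names(train_df), "agb_mgha")])
dtrain <- lgb.Dataset(x_mat, label = train_df$agb_mgha,
                      params = list(min_data_in_bin = 24))
lgb_fit <- lgb.train(
  params = list(
    objective = "regression", learning_rate = 0.1, num_leaves = 5,
    max_depth = -1, extra_trees = TRUE, min_data_in_leaf = 10,
    bagging_fraction = 0.3, bagging_freq = 1, feature_fraction = 0.5,
    lambda_l1 = 0, lambda_l2 = 0.1, force_col_wise = TRUE
  ),
  data = dtrain, nrounds = 50
)

# Support vector regression (kernlab)
svm_fit <- ksvm(
  agb_mgha ~ ., data = train_df,
  type = "nu-svr", kernel = "laplacedot",
  kpar = list(sigma = 0.001953125),
  C = 64, epsilon = 0.001953125, nu = 1
)
```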