CAFRI Labs: Ensemble Bias Corrections

Mike Mahoney

Evaluation Results

Last iteration of these models: 2021-02-02

Change Summary

RMSE-weighted and linear model ensembles are post-processed with a bias adjustment:
- After the model is fit to the training data, two linear models (AGB ~ prediction and AGB ~ poly(prediction, 2)) are fit, regressing the observed AGB in the training data on the model's predictions
- Predictions are then adjusted using whichever of the two linear models has the lower RMSE against the training sample
- Bias models are recalculated for each training sample, so the holdout set uses a single adjuster for each component model, the bootstrap uses 1,000 different adjusters for each model, and the validation stage uses 100
- RMSE-weighted and linear model ensembles are fit using predictions that have not been bias-corrected
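
The adjustment described above can be sketched in base R. This is a minimal illustration on synthetic data, not the code used to produce this report: the data frame `train`, the helper `rmse()`, and the simulated bias are all assumptions made for the example.

```r
# Synthetic stand-in data (hypothetical; the report uses field AGB and model predictions).
set.seed(42)
n <- 200
agb <- runif(n, 0, 400)                        # "observed" AGB
pred <- 20 + 0.8 * agb + rnorm(n, sd = 30)     # biased, noisy model predictions
train <- data.frame(agb = agb, pred = pred)

rmse <- function(obs, est) sqrt(mean((obs - est)^2))

# The two candidate adjusters named in the change summary.
candidates <- list(
  linear    = lm(agb ~ pred, data = train),
  quadratic = lm(agb ~ poly(pred, 2), data = train)
)

# Score each adjuster against the training sample and keep the one with the lower RMSE.
train_rmse <- sapply(candidates, function(m) rmse(train$agb, predict(m, train)))
best <- candidates[[which.min(train_rmse)]]

# Apply the chosen adjuster to new predictions.
new_pred <- data.frame(pred = c(10, 150, 300))
adjusted <- predict(best, newdata = new_pred)
```

In this sketch the quadratic adjuster can never score a higher training RMSE than the linear one, because the linear model is nested within it and both are fit by least squares on the same sample.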

Metric	RF (ranger)	GBM (LightGBM)	SVM (kernlab)	Ensemble (model weighted)	Ensemble (RMSE weighted)
RMSE (Mg/ha)	36.031	35.721	36.041	35.338	35.041
MBE (Mg/ha)	3.911	3.189	1.136	3.078	3.297
R2	0.785	0.789	0.783	0.793	0.797
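
This report does not state how the three metrics are computed, so the following is a sketch under common definitions. Two points are assumptions: that MBE is taken as predicted minus observed (so positive values mean overprediction), and that R2 is one minus the ratio of residual to total sum of squares rather than a squared correlation. The example vectors are made up.

```r
# Assumed metric definitions (not confirmed by the report).
rmse <- function(obs, est) sqrt(mean((est - obs)^2))
mbe  <- function(obs, est) mean(est - obs)     # assumed sign: positive = overprediction
r2   <- function(obs, est) {
  1 - sum((obs - est)^2) / sum((obs - mean(obs))^2)
}

# Hypothetical observed and predicted AGB values.
obs <- c(10, 50, 120, 200)
est <- c(14, 46, 130, 210)

c(RMSE = rmse(obs, est), MBE = mbe(obs, est), R2 = r2(obs, est))
```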

AGB Distribution

summary(bind_rows(training, testing)$agb_mgha)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   7.096  86.038  91.674 149.404 425.363

Validation Results

RMSE	Min	Median	Max
rf	34.326	40.118	45.497
lgb	33.807	40.151	46.160
svm	35.718	41.305	47.601
ensemble	34.947	40.604	46.303

R2	Min	Median	Max
rf	0.694	0.748	0.793
lgb	0.685	0.745	0.791
svm	0.644	0.730	0.780
ensemble	0.689	0.743	0.795

Metadata

Ensembles

RMSE-weighted model weights:

      lgb        rf       svm 
0.3362089 0.3365316 0.3272595
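
The report does not spell out how these weights are derived. One common scheme, assumed here purely for illustration, is to weight each model by the inverse of its RMSE and normalise the weights to sum to one. The RMSE values and predictions below are hypothetical and are not taken from this report.

```r
# Hypothetical per-model RMSE values.
model_rmse <- c(lgb = 40, rf = 40, svm = 50)

# Assumed scheme: inverse-RMSE weights, normalised to sum to 1.
weights <- (1 / model_rmse) / sum(1 / model_rmse)

# Hypothetical component predictions, one column per model.
preds <- cbind(lgb = c(100, 20), rf = c(110, 10), svm = c(90, 30))

# Weighted average; columns are reordered to match the weight names.
ensemble_pred <- as.vector(preds[, names(weights)] %*% weights)
```

Under this scheme, models with similar error receive near-equal weights, which is consistent with the near-uniform weights printed above.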

Linear model weights:


Call:
lm(formula = agb_mgha ~ rf_pred * lgb_pred * svm_pred, data = pred_values)

Residuals:
    Min      1Q  Median      3Q     Max 
-136.00  -21.05   -0.65   15.61  218.76 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)               -2.343e+00  5.795e-01  -4.044 5.27e-05 ***
rf_pred                    2.127e-01  8.528e-02   2.494 0.012649 *  
lgb_pred                   6.999e-01  8.870e-02   7.891 3.12e-15 ***
svm_pred                   1.927e-01  4.955e-02   3.889 0.000101 ***
rf_pred:lgb_pred          -5.553e-04  3.423e-04  -1.622 0.104723    
rf_pred:svm_pred           3.200e-03  6.147e-04   5.206 1.95e-07 ***
lgb_pred:svm_pred         -3.839e-03  6.161e-04  -6.232 4.70e-10 ***
rf_pred:lgb_pred:svm_pred  3.901e-06  1.284e-06   3.038 0.002388 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 39.84 on 22992 degrees of freedom
Multiple R-squared:  0.7472,    Adjusted R-squared:  0.7471 
F-statistic:  9707 on 7 and 22992 DF,  p-value: < 2.2e-16
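
For readers unfamiliar with this kind of stacked ensemble, the sketch below fits a model with the same formula as the `Call` above to synthetic component predictions and then uses it to combine new predictions. The simulated data and the object names `stack` and `new_preds` are assumptions for the example, so the coefficients will not match those printed above.

```r
# Synthetic component predictions (hypothetical noise levels).
set.seed(1)
n <- 500
agb_mgha <- runif(n, 0, 400)
pred_values <- data.frame(
  agb_mgha = agb_mgha,
  rf_pred  = agb_mgha + rnorm(n, sd = 35),
  lgb_pred = agb_mgha + rnorm(n, sd = 35),
  svm_pred = agb_mgha + rnorm(n, sd = 40)
)

# Same formula as the Call above: main effects plus all two- and three-way interactions.
stack <- lm(agb_mgha ~ rf_pred * lgb_pred * svm_pred, data = pred_values)

# Combine a new set of component predictions into one ensemble prediction.
new_preds <- data.frame(rf_pred = 120, lgb_pred = 110, svm_pred = 130)
ensemble_pred <- predict(stack, newdata = new_preds)
```

The `*` operator expands to eight terms (intercept, three main effects, three pairwise interactions, one three-way interaction), matching the eight rows of the coefficient table.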

Coverages

17 coverages:
- FEMA_FranklinStLawrence2016, FEMA_FultonSaratogaHerkimerFran, FEMA_GreatLakes2014, FEMA_OniedaSubbasin2016, NYSGPO_AlleganySteuben2016, NYSGPO_CayugaOswego_2018, NYSGPO_ColumbiaRensselaer2016, NYSGPO_ErieGeneseeLivingston201, NYSGPO_MadisonOtsego_2015, NYSGPO_Southwest_spring_2017, NYSGPO_SouthwestB_fall_2017, NYSGPO_WarrenWashingtonEssex_20, USGS_3County2014, USGS_ClintonEssexFranklin2014, USGS_LongIsland2014, USGS_NorthEast2011, USGS_Schoharie2014

\(n\) and \(p\)

1108 observations
- 767 training
- 341 testing
73 predictors
- n, zmean, zmean_c, max, quad_mean, quad_mean_c, cv, cv_c, z_kurt, z_skew, L2, L3, L4, L_cv, L_skew, L_kurt, h10, h20, h30, h40, h50, h60, h70, h80, h90, h95, h99, hvol, cancov, rpc1, d10, d20, d30, d40, d50, d60, d70, d80, d90, precip, tmin, tmax, twi, slope, aspect, elev, tax_code_105, tax_code_210, tax_code_240, tax_code_260, tax_code_270, tax_code_280, tax_code_312, tax_code_314, tax_code_322, tax_code_323, tax_code_910, tax_code_911, tax_code_912, tax_code_931, tax_code_932, tax_code_1000, tax_category_100, tax_category_200, tax_category_300, tax_category_900, tax_code_112, tax_code_120, tax_code_241, tax_code_321, tax_code_930, tax_code_941, tax_code_2000

Component Models

Hyperparameters were tuned using 5-fold cross-validation.
Final hyperparameters:

Random forest:

$num.trees
[1] 1000

$mtry
[1] 18

$min.node.size
[1] 7

$sample.fraction
[1] 0.2

$splitrule
[1] "variance"

$replace
[1] TRUE

$formula
agb_mgha ~ .

LightGBM:

$learning_rate
[1] 0.05

$nrounds
[1] 100

$num_leaves
[1] 5

$max_depth
[1] 2

$extra_trees
[1] TRUE

$min_data_in_leaf
[1] 10

$bagging_fraction
[1] 0.3

$bagging_freq
[1] 1

$feature_fraction
[1] 0.4

$min_data_in_bin
[1] 8

$lambda_l1
[1] 5

$lambda_l2
[1] 1

$force_col_wise
[1] TRUE

SVM:

$x
agb_mgha ~ .

$kernel
[1] "laplacedot"

$type
[1] "eps-svr"

$kpar
$kpar$sigma
[1] 0.0078125


$C
[1] 12

$epsilon
[1] 1.525879e-05

Bias correction models:

Linear model:

[1] ""

RMSE-weighted:

[1] ""
