CAFRI Labs: Component Bias Corrections

Mike Mahoney

Evaluation Results

Last iteration of these models: 2021-02-02

Change Summary

Each component model (RF, LGB, SVM) is now post-processed with a bias adjustment:
- After each model is fit to the training data, two linear bias models, AGB ~ prediction and AGB ~ poly(prediction, 2), are fit by regressing observed training AGB on that model's training predictions
- Predictions are then adjusted using whichever of the two bias models had the lower RMSE against the training sample
- Bias models are refit for each training sample, so the holdout evaluation uses a single adjuster per component model, the bootstrap uses 1,000 adjusters per component model, and the validation stage uses 100
- RMSE-weighted and linear model ensembles are fit using the bias-corrected predictions
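
The bias adjustment described in the list above can be sketched roughly as follows. This is a minimal illustration in base R, not the project's actual code: the helper names (`fit_bias_adjuster`, `adjust_predictions`), the column names, and the simulated data are all assumptions made for the example.

```r
# Root mean squared error between observed and predicted values
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Hypothetical helper: fit both candidate bias models on the training
# sample and keep whichever has the lower training RMSE
fit_bias_adjuster <- function(prediction, agb) {
  train <- data.frame(agb = agb, prediction = prediction)
  candidates <- list(
    linear    = lm(agb ~ prediction, data = train),
    quadratic = lm(agb ~ poly(prediction, 2), data = train)
  )
  train_rmse <- vapply(
    candidates,
    function(m) rmse(train$agb, fitted(m)),
    numeric(1)
  )
  candidates[[which.min(train_rmse)]]
}

# Hypothetical helper: apply a fitted adjuster to new raw predictions
adjust_predictions <- function(adjuster, prediction) {
  unname(predict(adjuster, newdata = data.frame(prediction = prediction)))
}

# Simulated stand-in for one component model's training predictions
set.seed(1)
agb <- runif(200, 0, 400)
raw_pred <- 20 + 0.8 * agb + rnorm(200, sd = 15)

adjuster  <- fit_bias_adjuster(raw_pred, agb)
corrected <- adjust_predictions(adjuster, raw_pred)
```

One property of this sketch is worth noting: because the quadratic model nests the linear one, its in-sample RMSE can never be higher, so selecting on RMSE computed this way will almost always pick the quadratic form.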

	RF (ranger)	GBM (LightGBM)	SVM (kernlab)	Ensemble (model weighted)	Ensemble (RMSE weighted)
RMSE	35.515	35.480	36.671	35.482	34.949
MBE	3.165	3.789	3.552	3.190	3.502
R2	0.791	0.792	0.780	0.791	0.798
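
For reference, the three metrics in the table could be computed along these lines. The report does not state its exact definitions, so the sign convention for MBE (predicted minus observed) and the form of R2 (one minus the ratio of residual to total sum of squares) are assumptions, and the small `obs` and `pred` vectors are made up for illustration.

```r
# Root mean squared error
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Mean bias error; assumed sign convention: positive means overprediction
mbe <- function(obs, pred) mean(pred - obs)

# Coefficient of determination; assumed form: 1 - SSE / SST
r2 <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# Made-up observed and predicted AGB values
obs  <- c(0, 50, 100, 150, 200)
pred <- c(10, 40, 110, 160, 190)

metrics <- c(RMSE = rmse(obs, pred), MBE = mbe(obs, pred), R2 = r2(obs, pred))
```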

AGB Distribution

summary(bind_rows(training, testing)$agb_mgha)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   7.096  86.038  91.674 149.404 425.363

Validation Results

RMSE	Min	Median	Max
rf	34.918	40.585	46.213
lgb	34.102	40.424	46.415
svm	36.847	41.906	47.528
ensemble	34.396	40.106	45.723

R2	Min	Median	Max
rf	0.689	0.744	0.794
lgb	0.682	0.744	0.792
svm	0.637	0.727	0.780
ensemble	0.693	0.747	0.796

Metadata

Ensembles

RMSE-weighted model weights:

      lgb        rf       svm 
0.3386723 0.3351334 0.3261943
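
As a rough sketch, weights like these can be applied as a weighted sum of the bias-corrected component predictions. The report does not say how the weights were derived; the `inverse_rmse_weights` helper below shows one common scheme (normalized inverse RMSE) purely as an assumption, and the `preds` matrix is made up for illustration.

```r
# Weights as reported above
weights <- c(lgb = 0.3386723, rf = 0.3351334, svm = 0.3261943)

# Made-up bias-corrected predictions, one column per component model
preds <- cbind(
  lgb = c(80, 120),
  rf  = c(90, 110),
  svm = c(70, 130)
)

# Ensemble prediction: weighted sum across component models,
# with columns matched to the weights by name
ensemble <- drop(preds[, names(weights)] %*% weights)

# Assumed scheme, not confirmed by the report: weights proportional
# to 1 / RMSE, normalized to sum to 1
inverse_rmse_weights <- function(rmse_values) {
  inv <- 1 / rmse_values
  inv / sum(inv)
}

example_weights <- inverse_rmse_weights(c(lgb = 35, rf = 36, svm = 38))
```

Under the assumed scheme a lower RMSE yields a larger weight, which agrees with the ordering of the reported weights (lgb, then rf, then svm) but is not evidence that this is how they were computed.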

Linear model weights:


Call:
lm(formula = agb_mgha ~ rf_pred * lgb_pred * svm_pred, data = pred_values)

Residuals:
    Min      1Q  Median      3Q     Max 
-135.84  -21.00   -0.54   15.65  220.05 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)               -2.030e+00  5.557e-01  -3.654 0.000259 ***
rf_pred                    2.987e-01  8.609e-02   3.470 0.000521 ***
lgb_pred                   7.340e-01  7.983e-02   9.195  < 2e-16 ***
svm_pred                   1.968e-01  4.337e-02   4.537 5.73e-06 ***
rf_pred:lgb_pred          -1.261e-03  3.022e-04  -4.174 3.00e-05 ***
rf_pred:svm_pred           2.327e-03  5.495e-04   4.234 2.30e-05 ***
lgb_pred:svm_pred         -3.083e-03  4.915e-04  -6.272 3.63e-10 ***
rf_pred:lgb_pred:svm_pred  3.270e-06  1.085e-06   3.015 0.002572 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 39.79 on 22992 degrees of freedom
Multiple R-squared:  0.7478,    Adjusted R-squared:  0.7478 
F-statistic:  9741 on 7 and 22992 DF,  p-value: < 2.2e-16
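
The `*` operator in the formula above expands to all main effects plus every two-way and three-way interaction, which is why the summary lists eight coefficients. A minimal sketch on simulated data follows; the `pred_values` frame here is fabricated for illustration and only reuses the column names from the call.

```r
# Simulated stand-in for the component predictions; values are made up
set.seed(42)
n <- 500
agb_mgha <- runif(n, 0, 400)
pred_values <- data.frame(
  agb_mgha = agb_mgha,
  rf_pred  = agb_mgha + rnorm(n, sd = 35),
  lgb_pred = agb_mgha + rnorm(n, sd = 35),
  svm_pred = agb_mgha + rnorm(n, sd = 37)
)

# Same formula as in the call above: intercept, three main effects,
# three two-way interactions, and one three-way interaction
stack_fit <- lm(agb_mgha ~ rf_pred * lgb_pred * svm_pred, data = pred_values)
coef_names <- names(coef(stack_fit))

# Ensemble predictions from the fitted linear model
stacked <- predict(stack_fit, newdata = pred_values)
```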

Coverages

17 coverages:
- FEMA_FranklinStLawrence2016, FEMA_FultonSaratogaHerkimerFran, FEMA_GreatLakes2014, FEMA_OniedaSubbasin2016, NYSGPO_AlleganySteuben2016, NYSGPO_CayugaOswego_2018, NYSGPO_ColumbiaRensselaer2016, NYSGPO_ErieGeneseeLivingston201, NYSGPO_MadisonOtsego_2015, NYSGPO_Southwest_spring_2017, NYSGPO_SouthwestB_fall_2017, NYSGPO_WarrenWashingtonEssex_20, USGS_3County2014, USGS_ClintonEssexFranklin2014, USGS_LongIsland2014, USGS_NorthEast2011, USGS_Schoharie2014

\(n\) and \(p\)

1108 observations
- 767 training
- 341 testing
73 predictors
- n, zmean, zmean_c, max, quad_mean, quad_mean_c, cv, cv_c, z_kurt, z_skew, L2, L3, L4, L_cv, L_skew, L_kurt, h10, h20, h30, h40, h50, h60, h70, h80, h90, h95, h99, hvol, cancov, rpc1, d10, d20, d30, d40, d50, d60, d70, d80, d90, precip, tmin, tmax, twi, slope, aspect, elev, tax_code_105, tax_code_210, tax_code_240, tax_code_260, tax_code_270, tax_code_280, tax_code_312, tax_code_314, tax_code_322, tax_code_323, tax_code_910, tax_code_911, tax_code_912, tax_code_931, tax_code_932, tax_code_1000, tax_category_100, tax_category_200, tax_category_300, tax_category_900, tax_code_112, tax_code_120, tax_code_241, tax_code_321, tax_code_930, tax_code_941, tax_code_2000

Component Models

Tuning used 5-fold CV
Final hyperparameters:

Random forest:

$num.trees
[1] 1000

$mtry
[1] 18

$min.node.size
[1] 7

$sample.fraction
[1] 0.2

$splitrule
[1] "variance"

$replace
[1] TRUE

$formula
agb_mgha ~ .

LGB:

$learning_rate
[1] 0.05

$nrounds
[1] 100

$num_leaves
[1] 5

$max_depth
[1] 2

$extra_trees
[1] TRUE

$min_data_in_leaf
[1] 10

$bagging_fraction
[1] 0.3

$bagging_freq
[1] 1

$feature_fraction
[1] 0.4

$min_data_in_bin
[1] 8

$lambda_l1
[1] 5

$lambda_l2
[1] 1

$force_col_wise
[1] TRUE

SVM:

$x
agb_mgha ~ .

$kernel
[1] "laplacedot"

$type
[1] "eps-svr"

$kpar
$kpar$sigma
[1] 0.0078125


$C
[1] 12

$epsilon
[1] 1.525879e-05

Bias correction models:

Linear model:

NULL

RMSE-weighted:

NULL
