CAFRI Labs: Shrubland 1.0.2: Balanced Diet

Mike Mahoney

Evaluation Results

Last iteration of these models: 2022-01-16

Change Summary

Version 1.0.1 accidentally used an unbalanced sample (only 20% shrubland). This version corrects that by enforcing a 50/50 split between shrubland and non-shrubland observations.
Increased the sample size: training now uses 250,000 observations, and validation and testing use 83,334 each (roughly 60/20/20).
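
The rebalancing and split sizes described above can be illustrated with a short sketch. This is not the sampling code used for these models; the function name `balanced_sample` and the toy label vector are assumptions made for illustration, and only the three split sizes come from this report.

```python
import random


def balanced_sample(labels, n_per_class, seed=0):
    """Return shuffled indices holding n_per_class draws from class 0 and class 1."""
    rng = random.Random(seed)
    picked = []
    for cls in (0, 1):
        members = [i for i, y in enumerate(labels) if y == cls]
        picked.extend(rng.sample(members, n_per_class))
    rng.shuffle(picked)
    return picked


# Toy population with 20% positives, mirroring the imbalance reported for 1.0.1.
labels = [1] * 200 + [0] * 800
sample = balanced_sample(labels, n_per_class=150)
share_positive = sum(labels[i] for i in sample) / len(sample)

# Split sizes reported in this document.
n_train, n_val, n_test = 250_000, 83_334, 83_334
total = n_train + n_val + n_test
fractions = [round(n / total, 3) for n in (n_train, n_val, n_test)]
```

Drawing the same number of indices from each class gives an exact 50/50 sample regardless of the imbalance in the source population, and the reported split sizes work out to 60/20/20 to three decimal places.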

Test Data

ROC Curves

Optimal Coordinates

Probability thresholds were chosen on the validation set (below) to reach each target level of specificity. The table reports how those same thresholds perform on the test set.

Model	Probability Threshold	Specificity	Sensitivity
Optimize Both
Linear Ensemble	0.489	0.780	0.842
Neural Net	0.494	0.774	0.830
LGB	0.498	0.776	0.837
RF	0.519	0.752	0.767
90% Specificity
Linear Ensemble	0.755	0.900	0.659
Neural Net	0.704	0.901	0.618
LGB	0.699	0.900	0.644
RF	0.602	0.900	0.537
95% Specificity
Linear Ensemble	0.840	0.949	0.496
Neural Net	0.791	0.950	0.441
LGB	0.787	0.949	0.485
RF	0.648	0.951	0.379
97.5% Specificity
Linear Ensemble	0.881	0.975	0.348
Neural Net	0.843	0.975	0.303
LGB	0.845	0.976	0.342
RF	0.678	0.975	0.262
99% Specificity
Linear Ensemble	0.907	0.989	0.218
Neural Net	0.886	0.989	0.185
LGB	0.889	0.990	0.208
RF	0.704	0.990	0.135
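
The report does not say how each threshold was located, so the following is only a minimal sketch of one way to pick a cut-off for a target specificity on validation data. It assumes that scores at or above the cut-off are predicted positive; the function names and the toy scores are hypothetical.

```python
def specificity(scores, labels, threshold):
    """Share of true negatives (label 0) scored below the threshold."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    return sum(s < threshold for s in negatives) / len(negatives)


def sensitivity(scores, labels, threshold):
    """Share of true positives (label 1) scored at or above the threshold."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s >= threshold for s in positives) / len(positives)


def threshold_for_specificity(scores, labels, target):
    """Lowest observed score that, used as a cut-off, reaches the target specificity."""
    for candidate in sorted(set(scores)):
        if specificity(scores, labels, candidate) >= target:
            return candidate
    return float("inf")  # no observed score is strict enough; predict no positives


# Made-up validation scores and labels, for illustration only.
val_scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
val_labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

cutoff = threshold_for_specificity(val_scores, val_labels, target=0.80)
val_sensitivity = sensitivity(val_scores, val_labels, cutoff)
```

Taking the lowest cut-off that meets the target keeps as many true positives as possible, which is why sensitivity falls steadily as the specificity target rises in the table above.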

Confusion Matrices

Optimize Both

Logistic Ensemble

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 32680  6537
         1  9211 34906
                                          
               Accuracy : 0.811           
                 95% CI : (0.8084, 0.8137)
    No Information Rate : 0.5027          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6222          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.8423          
            Specificity : 0.7801          
         Pos Pred Value : 0.7912          
         Neg Pred Value : 0.8333          
             Prevalence : 0.4973          
         Detection Rate : 0.4189          
   Detection Prevalence : 0.5294          
      Balanced Accuracy : 0.8112          
                                          
       'Positive' Class : 1
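
Every statistic in a block like the one above follows from the four cell counts. As a worked check, the sketch below recomputes the main ones in plain Python from the Logistic Ensemble counts, with class 1 as the positive class. The function name `confusion_stats` is an assumption for illustration and is not part of the original analysis.

```python
def confusion_stats(tn, fn, fp, tp):
    """Summary statistics for a 2x2 confusion matrix with class 1 as positive."""
    n = tn + fn + fp + tp
    accuracy = (tp + tn) / n
    sens = tp / (tp + fn)  # true positives among reference positives
    spec = tn / (tn + fp)  # true negatives among reference negatives
    # Agreement expected by chance, from the row and column totals (Cohen's kappa).
    expected = ((tn + fn) * (tn + fp) + (tp + fp) * (tp + fn)) / n**2
    return {
        "accuracy": accuracy,
        "kappa": (accuracy - expected) / (1 - expected),
        "sensitivity": sens,
        "specificity": spec,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "prevalence": (tp + fn) / n,
        "balanced_accuracy": (sens + spec) / 2,
    }


# Rows are predictions, columns are reference labels, as in the printout above.
stats = confusion_stats(tn=32680, fn=6537, fp=9211, tp=34906)
for name, value in stats.items():
    print(f"{name:>18}: {value:.4f}")
```

The same function can be applied to any of the other matrices in this section by reading the counts off in the same order.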

Neural Net

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 32428  7031
         1  9463 34412
                                          
               Accuracy : 0.8021          
                 95% CI : (0.7994, 0.8048)
    No Information Rate : 0.5027          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6043          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.8303          
            Specificity : 0.7741          
         Pos Pred Value : 0.7843          
         Neg Pred Value : 0.8218          
             Prevalence : 0.4973          
         Detection Rate : 0.4129          
   Detection Prevalence : 0.5265          
      Balanced Accuracy : 0.8022          
                                          
       'Positive' Class : 1

LightGBM

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 32496  6761
         1  9395 34682
                                          
               Accuracy : 0.8061          
                 95% CI : (0.8034, 0.8088)
    No Information Rate : 0.5027          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6124          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.8369          
            Specificity : 0.7757          
         Pos Pred Value : 0.7869          
         Neg Pred Value : 0.8278          
             Prevalence : 0.4973          
         Detection Rate : 0.4162          
   Detection Prevalence : 0.5289          
      Balanced Accuracy : 0.8063          
                                          
       'Positive' Class : 1

Random Forest

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 31522  9646
         1 10369 31797
                                          
               Accuracy : 0.7598          
                 95% CI : (0.7569, 0.7627)
    No Information Rate : 0.5027          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5197          
                                          
 Mcnemar's Test P-Value : 0.0000003336    
                                          
            Sensitivity : 0.7672          
            Specificity : 0.7525          
         Pos Pred Value : 0.7541          
         Neg Pred Value : 0.7657          
             Prevalence : 0.4973          
         Detection Rate : 0.3816          
   Detection Prevalence : 0.5060          
      Balanced Accuracy : 0.7599          
                                          
       'Positive' Class : 1

90% Specificity

Logistic Ensemble

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 37699 14144
         1  4192 27299
                                          
               Accuracy : 0.78            
                 95% CI : (0.7771, 0.7828)
    No Information Rate : 0.5027          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5594          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.6587          
            Specificity : 0.8999          
         Pos Pred Value : 0.8669          
         Neg Pred Value : 0.7272          
             Prevalence : 0.4973          
         Detection Rate : 0.3276          
   Detection Prevalence : 0.3779          
      Balanced Accuracy : 0.7793          
                                          
       'Positive' Class : 1

Neural Net

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 37760 15817
         1  4131 25626
                                          
               Accuracy : 0.7606          
                 95% CI : (0.7577, 0.7635)
    No Information Rate : 0.5027          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5205          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.6183          
            Specificity : 0.9014          
         Pos Pred Value : 0.8612          
         Neg Pred Value : 0.7048          
             Prevalence : 0.4973          
         Detection Rate : 0.3075          
   Detection Prevalence : 0.3571          
      Balanced Accuracy : 0.7599          
                                          
       'Positive' Class : 1

LightGBM

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 37689 14739
         1  4202 26704
                                          
               Accuracy : 0.7727          
                 95% CI : (0.7698, 0.7756)
    No Information Rate : 0.5027          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5448          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.6444          
            Specificity : 0.8997          
         Pos Pred Value : 0.8640          
         Neg Pred Value : 0.7189          
             Prevalence : 0.4973          
         Detection Rate : 0.3204          
   Detection Prevalence : 0.3709          
      Balanced Accuracy : 0.7720          
                                          
       'Positive' Class : 1

Random Forest

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 37698 19200
         1  4193 22243
                                          
               Accuracy : 0.7193          
                 95% CI : (0.7162, 0.7223)
    No Information Rate : 0.5027          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.4375          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.5367          
            Specificity : 0.8999          
         Pos Pred Value : 0.8414          
         Neg Pred Value : 0.6626          
             Prevalence : 0.4973          
         Detection Rate : 0.2669          
   Detection Prevalence : 0.3172          
      Balanced Accuracy : 0.7183          
                                          
       'Positive' Class : 1

95% Specificity

Logistic Ensemble

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 39768 20907
         1  2123 20536
                                          
               Accuracy : 0.7236          
                 95% CI : (0.7206, 0.7267)
    No Information Rate : 0.5027          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.4459          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.4955          
            Specificity : 0.9493          
         Pos Pred Value : 0.9063          
         Neg Pred Value : 0.6554          
             Prevalence : 0.4973          
         Detection Rate : 0.2464          
   Detection Prevalence : 0.2719          
      Balanced Accuracy : 0.7224          
                                          
       'Positive' Class : 1

Neural Net

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 39809 23150
         1  2082 18293
                                          
               Accuracy : 0.6972          
                 95% CI : (0.6941, 0.7003)
    No Information Rate : 0.5027          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3928          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.4414          
            Specificity : 0.9503          
         Pos Pred Value : 0.8978          
         Neg Pred Value : 0.6323          
             Prevalence : 0.4973          
         Detection Rate : 0.2195          
   Detection Prevalence : 0.2445          
      Balanced Accuracy : 0.6959          
                                          
       'Positive' Class : 1

LightGBM

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 39770 21356
         1  2121 20087
                                          
               Accuracy : 0.7183          
                 95% CI : (0.7152, 0.7213)
    No Information Rate : 0.5027          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.4351          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.4847          
            Specificity : 0.9494          
         Pos Pred Value : 0.9045          
         Neg Pred Value : 0.6506          
             Prevalence : 0.4973          
         Detection Rate : 0.2410          
   Detection Prevalence : 0.2665          
      Balanced Accuracy : 0.7170          
                                          
       'Positive' Class : 1

Random Forest

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 39843 25733
         1  2048 15710
                                          
               Accuracy : 0.6666          
                 95% CI : (0.6634, 0.6698)
    No Information Rate : 0.5027          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3312          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.3791          
            Specificity : 0.9511          
         Pos Pred Value : 0.8847          
         Neg Pred Value : 0.6076          
             Prevalence : 0.4973          
         Detection Rate : 0.1885          
   Detection Prevalence : 0.2131          
      Balanced Accuracy : 0.6651          
                                          
       'Positive' Class : 1

97.5% Specificity

Logistic Ensemble

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 40858 27010
         1  1033 14433
                                          
               Accuracy : 0.6635          
                 95% CI : (0.6603, 0.6667)
    No Information Rate : 0.5027          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3247          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.3483          
            Specificity : 0.9753          
         Pos Pred Value : 0.9332          
         Neg Pred Value : 0.6020          
             Prevalence : 0.4973          
         Detection Rate : 0.1732          
   Detection Prevalence : 0.1856          
      Balanced Accuracy : 0.6618          
                                          
       'Positive' Class : 1

Neural Net

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 40839 28905
         1  1052 12538
                                          
               Accuracy : 0.6405          
                 95% CI : (0.6373, 0.6438)
    No Information Rate : 0.5027          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.2784          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.3025          
            Specificity : 0.9749          
         Pos Pred Value : 0.9226          
         Neg Pred Value : 0.5856          
             Prevalence : 0.4973          
         Detection Rate : 0.1505          
   Detection Prevalence : 0.1631          
      Balanced Accuracy : 0.6387          
                                          
       'Positive' Class : 1

LightGBM

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 40895 27288
         1   996 14155
                                          
               Accuracy : 0.6606          
                 95% CI : (0.6574, 0.6638)
    No Information Rate : 0.5027          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3189          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.3416          
            Specificity : 0.9762          
         Pos Pred Value : 0.9343          
         Neg Pred Value : 0.5998          
             Prevalence : 0.4973          
         Detection Rate : 0.1699          
   Detection Prevalence : 0.1818          
      Balanced Accuracy : 0.6589          
                                          
       'Positive' Class : 1

Random Forest

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 40830 30566
         1  1061 10877
                                          
               Accuracy : 0.6205          
                 95% CI : (0.6172, 0.6238)
    No Information Rate : 0.5027          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.238           
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.2625          
            Specificity : 0.9747          
         Pos Pred Value : 0.9111          
         Neg Pred Value : 0.5719          
             Prevalence : 0.4973          
         Detection Rate : 0.1305          
   Detection Prevalence : 0.1433          
      Balanced Accuracy : 0.6186          
                                          
       'Positive' Class : 1

99% Specificity

Logistic Ensemble

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 41437 32415
         1   454  9028
                                          
               Accuracy : 0.6056          
                 95% CI : (0.6022, 0.6089)
    No Information Rate : 0.5027          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.2079          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.2178          
            Specificity : 0.9892          
         Pos Pred Value : 0.9521          
         Neg Pred Value : 0.5611          
             Prevalence : 0.4973          
         Detection Rate : 0.1083          
   Detection Prevalence : 0.1138          
      Balanced Accuracy : 0.6035          
                                          
       'Positive' Class : 1

Neural Net

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 41415 33767
         1   476  7676
                                          
               Accuracy : 0.5891          
                 95% CI : (0.5857, 0.5924)
    No Information Rate : 0.5027          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.1746          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.18522         
            Specificity : 0.98864         
         Pos Pred Value : 0.94161         
         Neg Pred Value : 0.55086         
             Prevalence : 0.49731         
         Detection Rate : 0.09211         
   Detection Prevalence : 0.09782         
      Balanced Accuracy : 0.58693         
                                          
       'Positive' Class : 1

LightGBM

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 41487 32807
         1   404  8636
                                          
               Accuracy : 0.6015          
                 95% CI : (0.5981, 0.6048)
    No Information Rate : 0.5027          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.1996          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.2084          
            Specificity : 0.9904          
         Pos Pred Value : 0.9553          
         Neg Pred Value : 0.5584          
             Prevalence : 0.4973          
         Detection Rate : 0.1036          
   Detection Prevalence : 0.1085          
      Balanced Accuracy : 0.5994          
                                          
       'Positive' Class : 1

Random Forest

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 41471 35857
         1   420  5586
                                         
               Accuracy : 0.5647         
                 95% CI : (0.5613, 0.568)
    No Information Rate : 0.5027         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.1253         
                                         
 Mcnemar's Test P-Value : < 2.2e-16      
                                         
            Sensitivity : 0.13479        
            Specificity : 0.98997        
         Pos Pred Value : 0.93007        
         Neg Pred Value : 0.53630        
             Prevalence : 0.49731        
         Detection Rate : 0.06703        
   Detection Prevalence : 0.07207        
      Balanced Accuracy : 0.56238        
                                         
       'Positive' Class : 1

Validation Data

ROC Curves

Optimal Coordinates

Model	Probability Threshold	Specificity	Sensitivity
Optimize Both
Linear Ensemble	0.489	0.779	0.844
Neural Net	0.494	0.772	0.830
LGB	0.498	0.774	0.838
RF	0.519	0.753	0.769
90% Specificity
Linear Ensemble	0.755	0.900	0.655
Neural Net	0.704	0.900	0.616
LGB	0.699	0.900	0.640
RF	0.602	0.900	0.536
95% Specificity
Linear Ensemble	0.840	0.950	0.493
Neural Net	0.791	0.950	0.439
LGB	0.787	0.950	0.478
RF	0.648	0.950	0.376
97.5% Specificity
Linear Ensemble	0.881	0.975	0.344
Neural Net	0.843	0.975	0.298
LGB	0.845	0.975	0.337
RF	0.678	0.975	0.257
99% Specificity
Linear Ensemble	0.907	0.990	0.213
Neural Net	0.886	0.990	0.181
LGB	0.889	0.990	0.205
RF	0.704	0.990	0.133

Metadata

Data

416668 observations
- 250000 training
- 83334 validation
- 83334 testing
19 predictors
- tcb, tcw, tcg, nbr, mag, yod, nys_precip, nys_tmax, nys_tmin, nys_aspect, nys_dem, nys_slope, nys_twi, lcsec_X2, lcsec_X3, lcsec_X4, lcsec_X5, lcsec_X6, lcsec_X8

Models

Hyperparameters were tuned using 5-fold cross-validation.
Final models:

Logistic Regression


Call:  glm(formula = shrub ~ ., family = "binomial", data = validation)

Coefficients:
(Intercept)          tcb          tcw          tcg          nbr  
 -4.0528297    0.0004629    0.0001641   -0.0005451    0.0005373  
        mag          yod   nys_precip     nys_tmax     nys_tmin  
 -0.0005495    0.0001269    0.0002575    0.0468812   -0.0569427  
 nys_aspect      nys_dem    nys_slope      nys_twi     lcsec_X2  
  0.0003939   -0.0001579   -0.0046514    0.0113271   -0.1513639  
   lcsec_X3     lcsec_X4     lcsec_X5     lcsec_X6     lcsec_X8  
  0.0045767   -0.0064030   -0.0359777   -0.1350724    0.3104493  
        lgb           rf         nnet  
  4.2314516   -1.3507346    2.6849262  

Degrees of Freedom: 83333 Total (i.e. Null);  83311 Residual
Null Deviance:      115500 
Residual Deviance: 69130    AIC: 69180
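
Because the ensemble is a binomial glm, its prediction is the inverse logit of the linear predictor built from the coefficients above. The sketch below shows that calculation. The coefficients are copied from the printout, but the input values are hypothetical, and treating unlisted predictors as zero is a simplification for illustration rather than a realistic observation.

```python
import math

# Coefficients copied from the glm() printout above.
COEF = {
    "(Intercept)": -4.0528297,
    "tcb": 0.0004629, "tcw": 0.0001641, "tcg": -0.0005451, "nbr": 0.0005373,
    "mag": -0.0005495, "yod": 0.0001269, "nys_precip": 0.0002575,
    "nys_tmax": 0.0468812, "nys_tmin": -0.0569427, "nys_aspect": 0.0003939,
    "nys_dem": -0.0001579, "nys_slope": -0.0046514, "nys_twi": 0.0113271,
    "lcsec_X2": -0.1513639, "lcsec_X3": 0.0045767, "lcsec_X4": -0.0064030,
    "lcsec_X5": -0.0359777, "lcsec_X6": -0.1350724, "lcsec_X8": 0.3104493,
    "lgb": 4.2314516, "rf": -1.3507346, "nnet": 2.6849262,
}


def ensemble_probability(x):
    """Inverse logit of the linear predictor; predictors absent from x count as 0."""
    eta = COEF["(Intercept)"] + sum(COEF[name] * value for name, value in x.items())
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical component-model outputs for two observations.
low = ensemble_probability({"lgb": 0.2, "rf": 0.4, "nnet": 0.2})
high = ensemble_probability({"lgb": 0.8, "rf": 0.6, "nnet": 0.8})
```

The LightGBM and neural net outputs carry the largest positive weights, while the random forest output enters with a negative coefficient, so raising it alone lowers the ensemble probability.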

Neural net

Model
Model: "sequential"
______________________________________________________________________
 Layer (type)                  Output Shape                Param #    
======================================================================
 dense_features (DenseFeatures  multiple                   0          
 )                                                                    
                                                                      
 dense_5 (Dense)               multiple                    5120       
                                                                      
 dense_4 (Dense)               multiple                    32896      
                                                                      
 dense_3 (Dense)               multiple                    8256       
                                                                      
 dense_2 (Dense)               multiple                    2080       
                                                                      
 dense_1 (Dense)               multiple                    528        
                                                                      
 dropout (Dropout)             multiple                    0          
                                                                      
 dense (Dense)                 multiple                    17         
                                                                      
======================================================================
Total params: 48,897
Trainable params: 48,897
Non-trainable params: 0
______________________________________________________________________
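
The summary above lists parameter counts but not layer widths. A fully connected layer with a bias term has (inputs + 1) x units parameters, and the counts are consistent with 19 inputs feeding layers of 256, 128, 64, 32, 16 and 1 units. Those widths are inferred from the counts, not stated in the report, and the check below is only a sketch of that inference.

```python
def dense_params(n_in, n_out):
    """Weights plus one bias per unit for a fully connected layer."""
    return (n_in + 1) * n_out


# Inferred widths: 19 predictors in, one probability out (an assumption).
widths = [19, 256, 128, 64, 32, 16, 1]
counts = [dense_params(a, b) for a, b in zip(widths, widths[1:])]
total = sum(counts)

# Parameter counts as printed for dense_5, dense_4, dense_3, dense_2, dense_1, dense.
reported = [5120, 32896, 8256, 2080, 528, 17]
```

The dropout and feature layers contribute no parameters, so the six dense layers account for the full reported total.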

Random Forest

$num.trees
[1] 3000

$mtry
[1] 1

$min.node.size
[1] 6

$replace
[1] TRUE

$sample.fraction
[1] 0.2

$formula
shrub ~ .

LightGBM

$params
$params$learning_rate
[1] 0.01

$params$nrounds
[1] 2500

$params$num_leaves
[1] 14

$params$max_depth
[1] -1

$params$extra_trees
[1] FALSE

$params$min_data_in_leaf
[1] 10

$params$bagging_fraction
[1] 0.5

$params$bagging_freq
[1] 1

$params$feature_fraction
[1] 0.9

$params$min_data_in_bin
[1] 3

$params$lambda_l1
[1] 0

$params$lambda_l2
[1] 0.5

$params$force_col_wise
[1] TRUE
