CAFRI Labs: Shrubland 1.0.1: Supersize Me

Mike Mahoney

Evaluation Results

Last iteration of these models: 2022-01-15

Change Summary

Increased the number of samples tenfold (100,000 training, 33,334 each for validation and testing).

Test Data

ROC Curves

Optimal Coordinates

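Probability thresholds were chosen on the validation set (see Validation Data, below) to reach each target level of specificity or, for "Optimize Both", to balance specificity against sensitivity. The specificity and sensitivity values in this table are what those thresholds achieve on the test set.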

Model	Probability Threshold	Specificity	Sensitivity
Optimize Both
Linear Ensemble	0.141	0.779	0.846
Neural Net	0.167	0.757	0.854
LGB	0.206	0.777	0.832
RF	0.226	0.731	0.781
90% Specificity
Linear Ensemble	0.332	0.896	0.683
Neural Net	0.366	0.895	0.669
LGB	0.370	0.898	0.655
RF	0.295	0.897	0.552
95% Specificity
Linear Ensemble	0.523	0.948	0.529
Neural Net	0.481	0.948	0.507
LGB	0.490	0.947	0.500
RF	0.338	0.948	0.404
97.5% Specificity
Linear Ensemble	0.677	0.974	0.384
Neural Net	0.579	0.973	0.372
LGB	0.602	0.975	0.348
RF	0.375	0.975	0.272
99% Specificity
Linear Ensemble	0.818	0.990	0.241
Neural Net	0.709	0.990	0.226
LGB	0.707	0.990	0.211
RF	0.409	0.990	0.146
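
One way to pick a cutoff for a target specificity can be sketched in Python as follows. This is a minimal illustration, not the code that produced the table: the function names, the toy data, and the convention that an observation is called positive when its probability is strictly above the cutoff are all assumptions.

```python
import math


def cutoff_for_specificity(probs, labels, target):
    """Return the smallest observed negative score whose use as a cutoff
    gives specificity >= target.

    An observation is predicted positive when its probability is strictly
    greater than the cutoff (an assumed convention).
    """
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")
    negatives = sorted(p for p, y in zip(probs, labels) if y == 0)
    if not negatives:
        raise ValueError("need at least one negative observation")
    # This many negatives must score at or below the cutoff.
    needed = math.ceil(target * len(negatives))
    return negatives[needed - 1]


def specificity_sensitivity(probs, labels, cutoff):
    """Specificity and sensitivity of the rule `prob > cutoff`."""
    tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p <= cutoff)
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p > cutoff)
    n_neg = sum(1 for y in labels if y == 0)
    n_pos = sum(1 for y in labels if y == 1)
    return tn / n_neg, tp / n_pos


# Toy validation data, made up for illustration only.
val_probs = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
val_labels = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]

cutoff = cutoff_for_specificity(val_probs, val_labels, target=0.80)
spec, sens = specificity_sensitivity(val_probs, val_labels, cutoff)
print(cutoff, round(spec, 3), round(sens, 3))
```

A cutoff found this way on validation data is then applied unchanged to the test set, so the test specificity can land slightly above or below the target.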

Confusion Matrices

Optimize Both

Logistic Ensemble

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 20688  1044
         1  5880  5722
                                          
               Accuracy : 0.7923          
                 95% CI : (0.7879, 0.7966)
    No Information Rate : 0.797           
    P-Value [Acc > NIR] : 0.9844          
                                          
                  Kappa : 0.493           
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8457          
            Specificity : 0.7787          
         Pos Pred Value : 0.4932          
         Neg Pred Value : 0.9520          
             Prevalence : 0.2030          
         Detection Rate : 0.1717          
   Detection Prevalence : 0.3481          
      Balanced Accuracy : 0.8122          
                                          
       'Positive' Class : 1
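
The statistics in this printout can all be recomputed from the four cell counts. The sketch below does so in Python for the matrix above, reading rows as predictions and columns as reference labels. The function name and dictionary keys are illustrative, and the formulas are the standard definitions rather than a copy of the reporting code.

```python
def confusion_stats(tn, fn, fp, tp):
    """Summary statistics for a 2x2 confusion matrix with positive class 1."""
    n = tn + fn + fp + tp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / n
    prevalence = (tp + fn) / n
    # Chance agreement for Cohen's kappa, from the row and column totals.
    expected = ((tn + fn) * (tn + fp) + (tp + fp) * (tp + fn)) / n**2
    return {
        "accuracy": accuracy,
        "no_information_rate": max(prevalence, 1 - prevalence),
        "kappa": (accuracy - expected) / (1 - expected),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "prevalence": prevalence,
        "detection_rate": tp / n,
        "detection_prevalence": (tp + fp) / n,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Cell counts from the Logistic Ensemble "Optimize Both" matrix:
# prediction 0 row -> (TN, FN), prediction 1 row -> (FP, TP).
stats = confusion_stats(tn=20688, fn=1044, fp=5880, tp=5722)
for name, value in stats.items():
    print(f"{name:>22}: {value:.4f}")
```

Accuracy (0.7923) falls just below the no-information rate (0.797), which is why the P-Value [Acc > NIR] is close to 1 at this threshold.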

Neural Net

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 20108   988
         1  6460  5778
                                         
               Accuracy : 0.7766         
                 95% CI : (0.7721, 0.781)
    No Information Rate : 0.797          
    P-Value [Acc > NIR] : 1              
                                         
                  Kappa : 0.4694         
                                         
 Mcnemar's Test P-Value : <2e-16         
                                         
            Sensitivity : 0.8540         
            Specificity : 0.7569         
         Pos Pred Value : 0.4721         
         Neg Pred Value : 0.9532         
             Prevalence : 0.2030         
         Detection Rate : 0.1733         
   Detection Prevalence : 0.3671         
      Balanced Accuracy : 0.8054         
                                         
       'Positive' Class : 1

LightGBM

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 20644  1134
         1  5924  5632
                                          
               Accuracy : 0.7883          
                 95% CI : (0.7838, 0.7926)
    No Information Rate : 0.797           
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.4822          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8324          
            Specificity : 0.7770          
         Pos Pred Value : 0.4874          
         Neg Pred Value : 0.9479          
             Prevalence : 0.2030          
         Detection Rate : 0.1690          
   Detection Prevalence : 0.3467          
      Balanced Accuracy : 0.8047          
                                          
       'Positive' Class : 1

Random Forest

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 19427  1484
         1  7141  5282
                                         
               Accuracy : 0.7413         
                 95% CI : (0.7365, 0.746)
    No Information Rate : 0.797          
    P-Value [Acc > NIR] : 1              
                                         
                  Kappa : 0.3903         
                                         
 Mcnemar's Test P-Value : <2e-16         
                                         
            Sensitivity : 0.7807         
            Specificity : 0.7312         
         Pos Pred Value : 0.4252         
         Neg Pred Value : 0.9290         
             Prevalence : 0.2030         
         Detection Rate : 0.1585         
   Detection Prevalence : 0.3727         
      Balanced Accuracy : 0.7559         
                                         
       'Positive' Class : 1

90% Specificity

Logistic Ensemble

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 23796  2148
         1  2772  4618
                                          
               Accuracy : 0.8524          
                 95% CI : (0.8485, 0.8562)
    No Information Rate : 0.797           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.559           
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.6825          
            Specificity : 0.8957          
         Pos Pred Value : 0.6249          
         Neg Pred Value : 0.9172          
             Prevalence : 0.2030          
         Detection Rate : 0.1385          
   Detection Prevalence : 0.2217          
      Balanced Accuracy : 0.7891          
                                          
       'Positive' Class : 1

Neural Net

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 23783  2239
         1  2785  4527
                                             
               Accuracy : 0.8493             
                 95% CI : (0.8454, 0.8531)   
    No Information Rate : 0.797              
    P-Value [Acc > NIR] : < 2.2e-16          
                                             
                  Kappa : 0.5478             
                                             
 Mcnemar's Test P-Value : 0.00000000000001483
                                             
            Sensitivity : 0.6691             
            Specificity : 0.8952             
         Pos Pred Value : 0.6191             
         Neg Pred Value : 0.9140             
             Prevalence : 0.2030             
         Detection Rate : 0.1358             
   Detection Prevalence : 0.2194             
      Balanced Accuracy : 0.7821             
                                             
       'Positive' Class : 1

LightGBM

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 23851  2331
         1  2717  4435
                                          
               Accuracy : 0.8486          
                 95% CI : (0.8447, 0.8524)
    No Information Rate : 0.797           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5417          
                                          
 Mcnemar's Test P-Value : 0.00000006001   
                                          
            Sensitivity : 0.6555          
            Specificity : 0.8977          
         Pos Pred Value : 0.6201          
         Neg Pred Value : 0.9110          
             Prevalence : 0.2030          
         Detection Rate : 0.1330          
   Detection Prevalence : 0.2146          
      Balanced Accuracy : 0.7766          
                                          
       'Positive' Class : 1

Random Forest

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 23823  3033
         1  2745  3733
                                          
               Accuracy : 0.8267          
                 95% CI : (0.8226, 0.8307)
    No Information Rate : 0.797           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.4556          
                                          
 Mcnemar's Test P-Value : 0.0001596       
                                          
            Sensitivity : 0.5517          
            Specificity : 0.8967          
         Pos Pred Value : 0.5763          
         Neg Pred Value : 0.8871          
             Prevalence : 0.2030          
         Detection Rate : 0.1120          
   Detection Prevalence : 0.1943          
      Balanced Accuracy : 0.7242          
                                          
       'Positive' Class : 1

95% Specificity

Logistic Ensemble

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 25194  3185
         1  1374  3581
                                          
               Accuracy : 0.8632          
                 95% CI : (0.8595, 0.8669)
    No Information Rate : 0.797           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5305          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.5293          
            Specificity : 0.9483          
         Pos Pred Value : 0.7227          
         Neg Pred Value : 0.8878          
             Prevalence : 0.2030          
         Detection Rate : 0.1074          
   Detection Prevalence : 0.1486          
      Balanced Accuracy : 0.7388          
                                          
       'Positive' Class : 1

Neural Net

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 25177  3339
         1  1391  3427
                                          
               Accuracy : 0.8581          
                 95% CI : (0.8543, 0.8618)
    No Information Rate : 0.797           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5087          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.5065          
            Specificity : 0.9476          
         Pos Pred Value : 0.7113          
         Neg Pred Value : 0.8829          
             Prevalence : 0.2030          
         Detection Rate : 0.1028          
   Detection Prevalence : 0.1445          
      Balanced Accuracy : 0.7271          
                                          
       'Positive' Class : 1

LightGBM

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 25173  3381
         1  1395  3385
                                          
               Accuracy : 0.8567          
                 95% CI : (0.8529, 0.8605)
    No Information Rate : 0.797           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5028          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.5003          
            Specificity : 0.9475          
         Pos Pred Value : 0.7082          
         Neg Pred Value : 0.8816          
             Prevalence : 0.2030          
         Detection Rate : 0.1015          
   Detection Prevalence : 0.1434          
      Balanced Accuracy : 0.7239          
                                          
       'Positive' Class : 1

Random Forest

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 25176  4034
         1  1392  2732
                                          
               Accuracy : 0.8372          
                 95% CI : (0.8332, 0.8412)
    No Information Rate : 0.797           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.4112          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.40378         
            Specificity : 0.94761         
         Pos Pred Value : 0.66246         
         Neg Pred Value : 0.86190         
             Prevalence : 0.20298         
         Detection Rate : 0.08196         
   Detection Prevalence : 0.12372         
      Balanced Accuracy : 0.67569         
                                          
       'Positive' Class : 1

97.5% Specificity

Logistic Ensemble

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 25884  4169
         1   684  2597
                                          
               Accuracy : 0.8544          
                 95% CI : (0.8506, 0.8582)
    No Information Rate : 0.797           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.4431          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.38383         
            Specificity : 0.97425         
         Pos Pred Value : 0.79153         
         Neg Pred Value : 0.86128         
             Prevalence : 0.20298         
         Detection Rate : 0.07791         
   Detection Prevalence : 0.09843         
      Balanced Accuracy : 0.67904         
                                          
       'Positive' Class : 1

Neural Net

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 25848  4251
         1   720  2515
                                         
               Accuracy : 0.8509         
                 95% CI : (0.847, 0.8547)
    No Information Rate : 0.797          
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.4278         
                                         
 Mcnemar's Test P-Value : < 2.2e-16      
                                         
            Sensitivity : 0.37171        
            Specificity : 0.97290        
         Pos Pred Value : 0.77743        
         Neg Pred Value : 0.85877        
             Prevalence : 0.20298        
         Detection Rate : 0.07545        
   Detection Prevalence : 0.09705        
      Balanced Accuracy : 0.67231        
                                         
       'Positive' Class : 1

LightGBM

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 25907  4413
         1   661  2353
                                          
               Accuracy : 0.8478          
                 95% CI : (0.8439, 0.8516)
    No Information Rate : 0.797           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.407           
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.34777         
            Specificity : 0.97512         
         Pos Pred Value : 0.78069         
         Neg Pred Value : 0.85445         
             Prevalence : 0.20298         
         Detection Rate : 0.07059         
   Detection Prevalence : 0.09042         
      Balanced Accuracy : 0.66144         
                                          
       'Positive' Class : 1

Random Forest

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 25917  4927
         1   651  1839
                                          
               Accuracy : 0.8327          
                 95% CI : (0.8286, 0.8367)
    No Information Rate : 0.797           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3235          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.27180         
            Specificity : 0.97550         
         Pos Pred Value : 0.73855         
         Neg Pred Value : 0.84026         
             Prevalence : 0.20298         
         Detection Rate : 0.05517         
   Detection Prevalence : 0.07470         
      Balanced Accuracy : 0.62365         
                                          
       'Positive' Class : 1

99% Specificity

Logistic Ensemble

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 26303  5134
         1   265  1632
                                        
               Accuracy : 0.838         
                 95% CI : (0.834, 0.842)
    No Information Rate : 0.797         
    P-Value [Acc > NIR] : < 2.2e-16     
                                        
                  Kappa : 0.316         
                                        
 Mcnemar's Test P-Value : < 2.2e-16     
                                        
            Sensitivity : 0.24121       
            Specificity : 0.99003       
         Pos Pred Value : 0.86031       
         Neg Pred Value : 0.83669       
             Prevalence : 0.20298       
         Detection Rate : 0.04896       
   Detection Prevalence : 0.05691       
      Balanced Accuracy : 0.61562       
                                        
       'Positive' Class : 1

Neural Net

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 26300  5234
         1   268  1532
                                          
               Accuracy : 0.8349          
                 95% CI : (0.8309, 0.8389)
    No Information Rate : 0.797           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.2978          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.22643         
            Specificity : 0.98991         
         Pos Pred Value : 0.85111         
         Neg Pred Value : 0.83402         
             Prevalence : 0.20298         
         Detection Rate : 0.04596         
   Detection Prevalence : 0.05400         
      Balanced Accuracy : 0.60817         
                                          
       'Positive' Class : 1

LightGBM

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 26311  5338
         1   257  1428
                                          
               Accuracy : 0.8322          
                 95% CI : (0.8281, 0.8362)
    No Information Rate : 0.797           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.2796          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.21106         
            Specificity : 0.99033         
         Pos Pred Value : 0.84748         
         Neg Pred Value : 0.83134         
             Prevalence : 0.20298         
         Detection Rate : 0.04284         
   Detection Prevalence : 0.05055         
      Balanced Accuracy : 0.60069         
                                          
       'Positive' Class : 1

Random Forest

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 26311  5780
         1   257   986
                                         
               Accuracy : 0.8189         
                 95% CI : (0.8147, 0.823)
    No Information Rate : 0.797          
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.1955         
                                         
 Mcnemar's Test P-Value : < 2.2e-16      
                                         
            Sensitivity : 0.14573        
            Specificity : 0.99033        
         Pos Pred Value : 0.79324        
         Neg Pred Value : 0.81989        
             Prevalence : 0.20298        
         Detection Rate : 0.02958        
   Detection Prevalence : 0.03729        
      Balanced Accuracy : 0.56803        
                                         
       'Positive' Class : 1

Validation Data

ROC Curves

Optimal Coordinates

Model	Probability Threshold	Specificity	Sensitivity
Optimize Both
Linear Ensemble	0.141	0.785	0.843
Neural Net	0.167	0.760	0.856
LGB	0.206	0.779	0.829
RF	0.226	0.730	0.776
90% Specificity
Linear Ensemble	0.332	0.900	0.673
Neural Net	0.366	0.900	0.665
LGB	0.370	0.900	0.638
RF	0.295	0.900	0.536
95% Specificity
Linear Ensemble	0.523	0.950	0.521
Neural Net	0.481	0.950	0.507
LGB	0.490	0.950	0.485
RF	0.338	0.950	0.398
97.5% Specificity
Linear Ensemble	0.677	0.975	0.381
Neural Net	0.579	0.975	0.365
LGB	0.602	0.975	0.339
RF	0.375	0.975	0.270
99% Specificity
Linear Ensemble	0.818	0.990	0.237
Neural Net	0.709	0.990	0.225
LGB	0.707	0.990	0.205
RF	0.409	0.990	0.142

Metadata

Data

166668 observations
- 100000 training
- 33334 validation
- 33334 testing
19 predictors
- tcb, tcw, tcg, nbr, mag, yod, nys_precip, nys_tmax, nys_tmin, nys_aspect, nys_dem, nys_slope, nys_twi, lcsec_X2, lcsec_X3, lcsec_X4, lcsec_X5, lcsec_X6, lcsec_X8

Models

Hyperparameters were tuned with 5-fold cross-validation.
Final models:

Logistic Regression


Call:  glm(formula = shrub ~ ., family = "binomial", data = validation)

Coefficients:
(Intercept)          tcb          tcw          tcg          nbr  
-3.11032321   0.00027987   0.00073321  -0.00006364  -0.00157420  
        mag          yod   nys_precip     nys_tmax     nys_tmin  
 0.00036542  -0.00002006  -0.00020836   0.12872106  -0.08899034  
 nys_aspect      nys_dem    nys_slope      nys_twi     lcsec_X2  
 0.00014450   0.00032340  -0.06009474  -0.13512527   0.09216093  
   lcsec_X3     lcsec_X4     lcsec_X5     lcsec_X6     lcsec_X8  
 0.21841891   0.23121761   0.40354893   2.97905223   0.29797653  
        lgb           rf         nnet  
 2.65936070   1.14082510   4.23373004  

Degrees of Freedom: 33333 Total (i.e. Null);  33311 Residual
Null Deviance:      33560 
Residual Deviance: 21150    AIC: 21200
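
To make the stacking step concrete, the Python sketch below applies the printed coefficients by hand: it forms the linear predictor and passes it through the logistic function. It assumes that the lgb, rf, and nnet columns hold each base model's predicted probability and that any predictor left out of the input is zero; both are simplifications for illustration, and the function name is hypothetical.

```python
import math

INTERCEPT = -3.11032321

# Coefficients copied from the glm printout above.
COEFFICIENTS = {
    "tcb": 0.00027987, "tcw": 0.00073321, "tcg": -0.00006364,
    "nbr": -0.00157420, "mag": 0.00036542, "yod": -0.00002006,
    "nys_precip": -0.00020836, "nys_tmax": 0.12872106,
    "nys_tmin": -0.08899034, "nys_aspect": 0.00014450,
    "nys_dem": 0.00032340, "nys_slope": -0.06009474,
    "nys_twi": -0.13512527, "lcsec_X2": 0.09216093,
    "lcsec_X3": 0.21841891, "lcsec_X4": 0.23121761,
    "lcsec_X5": 0.40354893, "lcsec_X6": 2.97905223,
    "lcsec_X8": 0.29797653,
    "lgb": 2.65936070, "rf": 1.14082510, "nnet": 4.23373004,
}


def ensemble_probability(features):
    """Logistic-regression prediction from a dict of feature values.

    Features missing from the dict are treated as 0 (illustration only).
    """
    linear_predictor = INTERCEPT + sum(
        coef * features.get(name, 0.0) for name, coef in COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical input: every base model predicts 0.5, all else zero.
baseline = ensemble_probability({})
even_odds = ensemble_probability({"lgb": 0.5, "rf": 0.5, "nnet": 0.5})
print(round(baseline, 4), round(even_odds, 4))
```

Among the three base-model inputs, nnet carries the largest coefficient and rf the smallest, so under this reading the ensemble leans most heavily on the neural net's prediction.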

Neural net

Model
Model: "sequential"
______________________________________________________________________
 Layer (type)                  Output Shape                Param #    
======================================================================
 dense_features (DenseFeatures  multiple                   0          
 )                                                                    
                                                                      
 dense_5 (Dense)               multiple                    5120       
                                                                      
 dense_4 (Dense)               multiple                    32896      
                                                                      
 dense_3 (Dense)               multiple                    8256       
                                                                      
 dense_2 (Dense)               multiple                    2080       
                                                                      
 dense_1 (Dense)               multiple                    528        
                                                                      
 dropout (Dropout)             multiple                    0          
                                                                      
 dense (Dense)                 multiple                    17         
                                                                      
======================================================================
Total params: 48,897
Trainable params: 48,897
Non-trainable params: 0
______________________________________________________________________
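
The summary lists parameter counts but not layer widths. Since a dense layer with a bias term has (inputs + 1) x units parameters, the widths can be inferred from the counts. The Python sketch below checks one consistent reading: 19 numeric inputs feeding layers of 256, 128, 64, 32, 16, and 1 units. These widths are an inference from the arithmetic, not something the printout states.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return (n_inputs + 1) * n_units


N_PREDICTORS = 19                          # from the Data section
INFERRED_UNITS = [256, 128, 64, 32, 16, 1]  # assumption, inferred from counts
REPORTED_PARAMS = [5120, 32896, 8256, 2080, 528, 17]  # dense_5 ... dense

computed = []
n_inputs = N_PREDICTORS
for n_units in INFERRED_UNITS:
    computed.append(dense_params(n_inputs, n_units))
    n_inputs = n_units  # the next layer consumes this layer's outputs

total = sum(computed)
print(computed, total)
```

The dropout layer adds no parameters, so the six dense layers account for the whole total.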

Random Forest

$num.trees
[1] 3000

$mtry
[1] 1

$min.node.size
[1] 6

$replace
[1] TRUE

$sample.fraction
[1] 0.2

$formula
shrub ~ .

LightGBM

$params
$params$learning_rate
[1] 0.01

$params$nrounds
[1] 2500

$params$num_leaves
[1] 14

$params$max_depth
[1] -1

$params$extra_trees
[1] FALSE

$params$min_data_in_leaf
[1] 10

$params$bagging_fraction
[1] 0.5

$params$bagging_freq
[1] 1

$params$feature_fraction
[1] 0.9

$params$min_data_in_bin
[1] 3

$params$lambda_l1
[1] 0

$params$lambda_l2
[1] 0.5

$params$force_col_wise
[1] TRUE
