A statistical learning framework for groundwater nitrate models of the Central Valley, California
U.S. Geological Survey
Nitrate is one of the most common anthropogenic contaminants in domestic well water and exceeds the maximum contaminant level of 10 mg/L as N in many wells of the Central Valley, California. Empirical models are commonly used to estimate nitrate potential in groundwater and can benefit resource managers by identifying the most vulnerable areas. Linear regression and classification methods have been popular choices for estimating nitrate impacts on groundwater. However, such methods can be hampered by hypothesis testing assumptions such as linear and monotonic responses and normally distributed data. Machine learning methods are promising alternatives that dispense with traditional hypothesis testing. For example, tree-based methods do not require data transformation, can fit nonlinear and non-monotonic relations, and automatically incorporate interactions among predictor variables. However, such methods are prone to overfitting, which causes the models to predict poorly on new data (i.e., samples not used in model training). We used a statistical learning framework to optimize the predictive performance of three machine-learning methods for Central Valley groundwater nitrate data: boosted regression trees (BRT), artificial neural networks (ANN), and Bayesian networks (BN). The statistical learning framework uses cross-validation (CV) training and testing data to tune the complexity of the models, and a separate hold-out data set for evaluation of final models. With these data, the order of predictive performance based on both CV testing and hold-out R2 values was BRT > BN > ANN. For each method we identified two models based on CV testing results: one with maximum testing R2 and a simpler version with R2 within one standard error of the maximum (the 1SE model). The former yielded training R2 values of 0.94 - 1.0 and the 1SE versions had training R2 values of 0.90 - 0.91.
Cross-validation testing R2 values indicate predictive performance, and these were 0.22 - 0.39 for the maximum R2 models and 0.19 - 0.36 for the 1SE models. Evaluation with hold-out data suggested that the 1SE BRT and ANN models predicted new data better (R2 = 0.12 - 0.26) than the maximum CV-testing R2 versions. In contrast, a multiple linear regression model explained less than half the variation in the training data and had a hold-out R2 of 0.07. Scatterplots of predicted vs. observed hold-out data obtained for final models helped identify prediction bias, which was greater for ANN and BN than for BRT. Spatial patterns of predictions by the final 1SE BRT model agreed reasonably well with previously observed patterns of nitrate in domestic wells of the Central Valley. According to a map of model predictions, groundwater at domestic well depth in the San Joaquin Valley (south part of the Central Valley) is generally more vulnerable to nitrate than that of the Sacramento Valley to the north, particularly in the eastern and western alluvial fans. Sediments in the Sacramento Valley are generally finer grained and have higher Fe and Mn concentrations, conditions that are less conducive to nitrate occurrence in groundwater.
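The selection of a maximum-R2 model and a simpler 1SE model from CV testing results can be illustrated with a minimal Python sketch. It is not the code used in this study: the function name select_one_se, the use of tree count as the complexity measure, and the per-fold R2 values are all hypothetical, and the standard error is assumed to be the standard deviation of fold scores divided by the square root of the number of folds.

```python
from math import sqrt
from statistics import mean, stdev


def select_one_se(cv_scores):
    """Pick the max-R2 and 1SE complexity settings from CV testing scores.

    cv_scores maps a complexity value (larger means more complex) to the
    list of per-fold testing R2 values obtained at that complexity.
    Returns (best_complexity, one_se_complexity).
    """
    # Mean testing R2 and its standard error for every complexity setting.
    summary = {}
    for complexity, scores in cv_scores.items():
        se = stdev(scores) / sqrt(len(scores))
        summary[complexity] = (mean(scores), se)

    # Setting with the highest mean CV testing R2.
    best = max(summary, key=lambda c: summary[c][0])
    best_mean, best_se = summary[best]

    # Simplest setting whose mean R2 is within one SE of the maximum.
    threshold = best_mean - best_se
    one_se = min(c for c, (m, _) in summary.items() if m >= threshold)
    return best, one_se


# Hypothetical per-fold testing R2 values keyed by number of trees.
# These numbers are made up for illustration and are not study results.
cv_scores = {
    100: [0.20, 0.24, 0.22, 0.18, 0.26],
    500: [0.35, 0.39, 0.37, 0.33, 0.41],
    1000: [0.36, 0.40, 0.38, 0.34, 0.42],
    2000: [0.34, 0.38, 0.36, 0.32, 0.40],
}

best, one_se = select_one_se(cv_scores)
print(f"max-R2 model: {best} trees; 1SE model: {one_se} trees")
```

In this made-up example the 1000-tree setting has the highest mean testing R2, while the 500-tree setting falls within one standard error of it and would be kept as the simpler 1SE model. A hold-out data set, never used in tuning, would then be needed to compare the two candidates.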