Data augmentation for bias correction in mapping PM2.5 based on satellite retrievals and ground observations
Abstract: As most air quality monitoring sites worldwide are located in urban areas, machine learning models may produce substantial estimation bias in rural areas when deriving spatiotemporal distributions of air pollutants. The bias stems from dataset shift, as the density distributions of predictor variables differ greatly between urban and rural areas.
We propose a data-augmentation approach based on multiple imputation by chained equations (MICE-DA) to remedy the dataset shift problem. Compared with the benchmark models, MICE-DA exhibits superior predictive performance in deriving the spatiotemporal distributions of hourly PM2.5 in the megacity of Chengdu, at the foot of the Tibetan Plateau. The improvement is most evident in correcting the estimation bias, with the mean bias changing from -3.4 µg/m3 to -1.6 µg/m3. As a complement to the holdout validation, the semi-variance results show that MICE-DA preserves the spatial autocorrelation pattern of PM2.5 over the study area reasonably well. The essence of MICE-DA is strengthening the correlation between PM2.5 and aerosol optical depth (AOD) during the data augmentation. Consequently, the importance of AOD for predicting PM2.5 is substantially enhanced, and the summed relative importance of the two satellite-retrieved AOD variables increases from 5.5% to 18.4%. This study resolves the puzzle of why AOD has exhibited relatively low importance in local or regional studies. The results can advance the use of satellite remote sensing in air quality modeling while drawing more attention to the common dataset shift problem in data-driven environmental research.
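The two evaluation quantities named in the abstract, mean bias and semi-variance, can be illustrated with a minimal sketch. It is not the study's implementation: the function names and the toy values are hypothetical, and the semi-variance uses the classical (Matheron) estimator as an assumption, since the abstract does not state which estimator was applied.

```python
import math

def mean_bias(predicted, observed):
    """Mean of (predicted - observed); negative values indicate underestimation."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p - o for p, o in zip(predicted, observed)) / len(predicted)

def empirical_semivariance(coords, values, lag, tolerance):
    """Classical (Matheron) estimator, assumed here for illustration:
    half the mean squared difference over all point pairs whose
    separation distance lies within lag +/- tolerance."""
    total, count = 0.0, 0
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            distance = math.dist(coords[i], coords[j])
            if abs(distance - lag) <= tolerance:
                total += (values[i] - values[j]) ** 2
                count += 1
    # No pairs at this lag: the semi-variance is undefined.
    return total / (2 * count) if count else float("nan")

# Hypothetical toy data (arbitrary units), not taken from the study.
observed = [40.0, 55.0, 62.0, 48.0]
predicted = [38.0, 52.0, 60.0, 47.0]
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]

bias = mean_bias(predicted, observed)
gamma_lag1 = empirical_semivariance(coords, observed, lag=1.0, tolerance=0.1)
```

Under this sign convention a negative mean bias means the model underestimates on average, so a change from -3.4 to -1.6 µg/m3 is a reduction in underestimation. Computing the semi-variance at several lags for both observed and predicted fields, and comparing the resulting curves, is one way to check whether predictions retain the spatial autocorrelation of the observations.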