Introduction

Air pollution has major adverse effects on health and climate. Recent research1 shows that in 2018 more than 8 million people died from acute and chronic respiratory diseases, lung cancer, stroke and heart disease related to fossil fuel pollution, indicating a much greater burden than suggested in previous studies2. Globally, air pollution plays a crucial but complex role in Earth’s climate and ecosystems3. Similar to the demand for high-resolution climate model estimates, reliable and finely-resolved air pollution estimates are needed to accurately evaluate its effects on humans and ecosystems at regional and local scales, and to inform effective control strategies4. Air pollutant concentrations can vary sharply over short spatial distances due to unevenly distributed emissions sources, meteorological transport dynamics and complex secondary chemical processes5, particularly in extreme air quality events6.

Despite advances in mechanistic and statistical modeling, predicting fine-scale concentration gradients remains a challenge due to sparsely distributed measurement data, difficulty in characterizing complex atmospheric physiochemical processes, and modeling limitations7. For mechanism-based methods including dispersion, photochemical and plume models, issues such as missing and noisy input data (e.g., insufficient or outdated emission inventories), ill-posed boundary conditions, and complex accumulative parameterizations may lead to high uncertainties for many fine-scale local applications5,8. Statistics-based machine learning methods solely rely on the empirical statistical associations between air pollutant concentrations and explanatory factors to make predictions. However, the formation and transformation of air pollutants involves varying emission sources, atmospheric transport and physiochemical reactions within the planetary boundary layer, which are also affected by altitude, surrounding land characteristics and local geography4,9. These features cannot be well captured by classical statistical or machine learning techniques. Furthermore, while statistical methods have reported improved predictive performance, they may produce physically inconsistent or implausible results10 due to sampling bias, lack of physical constraints, and/or confounding effects arising from spatially stratified heterogeneous processes11. These issues may be amplified given that models are often tested by taking random samples across space and time for validation, which does not provide accurate generalizable performance statistics when applying the model to make predictions at unobserved spatial locations. To improve upon these limitations, machine learning has been combined with climate or atmospheric chemistry models to speed up computation or improve prediction accuracy, by replacing time-consuming mechanism components (e.g., chemical integrator)12,13, memorizing the input-output relationship in mechanism models14, correcting the bias15 or representing subgrid process16. Furthermore, there has been a growing trend of integrating physical knowledge into deep learning techniques to model fluid fields, incorporating constraints such as partial differential equations (PDE)17,18, yet apart from few limited applications19,20 there continue to be gaps in effective strategies for embedding physics in deep learning to improve finely-resolved air quality assessments.

Here, we report a flexible deep graph hybrid modeling (DGM) approach (Fig. 1) incorporating fluid physics21 in deep learning to capture spatiotemporal interactions of air pollutants, encoding their dynamic characteristics and transport at fine scales to improve spatiotemporal predictions. We are motivated by the applicability of deep graph learning to encode neighborhood information that represents the dynamics of air pollutants9, and physical invariance to guide optimization17. For irregular non-Euclidean data with limited measurements, such as monitoring data of air pollutants, graph neural networks that are rooted on solid mathematical grounds of spectral graph theory22 and locality23, can be used to encode complex geometric structures and model their interactions. With graph operations such as Laplacian smoothing24, graph convolution network (GCN) can message and aggregate neighborhood information to represent interactions between nodes, with similar applications in protein-protein25 and disease-gene26 interactions and drug discovery27.

Fig. 1: Architecture of the deep graph network for air quality assessment.
figure 1

a The input data included atmospheric and surface grids, emissions or their proxies, meteorology, dry deposition, landuse and vertical profile. b The graph convolutions are constructed in multiple layers to simulate spatiotemporal dynamics of air pollutants, encoding their local spreading characteristics. c The GC multiscale outputs are concatenated with the input to enter full residual encoder-decoder to account for transformation and deposition. d The loss function included e1 (mean square error (MSE) between observed and predicted values), e2 (residual to encode PDE) and e3 (normalization).

Our modeling architecture utilizes graph convolutions (GC) to simulate the dynamics of air pollutants. Additionally, by incorporating PDE residuals, the predictions are guided to conform to the principles of mass conservation and continuity in atmospheric fluid physics28: the former assumes that the total mass of the air pollutant of interest in the atmosphere is nearly constant in balance during the study period; the latter assumes that the air pollutant fluid of interest is continuous (it contains no gaps). Given the absence of vertical measurements of air pollutant concentrations in the atmosphere (only ground measurements were available), our approach focuses on simulating the horizontal ground advection and diffusion of air pollutants. This pertains to local or fine-scale dynamics along the north-south and east-west directions on the ground surface. We do not explicitly account for vertical convection in our methodology. Thus, our architecture is built upon a 2-D Eulerian system (Fig.1a), rather than a 3-D system21. In this system, the evolution of air pollutant concentrations at a target location of ground is modeled using partial differential equations for advection, diffusion, chemical transformation, emission and deposition. Our modeling architecture is meshfree, making it different from traditional numerical methods like finite element or finite difference methods. Unlike these methods, which rely on a predefined mesh or grid structure to discretize the problem domain, our approach offers greater flexibility and adaptability, particularly when dealing with complex geometries or irregularly distributed samples. By employing a meshfree approach, our architecture allows the neural network to directly approximate the dynamics or PDEs at irregularly distributed sampling points within the domain. This eliminates the constraints imposed by a predefined mesh, enabling more effective learning of the underlying physics29.

Emissions

As an essential input, emission inventories are developed as a database listing the amount of air pollutants emitted over a specific time period (e.g., a year) by source. However, accurately quantifying emissions, especially in urban settings, poses significant challenges30; inventories tend to have a high degree of uncertainty and are often available at coarse spatiotemporal scales. In place of raw emission inventories, we used proxy variables including satellite-derived data (Moderate Resolution Imaging Spectroradiometer (MODIS) aerosol optical depth (AOD) for particulate matter, Ozone Monitoring Instrument (OMI) for nitrogen dioxide (NO2), The Modern-Era Retrospective analysis for Research and Applications, version 2 - Global Modeling Initiative (MERRA2-GMI), and traffic variables from OpenStreetMap). The satellite data have daily retrievals with global coverage. MERRA-2 GMI is driven by MERRA-2 meteorology to the Global Modeling Initiative’s (GMI) stratosphere-troposphere chemical mechanism run at ~50 km horizontal resolution and output onto the native MERRA-2 grid. Compared with the emission inventories, the MERRA2-GMI provides background levels of air pollution at global to regional scales.

Driven by wind, air pollutants are transported across locations, cities, states, regions or even globally. Mechanical models like GMI and CMAQ are effective in simulating pollutant movement on a regional to global scale. However, they struggle to capture local and fine-scale variations in air pollutant gradients. This limitation may arise from inadequate fine-scale bottom-up emissions data, complex local terrain, the presence of transboundary pollutants under certain micro environments, high uncertainty in parameterizations, and challenging physical assumptions. Additionally, reanalysis methods like MERRA2-GMI operate at a coarse resolution due to limitations in data storage and the time required for long-term simulations. To overcome these challenges, we propose the integration of deep graph learning with prior, mechanism and/or empirical knowledge. This approach allows for the simulation of fine-scale air pollutant transport in a more time-efficient manner, while also improving generalization in prediction. By adopting physics-inspired deep graph learning, we can reduce the limitations of mechanical models and gain a more comprehensive understanding of the complex dynamics of air pollution at local and fine scales.

Graph convolutions for fine-scale dynamics of air pollutants

While our architecture is designed based on a 2-D Eulerian system due to the absence of vertical concentration measurements, it is important to note that atmospheric vertical processes, such as boundary layer venting, play a crucial role in controlling pollutant concentrations. This is particularly significant during winter when the planetary boundary layer height (PBLH) is very close to the ground31. Therefore, the input data in our horizontal transport model included PBLH and vertical meteorological variables (e.g., wind speed, specific humidity, and temperature), as well as other variables to account for the influence of vertical processes on ground-level pollutant concentrations, in a simplified way. According to the fundamental assumptions underlying atmospheric motions, the continuity equation21 is used to simulate conservation of the air pollutant mass of interest on the ground. For a spatial location of interest, the continuity equation simulates the temporal evolution of the concentration of air pollutants, taking into account the pollutant sources, transport, sinks as well as meteorological, topological and geographical effects. Using Reynolds decomposition32,33, advection and diffusion processes were summarized based on direction velocities, the advection partial derivatives, diffusion coefficients and the eddy diffusion terms (Eq. 2). Therefore, the air pollutant concentration changes over time (Supplementary Fig. 1) were decomposed into three balanced components, advection, diffusion, and the others.

For diffusion, the GCN provides an essential solution to simulate the second-order derivative on the discrete graph space22,23, represented by the Laplacian convolution operator shown in Supplementary Fig. 2. Advection (Supplementary Fig. 1) can be decomposed along the directions of the connected nodes based on the differential formula.

A single fixed graph network is inappropriate for spatiotemporal estimation since such a network is inherently transductive and cannot naturally generalize to unseen data34. Here, we developed a hybrid semi-supervised deep graph model (Fig. 1), which uses local GC (Fig. 1b) to extract local spread characteristics of air pollutants from the multiscale nearest neighbors (cells in the regular grid system) of nodes, regardless of whether measured concentrations were available in neighboring nodes. To simulate the spatial spread of air pollutants at a local scale in the graph space, we employed a Eulerian regular 2-D grid system with a spatial resolution of 1 km and daily temporal resolution. The retrieval of cell neighbors was based on this grid system where the proximity of cell samples to a target spatial location played a significant role, and cells closer to the target were given higher weights (inversely proportional to their distance) during the aggregation of graph convolutions. Additionally, we considered the lag influence of samples on the target spatiotemporal point. To enhance computational convenience, we simplified the construction of a local graph. We performed a search for the k nearest spatiotemporal cell samples to the target point and linked them to create the local topology graph, as depicted in Fig. 1a, b. This local graph aimed to simulate the spatiotemporal dynamics of air pollutants for the target location. By leveraging local graph networks, we obtained a flexible graph-level solution for spatiotemporal simulation. To model pollutant transport, we employed deep graph learning due to its flexibility in graph convolutions. This approach was also suitable for handling non-Euclidean and irregular spatiotemporal samples, which are common characteristics of air pollution measurement data.

Therefore, with neighboring grids, the dynamic spread of pollutants was able to be modeled using embedded learning techniques based on graph convolutions. The inputs to the model represented emissions or their proxies, meteorological driving factors, land-use data, spatially-varying covariates, and temporal indices at multiple scales (Supplementary Table 1 for the complete list). With neighboring grid cells, we constructed multiple interconnected local graph convolutions (as shown in Fig. 1b) to simulate the local spread of air pollutants (Fig. 1b). Mapped relationships between the continuity equation and the simulation of air pollution advection/diffusion on the graph space and feature space are presented in Supplementary Fig. 1. We defined the convolutional operators in multilevel GC layers to simulate the transport of air pollutants.

Coupling with residual learning for transformation and deposition

Air pollutants from a wide range of emission sources are transported by winds and deposited by gravitational sedimentation or precipitation. In the DGM architecture, we introduced MERRA2-GMI as inputs to represent background pollution levels and used GC to simulate local and fine scale transport. To account for vertical variation, chemical transformation and deposition, we coupled the GC with a full residual deep network by concatenating the GC output with the aforementioned inputs (Fig. 1c) in a hybrid systematic framework. In addition, the attention layers were added to weigh the influence of different parts of the input on the model’s predictions35. In the hybrid systematic architecture, error information can be systematically back-propagated to the residual and GC layers during learning. Full residual deep learning has proved to be powerful for spatiotemporal regression of air pollutants36 and imputation of missing satellite data37. The necessary covariates were also introduced: vertical meteorological factors (e.g., relative humidity, temperature, wind speed and PBLH) for vertical influence on ground pollutant concentration, OMI-NO2 for O3, NDVI for dry deposition, and precipitation for wet deposition.

In addition, to support the governing mass conservation, we encoded physical invariance in the total loss through the soft residual constraint of the continuity partial differential equation (PDE, Fig. 1d). Recent physics-informed neural networks (PINNs)17,38 used the PDE residuals to improve physical relevant results. The deep learning process by gradient descent is also a simulation of the transport of the air pollutant of interest at a fine scale in the graph and feature space. Through this semi-supervised learning, the goal of minimizing the target loss function was achieved, and the optimal solution was obtained, which ensured robust predictions while maintaining continuity of the PDE17. We classified our method as “hybrid” based on two aspects: model structure and learning loss function. In terms of model structure, we coupled two types of architectures, namely graph convolutional network (GCN) and full residual deep network (FRDN), to leverage their respective strengths and enhance overall learning efficiency. The GCN was employed to capture dynamic information from the neighborhood, while the FRDN was used to improve the efficiency of regression learning for local variables. Furthermore, we have incorporated a combination of traditional data loss terms and physics-based regularization loss terms into our loss function (Supplementary Note 1). The purpose of incorporating these physics-based regularization loss terms into the model is to make the predictions of the model as consistent as possible with the principles of physics. By doing so, we reduced inherent bias and significantly improved the model’s generalization capabilities when making predictions in new scenarios.

Results

Mainland China, a large region with heterogeneous topography, diverse meteorology and multiple emission sources, was selected as the testbed for the DGM method (Supplementary Note 2). We trained seven national hybrid models for six air pollutants including daily average carbon monoxide (CO), nitrogen dioxide (NO2), fine particulate matter (PM2.5), coarse particulate matter (PM10) and sulfur dioxide (SO2), as well as hourly maximum daily ozone (abbreviated as O3M24) and 8 h average ozone (abbreviated as O3A8). Daily gridded surfaces with a spatial resolution of 1 × 1 km2 were also generated from 2015 to 2018. To assess the true generalization of the DGM method, three validations were performed, including site-based independent testing, comparison with MERRA2-GMI pollutant variables and sensitivity analysis without MERRA2-GMI input of pollutants.

Trained air quality models are typically used to make predictions at new/unobserved geographic locations. Model validation based on test sets that are constructed by randomly sampling across space and time can result in inflated confidence in the model’s out-of-sample prediction performance because it does not provide a genuine measure of the model’s ability to generalize to new spatial locations. Instead, we used site-based independent testing, which uses complete time series data at untrained locations for testing. In addition, in order to have a thorough validation of the DGM, we reported the sensitivity analysis results without the MERRA2-GMI pollutant input.

Our site-based independent tests included 288 sites (about 18%) that were randomly selected from 1604 monitoring sites; all the data from these sites were used only for testing but not for training. From the remaining samples, we randomly selected 78% for training and 22% for regular testing across space and time (stratified by spatiotemporal factors, i.e., province and month). We chose the sample so that the proportion of the total samples for regular and site-based tests was similar to the proportion ( ~ 0.37) of the unique test samples in the bootstrap 0.632 rule39 and 56 variables were used in training. In order to avoid the influence of random noise, we trained 100 models and summarized their average performance for each air pollutant, and then compared them with seven baseline machine learning or statistical regression models (for these methods and optimization of their hyperparameters, see Supplementary Table 2 and Materials and Methods). Sensitivity analysis was conducted to retrieve optimal weights (0.5 for α and 0.2 for β) for the regularizers. The results (Supplementary Tables 3, 4; Supplementary Fig. 3 for the boxplots of testing root mean square error (RMSE)) show that our hybrid method consistently performed better than all seven other machine learning and statistical regression models. For PM2.5 and PM10, our method had the highest site-based test mean R2 (0.85–0.87), an improvement of 13–85% over the other methods. Compared to PM2.5, we found a lower site-based test mean R2 for NO2 and O3A8 (0.66–0.78), but overall our hybrid method showed an improvement of 6–61% over the other methods. For SO2, random forest had a slightly better testing R2 (0.74 vs. 0.73) than our method but had lower site-based test R2 (0.58 vs. 0.63) and higher RMSE (13.11 vs. 12.37 μg/m3). Although the training R2 and RMSE of the full residual network36 were similar to our hybrid DGM (0.84–0.93 vs. 0.83–0.93), the test R2 and RMSE, especially independent test R2 and RMSE, showed poorer performance. Compared with the regular network, the hybrid method improved test R2 by 6–11%, and site-based independent test R2 by 7–21% (Fig. 2a), and it decreased test RMSE of the monthly mean by 4–18% and site-based test R2 by 5–21% (Fig. 2b). Our sensitivity analysis showed that GCN made a considerable contribution to the improvement in prediction (by 6–19% for testing and sited-based testing R2) and the PDE residual made additional 1–4% improvement in test and sited-based test R2. Furthermore, we evaluated the extent of mass conservation by estimating the e2 component of the total RMSE, based on the ratio of e2 to the overall loss. As shown in Supplementary Table 5, the converged small values of the e2 RMSE (ranging from 1.62E-5 to 0.134, approaching 0) indicate a high level of mass conservation with minimal bias. The models with GCN simulation and constrained by the PDE residual performed the best, with a steady learning process, and better generalization and extrapolation. In site-based independent tests, the samples from selected sites also exhibit fairly consistent distributions between observed and predicted values (Supplementary Fig. 4). In the site-based tests, on average, our DGM method consistently improved explained variance (R2) by 11–22%, and reduced RMSE by 12–35% compared to baseline statistical learning methods (Fig. 2a; Supplementary Table 6).

Fig. 2: Improvement in generalization in comparison with baseline machine learning methods and MERRA2-GMI data.
figure 2

a, b The increase in R2 (a) and the decrease in RMSE (b) by the physics-aware deep learning method for site-based testing and regular testing. c Mean time series of observed and predicted values for all sites in the site-based testing dataset (black dots: observed values). d The boxplots between daily grid mean estimates and daily MERRA2-GMI variables (Grid: 1 × 1 km2 grid mean estimates; BOT Bottom layer diagnostics, AD Aerosol diagnostics, SOF Daily satellite overpass fields, SLD Single-level diagnostics). In (d), the whiskers (solid horizontal lines) indicate the minimum and maximum correlations, excluding outliers; the box bounds the interquartile range from the 25th to 75th percentiles, with the 50s percentile (median) marked (bold horizontal lines); the whiskers (dashed vertical lines) extend 1.5 times the interquartile range from the box, defining the range of typical correlations; data points (small circles) beyond the whiskers are outliers.

We also conducted a comparison between our method and the baseline machine learning methods by analyzing observed and predicted time series. For each independent test site, we calculated the root mean square error (RMSE) for each machine learning method. The results indicate that our method effectively captures temporal trends, as evidenced by the violin plots of the site-based RMSE in Supplementary Fig. 5. Additionally, when comparing the total monthly means of predicted and observed values (Fig. 2c) and the total RMSE (Supplementary Table 7) for all these sites, our method considerably outperforms other machine learning methods. It produced more stable time series with significantly smaller deviations (much less RMSE) from the observed values.

Correlations of our DGM point and 1 × 1 km2 grid estimates with ground measurements were considerably stronger than those of MERRA2 GMI reanalysis data (Supplementary Table 8 and Fig. 2d: 0.76–0.92/0.66–0.83 vs. 0.005–0.53). For example, the MERRA2-GMI PM2.5 and NO2 had a correlation of about 0.41–0.53 and the other MERRA2-GMI pollutant variables had a poor correlation (0.005–0.36). This comparison suggests that the coarse-scale reanalysis MERRA2-GMI data could be quite biased against ground measurements. With finer spatial resolution, our estimates improved correlation coefficients significantly. We also reported the seasonal and yearly performance metrics for the site-based independent test (Supplementary Tables 9, 10). In comparison to a recent study that utilized the Weather Research and Forecasting model with Chemistry (WRF-Chem) at high spatial resolution for the same region and year (2015)40, our results demonstrate significantly stronger Pearson’s correlation and lower mean bias (as shown in Supplementary Table 10). Specifically, our method exhibited correlations ranging from 0.77 to 0.92 for PM2.5, PM10, O3, and NO2, whereas the previous study showed correlations of only 0.29 to 0.51 for the same pollutants. Additionally, when considering the correlation computed from the mean temporal evolution averaged across all monitoring stations, our method achieved an impressive correlation of 0.99, surpassing the range of 0.71 to 0.90 reported in the previous study. It is worth noting that our approach demonstrated comparable or superior performance to several other investigations41,42,43 employing WRF-Chem for O3, NO2, and/or PM10 in different regions, although making direct comparisons may not be entirely appropriate due to potential variations in methodology and data sources.

On the other hand, using MERRA2-GMI or other reanalysis pollutant data in our DGM provided the information on regional and global scale emissions and transport of air pollutants. For example, through the SHAP (SHapley Additive exPlanations), we summarize the contributions of the emission-related components to the PM2.5 predictions in mainland China and its four representative regions (Supplementary Note 3; Supplementary Fig. 6). Nevertheless, sensitivity analyses (Supplementary Tables 3, 4) showed a small (0–4%) contribution to the independent test R2 by the MERRA2-GMI pollutant data, illustrating the robust performance of our approach for fine-scale air quality assessment even without the MERRA2-GMI pollutant input.

Interpretation of improved air quality assessment

Compared to purely data-driven machine learning methods, our approach offers a simulation-based solution for capturing the advection and diffusion of air pollutants. We achieve this by constructing local graphs to simulate spatiotemporal dynamics and incorporating the continuity equation into the training cost function. As a result, our method effectively captured the fine-scale transport of pollutants and significantly reduced estimation bias. For PM10, we generated nationwide concentration grids with similar or finer spatial resolution (1 km vs. 1–10 km) and higher generalization accuracy (site-based R2: 0.85 vs. CV or out-of-station R2: 0.82–0.83) compared to44,45,46,47. For the recent studies of PM2.5 in China47,48,49,50,51, we had higher generalization (site-based R2: 0.87 vs. CV or out-of-station R2: 0.64–0.85) at the finer spatial scale (1 km vs. 1–50 km). For O3A8 and NO2, our method had similar generalization (site-based R2: 0.72–0.78 vs. out-of-station or out-of-city R2 0.71–0.80) at 1 km spatial resolution compared to recent studies52,53. Although direct comparison is inappropriate, compared to the existing studies, our physics-inspired approach demonstrates improved accuracy in capturing the temporal evolution of fine-scale air pollutants. By incorporating fluid physics principles and encouraging mass conservation, our model yields predictions that more effectively reflect the dynamics of these pollutants. Additionally, our model produces daily spatial surfaces of air pollutants that exhibit continuous variations without abrupt changes or discontinuities, in contrast to the outputs of tree-based machine learning algorithms. This demonstrates our model’s ability to accurately simulate air continuity. Based on the 1 × 1 km2 grid estimates of air pollutants, we investigate the evolutions of their spatial variability and interpret them in terms of regional/global scale transport and fine-scale dynamics of air pollutants and meteorological characteristics. For a daily grid of 1 × 1 km2, we extracted a heatmap of neighborhood features from the output of the final graph convolution layer, representing the spatial agglomeration, and visualized them using the contours.

For inert air pollutants including PM2.5 and PM10, multiscale pollutant transport has important influence upon their spatiotemporal variability. On April 15, 2015, Beijing experienced the biggest sandstorm with the highest PM10 levels seen since 200254 (Fig. 3a–c). Meteorologically, the Rossby waves and intense frontal activities in the spring resulted in strong wind gusts that injected large amounts of material into the lower and middle troposphere. Dust uplifted from the arid and semi-arid regions was transported to the downwind regions of North China55. Backward trajectories (http://ready.arl.noaa.gov/HYSPLIT.php) of the winds showed two main sources of dust, i.e. Northwest China and Southern Mongolia for the sandstorm incidence in the Beijing city (Fig. 3b). The spatiotemporal evolution of PM10 (Supplementary Fig. 7) shows that the dust transport pattern was well captured by the extracted graphical convolution heatmap, and the estimated PM10 exactly matched the trends in the surface wind and geopotential height (GPH) fields. The spatiotemporal evolution of PM2.5 (Supplementary Fig. 8) during this sandstorm was different from that of PM10, because the latter was mainly from the long-distance transport of dust and aerosols, while the former was more from local transport and sources, including power plants, industries, and automobiles, etc. At the local or fine spatial scale, our estimated PM2.5 and PM10, respectively, matched well with the ground measurements on the sandstorm day, as shown in the enlarged local maps and indicated by the high correlations in the independent point estimates and the grid estimates (0.80–0.87), that were much higher than the reanalysis estimates. In addition, compared to the measured data, our estimated concentrations had similar normal distributions, and the 2D histogram of PM2.5 vs. PM10 of the measurement data was consistent with that of the finely resolved estimated data (Supplementary Fig. 9).

Fig. 3: Four representative events of air pollution presented by the predicted surface grids.
figure 3

a–c The predicted PM10 surfaces on a sandstorm event in North China (c) of April 15, 2015 and the day (a) before it, and the difference (b) between the two. d–f The predicted PM2.5 surfaces on a haze event in East China (f) of December 23, 2016 and the day (d) before it, and the difference (e) between the two. g–i The predicted O3A8 surfaces on a severe ozone event in Beijing (i) of July 1, 2017 and the day (g) before it, and the difference (h) between the two. j–l The predicted NO2 surfaces on a haze event in Shanghai (l) of December 19, 2018 and the day (j) before it, and the difference (k) between the two. The blue arrow is mean ground wind vector (m/s) and the black isolines present GPH in meter at 500 hPa. The dash lines in (b, e, h, k) show the HYSPLIT clusters of backward air trajectories at 500 m with the percentage.

Similar to PM10, interregional transport may play an important role in PM2.5 pollution, although the latter is partially from local emission sources. For the haze incident in East China in late December 2016 (Fig. 3d–f), the updraft in front of the 500 hPa high-altitude trough was conducive to the regional transport of pollutants, and in the northwest wind field before high air pressure, the majority of PM2.5 in the haze incidence was composed of upstream pollutants being transported to the downwind areas of East China such as Jiangsu Province and Shanghai. The backward trajectory analysis showed two main sources of haze in Shandong Province and the Beijing-Tianjin-Hebei region (Fig. 3e). In the spatiotemporal evolution of PM2.5 (Supplementary Fig. 10), the graph convolution features were shown to capture regional and local meteorological transport of PM2.5 well.

As a reactive pollutant, O3 is mainly formed by photochemical reactions of volatile organic compounds (VOCs), CO and nitrogen oxides under sunlight, especially the UV spectrum. An intensive and persistent regional ozone pollution episode occurred in eastern China from 25 June to 5 July 2017, mostly located in the Beijing-Tianjin-Hebei and the surrounding area (BTHS) (Fig. 3g–i). A study56 showed that the prevailing northwesterly winds in the mid-high troposphere hindered the northward movement of the Western Pacific Subtropical High, thereby inhibiting the transport of water vapor from low latitudes and providing favorable meteorological conditions (low relative humidity, less cloud cover, strong solar radiation and temperature inversion) for local formation of O3; in addition, warm advection due to southerly winds prevailing in the lower troposphere also brought pollutants to BTHS, as shown in the spatiotemporal evolution of ozone during this period (Supplementary Fig. 11). Backward trajectory analysis showed considerable contributions from southern regions. Compared with other reactive pollutants such as CO and NO2, O3 had much higher spatial autocorrelation56 as seen in the grid estimates, which showed a more gradual spatial gradient pattern at fine spatial scales and the estimates were in high agreement with the ground measurements (correlation: 0.81–0.92).

As a highly reactive gas and precursor of surface and tropospheric O3, NO2 has a relatively short atmospheric lifetime compared to PM. In winter, when solar radiation, relative humidity and precipitation are low, the cool and dry conditions of the troposphere slow down photochemical reactions, allowing pollutants to accumulate, and the lifetime of NO2 is longer than in spring and autumn. For the 2018 winter haze in Shanghai (Fig. 3j–l), in addition to local emission sources (power plants, biomass burning, and fossil fuel combustion etc.), the analysis also revealed partial contributions from regional transport of pollution in nearby cities in North, Central, and East China. The dominant NO2 source regions usually correspond to areas with high population, heavy traffic and large industry scale. Given high traffic leading to high NO2, local hot spots in urban areas show up around heavy traffic routes, traffic intersections and transportation hubs57,58. The spatiotemporal evolution (Supplementary Fig. 12) of NO2 during this episode presented such hot spot patterns on fine scales, with high correlation (0.81–0.88) with measurements.

Further investigation of other days and pollutants (SO2 and CO) also revealed the critical role of graph convolutions and PDEs in simulating advection and dispersion. For inert background air pollutants (PM2.5 and PM10), their concentration profiles are dominated by advection and diffusion, with regional and local transport contributing much more to their spatiotemporal variability than other pollutants. The results showed that graph convolutions for simulating local pollutant transport achieved the largest improvement (14–15% in R2) for PM2.5 and PM10, which was not seen using pure deep learning. Furthermore, MERRA2-GMI inputs had little effect on the PM2.5 and PM10 estimates as they involved fewer chemical transformations compared with reactive pollutants. On the other hand, for reactive air pollutants involving complex physiochemical reactions that are subject to meteorological conditions, although the contribution of global and regional pollutant transport may be limited, by simulating fine-scale pollutant transport, graph convolutions still achieved significant improvements (R2: 7–13% for NO2 and O3; 4–6% for CO and SO2). The MERRA2-GMI input provided the relevant background-level data on precursor chemicals and vertical meteorology for reactive air pollutants, that helped improve their estimates, albeit limited.

Discussion

By simulating spatiotemporal dynamics of air mass, our deep graph learning method achieved better generalization with physical interpretability than purely mechanistic or machine learning methods. The fine-scale (1 × 1 km2) gridded estimates accurately represented the spatiotemporal distribution of all air pollutants and enabled more reliable and finer statistics of national air pollutant trends than previously reported. A literature review shows that to date few studies have reported high-resolution predicted surfaces for all criteria pollutants and total air quality index (AQI, for the quantification definition, refer to Supplementary Note 4). This study fills this gap and provides a fine-scale grid of AQI for China, providing important information over a large geographic area that does not have monitoring stations. Nationwide, the population-weighted means (Supplementary Fig. 13) were much higher than the raw means for NO2 (by about 9 μg/m3) and PM2.5 (by about 13.5 μg/m3), indicating that they were mainly distributed in densely populated areas. The yearly summary (Supplementary Fig. 14 and Supplementary Table 11) shows the there was a national decline from 2015 to 2018 for the population-weighted PM2.5 (by 11 μg/m3), PM10 (by 11 μg/m3), SO2 (by 6 μg/m3) and CO (by 0.12 mg/m3) but a national rise in O3 (7 μg/m3 for the 8 h daytime average). The national trends of air pollutants were generally consistent with ground measurements and reanalysis data, but since most monitoring sites were located in urban areas, the limited measurement data mostly overestimated the national averages by about 6–52%, and the reanalysis data also substantially underestimate the national averages due to coarse spatial scales. In addition, the daily national air quality index (AQI) results (Supplementary Figs. 15, 16) provided a fine representation of the spatiotemporal distribution of each air pollutant, and the aggregated trend was consistent with measurement-based AQI59 (Pearson’s correlation: 0.76–0.92). The highest concentrations in summer were PM10 for the Xinjiang desert area and O3 for most of the other areas; while in winter it was PM2.5 for most of the eastern areas of China and PM10 for the other areas (Supplementary Figs. 17, 18).

Compared with the national point-based AQI (https://www.iqair.cn/cn/china), the statistics of the predicted grids more accurately show the spatiotemporal distribution patterns of air pollutants. From 2015 to 2018, the population-weighted daily AQI (defined in60) declined by about 6 (by about 3 for the naïve AQI mean; Fig. 4a for monthly mean; Supplementary Table 11 and Supplementary Fig. 19). Compared with other air pollutants, PM2.5 and SO2 decreased by about 18–30%, and PM10 and CO decreased by about 12–13%. The mean value of NO2 dropped slightly by about 4% (by about 3% for its population-weighted mean). The national daily population-weighted 8 h average O3 rose by about 8%. The results show that there were higher PM2.5 concentrations partially in the eastern areas of China in winter, and the northwestern region of Xinjiang had higher PM10 concentrations than the other regions. For mainland China, the air quality in summer is much better than in winter (AQI: 35–41 vs. 47–58). The grids (Fig. 4b) of AQI change show that from January 2015 to December 2018, the monthly AQI in winter declined by about 11 in AQI and the PM2.5 and SO2 had the maximum decline (by about 13–15 in AQI); from July 2015 to July 2018, the monthly AQI in summer declined by about 5 and PM2.5 and SO2 had the maximum decline (by about 10) (Fig. 4c). In China, the primary pollutants in winter included PM2.5 and PM10, while the primary pollutant in summer was O3. Spatially, concentrations of PM2.5, PM10 and CO were higher in the Beijing-Tianjin-Tangshan and Xinjiang areas than in the Yangtze River Delta and the Pearl River Delta. From 2015 to 2018, the Beijing-Tianjin-Tangshan area, Jilin Province and Sichuan Basin experienced the largest declines in AQI.

Fig. 4: National population-weighted AQI changes.
figure 4

a Monthly trend for total AQI and each air pollutant AQI with the shade of 95% confidence intervals. b, c Total air quality index (AQI) changes in winter between 2015 and 2018 (b), and in summer between 2015 and 2018 (c) in mainland China (mean AQI change by latitude, below in b, c; mean AOI change by longitude on the right in b, c).

Our results show that from 2015 to 2018, air pollution in mainland China dropped significantly, which is consistent with the clean air action in China61. However, as one criteria air pollutant, O3 has presented a national upward trend (by about 8%) from 2015 to 2018, which is consistent with reports62. Climate change can result in the increase of tropospheric O3 since increased air temperature results in O3 formation. Data on daily air temperature statistics (Supplementary Fig. 20) support this finding. As a greenhouse gas, O3 can raise the global temperature and has adverse effects on health. For China, while maintaining lower emissions policies for other criteria air pollutants, it is time to take a cooperative strategy to efficiently control emissions of volatile organic compounds and other pollutants that contribute to the formation of O3.

Our extensive evaluation effectively reveals the power of a physics-inspired deep graph learning approach for finely resolved air quality assessments. First, multilevel graph convolutions can be constructed to simulate the spatiotemporal dynamics of air pollutants to capture pollutant transport on a fine scale. Graph convolution is a tool to model local interactions and movements by its Laplace differential operator as the gradient divergence22. Graph-theoretical methods have brought about a paradigm shift in the exploration of complex systems63, including atmospheric chemical mechanisms. In their study, Silva et al. 64. utilized species-reaction graphs to uncover similarities among different chemical reaction systems, unveiling novel characteristics that transcend existing domain knowledge and thereby advance our understanding of atmospheric chemical mechanisms within intricate geosystems. Diverging from the species-reaction graph approach, this research introduced a pioneering methodology by utilizing local graphs as a roadmap to precisely simulate the transport of air pollutants originating from immediate and geographically intricate surroundings at a fine scale. Through efficient embedding learning, our approach significantly enhanced the prediction of product outcomes resulting from the advection and diffusion of air masses, as well as atmospheric chemical reactions. These intricate processes are profoundly influenced by a multitude of factors, including the surrounding atmosphere, surface characteristics, and geographical environment. Here, local graph convolutions are used to overcome the limitation of convolution and recurrent neural networks to handle the ubiquitous irregularity and sparsity of air quality monitoring data. They facilitate the utilization of substantial quantities of unlabeled or unmeasured data in a semi-supervised approach. Whether inert or reactive pollutants, pollutant transport plays an important role in finely resolved air quality assessment, although we found that it may contribute more to inert regional air pollutants such as PM2.5 and PM10. Second, in order to make the final predictions tractable, we encoded the constraint of the continuity partial differential equation as a loss residual so as to strengthen physical invariance. Different from the simulation in the graph space and hard constraint of mass conservation in ref. 20, the PDE residual constraint acted as a soft constraint that encouraged the final predictions to conform to the mass conservation for the ground air pollutants along the timeline. Both ways of encoding physical concepts resulted in parameters that were optimized to better represent physical processes than the model without any physical constraints. Third, using a smaller spatial resolution, we were able to simulate spatiotemporal dynamics on a finer spatial scale, which may be useful for reactive air pollutants with rapid distance decay gradients from the emission sources, such as NO2, SO2 and CO. Using more graph convolutional layers, we could extend the simulation to a longer range of pollutant transport, which favors regional pollutants with slower decay gradients such as PM2.5, PM10, and O3. To account for the transformation and deposition of air pollutants, the full residual and attention layers were systematically concatenated with the graph convolution layers to reduce potential over-smoothing in GCs and strengthen the contribution of local and non-traffic factors to the final prediction65.

Essentially, the physics-inspired hybrid deep graph network is built on a flexible and unified architecture of air pollutant emissions, pollutant transport, and deposition, and is designed for spatiotemporal prediction. Machine learning has played a crucial role in improving air quality assessments over the past few decades. Its effectiveness and computational efficiency have made it a popular choice for enhancing accuracy and scalability in this field66,67. In previous efforts, Keller12 and Kelp13 used machine learning to replace time-consuming components in the physical models to accelerate emulating of the chemistry components; Sturm14 designed the mass- and energy-conserving framework for using machine learning; Ivatt15 used gradient-boosted regression trees to reduce the bias in physical model outputs. However, the lack of physical mechanisms in most machine learning methods may lead to solutions that seriously violate physical principles such as conservation laws, with potential bias and low interpretability68. For example, tree-based machine learners (e.g., random forest, XGBoost and extremely randomized trees) have been widely used alone46,50,69 or in physical models12,15 with reported high accuracy, but the binary discretization in these algorithms inherently violate the continuity law of air pollutants, which can lead to discontinuity bias36. In contrast to these existing methods70,71, based on deep neural networks as a general nonlinear approximator of any continuous functions72, the DGM directly simulated dynamics of air mass by graph convolution and encoded the fluid physical laws of mass conservation and continuity by PDE residuals, which is a creative combinational strategy for computing efficiency and generalization. Our approach outperformed simulations conducted using mechanism models like MERRA2-GMI or high-resolution WRF-Chem, highlighting the significance of integrating mechanism models with advanced deep learning techniques for substantial improvements in spatiotemporal predictions of air pollutants. This underscores the importance of leveraging the strengths of both methodologies to achieve more accurate and reliable results in predicting air quality dynamics.

In physical models, the advection velocity and diffusion coefficients are affected by meteorological factors, gravitational settling, viscosity of specific air pollutant, terrain, soil and land cover, which are hard to obtain due to missing or ill-posed boundary data and complicated atmospheric process8,21, and require to be specified manually. The proxy coefficients in the DGM overcame this as they were adaptively learned through end-to-end automatic differential learning. When the concentration measurements alone do not provide sufficient information to guarantee a single solution for these coefficients, we could provide additional data on them to enforce model training. However, it is impossible to acquire these coefficient measurements in practice and solving inverse problems with hidden physics is prohibitively expensive due to high uncertainty in simulation formulas17. Similar to the simulation by hidden fluid mechanics18, we achieved a high quantitative agreement between observed and predicted proxy coefficients by relying solely on the information contained in the observed concentration data.

By simulating spatiotemporal dynamics of air mass and encouraging the adherence to the physical laws of mass conservation and continuity for all samples, including prediction points and those beyond the observation dataset, the trained model exhibits robust accuracy in spatiotemporal extrapolation (low generalization error), less affected by the values outside the training samples in predictions. Further, in this flexible architecture, satellite variables and reanalysis data of air pollutants provided clues for pollution background; alternatively, with reanalysis or inversed meteorology, the data from historical and future emission inventories can be used to predict future concentrations of criteria pollutants under certain carbon emissions and climate change scenarios. The MERRA2-GMI or other reanalysis data provide vertical background pollution levels that can be used to model regional-scale pollutant transport, or as precursor proxies for reactive air pollutants, but these coarse-scale data, if missing, have limited effect on generalization of DGM.

Methods

Dataset

The observed data consisted of concentrations of six air pollutants (daily average of CO, NO2, PM2.5, PM10, and SO2; hourly maximum of daily O3 and 8 h average of O3) collected from monitoring stations of the China Environmental Monitoring Station. From January 1, 2015 to December 31, 2018. A total of 1,913,012 samples were obtained from 1604 monitoring stations covering mainland China. We had 56 spatial and temporal input variables including 2 spatial coordinates, elevation, day of year, ground aerosol coefficient, Aura Ozone Monitoring Instrument (OMI) NO2, planetary boundary layer height (PBLH), 4 high-resolution meteorological variables, normalized difference vegetation index (NDVI), an indicator variable of workday, roadway length within a 10 km buffer around each monitoring site, 26 component pollutant variables and 10 reanalysis vertical meteorological variables. For more details, please see Supplementary Note 1 and Supplementary Table 1.

Modeling architecture

The architecture is constructed based on a Eulerian system where the change of air pollutant concentration at a specific spatial location of interest (Fig. 1a) is modeled using the following simplified partial differential equations (PDE):

$$\frac{\partial C}{\partial t}=-\nabla (VC)+p{\nabla }^{2}C+R+E-F$$
(1)

where \(C\) is air pollutant concentration, \(\partial t\) represents the time derivative, \(\partial C/\partial t\) represents concentration change in each target location over time, \(\nabla (VC)\) represents advection, \(V\) represents velocity, \(p\) is the coefficient for the diffusion term (\({\nabla }^{2}C\)), \(R\) represents chemical transformation, \(E\) stands for emission, and \(F\) represents deposition.

To model over a surface, Eq. 1 uses horizontal advection and diffusion (\({l}_{x}\) and \({l}_{y}\) directions); vertical convection (\({l}_{z}\) direction) is not explicitly considered, but proxy variables are used to account for variation of vertical convection. Advection and diffusion can be simplified32,33:

$$\frac{\partial C}{\partial t}=-{v}_{{l}_{x}}\frac{\partial C}{\partial {l}_{x}}-{v}_{{l}_{y}}\frac{\partial C}{\partial {l}_{y}}+{p}_{{l}_{x}}\frac{{\partial }^{2}C}{{\partial }^{2}{l}_{x}}+{p}_{{l}_{y}}\frac{{\partial }^{2}C}{{\partial }^{2}{l}_{y}}+R+E-F$$
(2)

where \({v}_{{l}_{x}}\) (\({v}_{{l}_{y}}\)ss) is the Reynolds velocity of the air pollutant of the \({l}_{x}\)(\({l}_{y}\)) direction of the spatial location of interest, \(\partial C/\partial {l}_{x}\) (\(\partial C/\partial {l}_{y}\)) denotes the advection partial derivative, and \({p}_{{l}_{x}}\) (\({p}_{{l}_{y}}\)) is the diffusion coefficient (simplified as the sum of eddy and molecular diffusion coefficients), \({\partial }^{2}C/{\partial }^{2}{l}_{x}\) and \({\partial }^{2}C/{\partial }^{2}{l}_{y}\) represent the eddy diffusion terms. The air pollutant concentration changes over time (Supplementary Fig. 1) are decomposed into three components, advection, diffusion and the others.

For the diffusion process, the GCN provides an essential solution to simulate the second-order derivative on the discrete graph space22,23, represented by the Laplacian convolution operator shown in Supplementary Fig. 2. Advection can be decomposed along the directions of the connected nodes based on the differential formula and the conservation of mass. We obtain the total convolutional operation to simulate the pollutant traffic on the graph space (Supplementary Fig. 1d):

$$\frac{\partial C}{\partial t}=-\bar{V}LC-pLC+\rho C=-(\bar{V}L+p{D}^{-\tfrac{1}{2}}A{D}^{-\tfrac{1}{2}})C+\rho C$$
(3)

where D is the degree matrix, L is the Laplacian matrix, A is the adjacency matrix to quantify spatial relationship, p is the diffusion coefficient, \(\bar{V}\) is the learnable velocity matrix based on the Laplacian matrix, A, and ρ represents a proportion factor to account for the concentration changes from chemical transformation, emissions and deposition.

We constructed the semi-supervised multilevel local graph modeling architecture (Fig. 1). For each spatiotemporal node of interest, multilevel local graph convolutions were constructed in sequence based on the k nearest neighbors73 to capture neighborhood information to encode multilevel dynamic processes of air mass. We characterized the space and time dimensions using a discrete data model to simulate the variation of spatiotemporal fields. Specifically, we used k-NN to find the nearest spatiotemporal samples based on the spatiotemporal distance to construct graph convolutions for each target node. The temporal and spatial dimensions had different units (coordinate space: meter; time: day) and needed to be fused to construct the deep graph network using k-NN. To fuse them, the standard deviation was used to normalize the coordinates (x and y) and time (day) to make them unitless. The reciprocal of the distance was also used as a weight in aggregating neighborhood information. The multilevel neighbors were retrieved based on the search depths to capture distance-varying neighborhood influence on a target node. For each graph convolution, we trained a set of functions that learned to aggregate a spread feature from a node’s local neighborhood. Our GCN is an inductive framework34, different from the transductive fixed graph, and it can be easily generalized to the same form of features or invisible nodes. In our application we used four graph convolutions where each node had k (k = 12 was an optimal solution of grid search) neighbors. For graph convolutions, the vector size was (in order) 128, 64, 32, and 1.

The output of each graph convolution was merged with the original input features as the input to the downstream components, a full residual deep network with the attention layers (Fig. 1a–c). The full residual encoder-decoder36 was fused with the neighborhood features to account for local sources and sinks, and transformation. In order to reduce or avoid gradient vanishing in training, residual connections were introduced from each encoding layer to its corresponding decoding layer. This fusion also alleviated the potential over-smoothing issue in GC74. The attention layers were added with the residual layers after the input layer to allow the model to focus and place more “attention” (weights) on the important parts of the input features35,75. For the residual network topology, the number of nodes in each encoding or coding layer was 512 → 320 → 256 → 128 → 96 → 64 → 32 → 16; the number of nodes in each decoding layer was symmetric with its corresponding encoding layer. The final output was inversely normalized and exponentially transformed to represent the air pollutant concentrations in their original units. Graph convolutions extracted the spread feature from the input data of the multilevel neighbors for a node of interest. Irregularly distributed sparse or limited measured data at sample locations was used as the dependent variable to drive training of the entire model.

In addition, in order to make the final concentration predictions to maintain the continuity partial differential equation (Eq. 1), we introduce a residual term, \({e}_{2}\) to encode such physical invariance of PDE in the loss function:

$$L({\theta }_{{\bf{W}},{\bf{b}}})={e}_{1}+\alpha {e}_{2}+\beta {e}_{3}$$
(4)

where \({e}_{1}\) is the mean square error (MSE) function for the observed values and the predicted values, \({e}_{2}\) is the residual errors for the continuity equation (defined according to Eq. 1), \({e}_{3}\) is the normalization for the parameters, and a and β are the scaled weights for \({e}_{2}\) and \({e}_{3}\), respectively. The weight (a or β) can vary from 0 to 1, where a larger weight indicates under-fitting, while a smaller weight implies over-fitting. In order to determine the optimal weights, a sensitivity analysis was performed by testing a range of values between 0 and 1, with an interval of 0.1 for a and β. See Fig. 1d.

For each of the seven air pollutants considered, the system model was constructed and trained, using the logarithmic and normalized transformation of the pollutant concentrations as the dependent variable. In training, the minibatch size was 2048, and the initial training rate was 0.001 and adjusted in optimization, and the Adam optimizer76 was used. The number of epochs was set at 200. In total, we sampled the dataset and trained the model 100 times and the final test performance was summarized.

Graph convolutions

As the core component in our method, multilevel graph convolutions were used to encode and represent local spread of air pollutants. Based on Eqs. 1–3 and Supplementary Fig. 1, the convolutional operator in the kth convolution layer is defined to simulate the spatiotemporal dynamics of air pollutant concentration:

$$\frac{\partial {\bf{x}}}{\partial t}=\frac{{{\bf{x}}{\boldsymbol{{\prime} }}}_{i}-{{\bf{x}}}_{i}}{\varDelta t}=-{{\bf{v}}}^{k}\sum _{j\in N(i)}{\bar{A}}_{ij}^{k}({{\bf{x}}}_{j}-{{\bf{x}}}_{i})+{{\bf{p}}}^{k}\sum _{j\in N(i)}({{\bf{x}}}_{j}-{{\bf{x}}}_{i})+{\rho }^{k}{{\bf{x}}}_{i}$$
(5)

where xi (or hi) are the features (or the output) of the kth layer, representing the concentration of a specific air pollutant, Ci in the target location, i, \({{\bf{x}}{\boldsymbol{{\prime} }}}_{i}\) (or \({{\bf{h}}{\boldsymbol{{\prime} }}}_{i}\)) is the updated status after an evolution timestep of mini-batch gradient descent, Δt, vk is the velocity matrix, pk can be interpreted as the proxy diffusion matrix, \({\bar{A}}_{ij}^{k}\) is the neighborhood weight matrix, and N(i) is the set of spatial or spatiotemporal neighbors for i.

In order to strengthen representation learning, \({\bar{A}}_{ij}^{k}\) is simplified as the product of a learnable velocity (vk) and the weight, \({w}_{{d}_{ij}}^{k}\) (the reciprocal distance from a neighbor to the target node), and two additional learnable matrixes, \({{\bf{W}}}_{c}^{k}\) and \({{\bf{W}}}_{u}^{k}\) are introduced for the current input and the gradient-derived change, respectively. Empirical thresholds can be set up for the learning parameters such as vk, pk and ρk. Thus, we define the update of the kth graph convolution after an evolution timestep (Δt) as:

$${{\bf{x}}{^{{\prime} }}}_{i}={{\bf{W}}}_{c}^{k}{{\bf{x}}}_{i}-{{\bf{W}}}_{u}^{k}\left(-{{\bf{v}}}^{k}\sum _{j\in N(i)}{w}_{{d}_{ij}^{k}}({{\bf{x}}}_{j}-{{\bf{x}}}_{i})+{{\bf{p}}}^{k}\sum _{j\in N(i)}({{\bf{x}}}_{j}-{{\bf{x}}}_{i})+{\rho }^{k}{{\bf{x}}}_{i}\right)$$
(6)

The learning process involved the process of graph convolutions to simulate the spatiotemporal dynamics of air pollutants, as shown in Supplementary Fig. 1. Through the evolutional training driven by the measured concentration and covariate data, the model learned optimal parameters to inverse the target estimates. Accordingly, the minibatch forward propagation algorithm of Multilevel local Graph convolutions for Air pollutant spread was developed and shown as follows (Supplementary Algorithm 1).

Although GCs can efficiently extract spread features of air pollution, multiple graph convolutions may result in over-smoothing due to Laplacian smoothing. Thus, the full residual deep layers with multiple optimal attention layers was used as the downstream components for the output of neighborhood features to enhance the influence of non-traffic factors from emission, transformation and deposition on the final predictions. Here, the original input features and the aggregate feature were concatenated as the input for the full residual deep network. The residual error of continuity equation (\({e}_{2}\)) was introduced to the loss function. For details, please see Supplementary Note 2 and Supplementary Algorithm 2.

Training, testing and sensitivity analysis

In total, approximately 18% (288) of monitoring sites were randomly selected for site-based testing, and all samples from these sites were used for independent testing only and not for training. Then, from the remaining samples, we randomly selected 78% for training and 22% for regular testing stratified by spatiotemporal factors (province and month). We chose the samples so that the proportion of the total samples for regular and site-based tests was similar to the proportion ( ~ 0.37) of the unique test samples in the bootstrap 0.632 rule39. In semi-supervised training, all 56 variables were used for model learning; all the training, regular testing and predicting samples were used to construct the graph convolutions to simulate fluid dynamics, and encouraged to satisfy the continuity equations in PDE residuals; only the training samples were used to minimize the MSE loss between observations and predictions. Theoretically, it is possible to utilize site-based testing samples in model training, encouraging them to satisfy the continuity equations. However, in practice, we opted not to include these site-based samples in our model training to avoid testing bias and ensure a fair comparison. To fairly compare our method with other machine learning methods, the same training, regular testing, and site-based testing samples were used for each method, and the hyperparameters of each model were optimally set. In addition, in order to avoid the influence of random noise, we trained 100 models and summarized their average performance for each air pollutant and each method.

We undertook a rigorous comparison of our method against several machine learning techniques commonly employed in similar domains. The methods included GraphSAGE (local graph)34, full residual deep network36, random forest77, XGBoost78, non-linear generalized additive model (GAM)79, and spatial statistical models such as ordinary kriging and regression kriging80. To ensure a fair comparison, we used the same set of covariates for all the machine learning and statistical regression methods, with the exception of ordinary kriging. For ordinary kriging, it relies solely on constructing the variogram of spatial dependency from neighbors and does not utilize any covariates for predictions. In order to identify the most suitable hyperparameters, we conducted a grid search for each method. The hyperparameter search was based on suggested value ranges specific to each method. Supplementary Table 2 provides a succinct overview of each method, including its default value or value range for the hyperparameters, as well as the search step. After obtaining the optimal hyperparameter values for each method, we proceeded to compare their performance using appropriate evaluation metrics including total and time series metrics. This approach allowed us to gain insights into the strengths and weaknesses of each technique and assess how our method measures up against these established alternatives.

For the sensitivity analysis of missing MERRA2-GMI pollutants, we also used the same training, regular testing and site-based testing samples. But we removed 28 variables of MERRA2-GMI pollutants (shown in the gray background in Supplementary Table 1) and kept the vertical meteorological variables and other variables. We developed our hybrid deep graph learning framework using PyTorch Geometric (Version 2.0), a specialized software tool known for its efficiency81. Model training, testing, and prediction were carried out in a server environment equipped with 128GB memory, 16 Intel(R) Xeon(R) E5–2620 CPUs, and three NVIDIA 1080Ti GPUs (12GB memory). On average, training and testing a national model of an air pollutant took approximately 3–4 days (72–96 h) with the support of a GPU. To expedite the process, we utilized parallel computing, allowing us to complete the training and testing of three national models simultaneously. For daily national predictions of each air pollutant spanning from 2015 to 2018, it took approximately 10 days. Again, parallel computing was employed to accelerate the national predictions of air pollutants.