Forecasting skewed biased stochastic ozone days: analyses, solutions and beyond

Zhang, Kun; Fan, Wei

doi:10.1007/s10115-007-0095-1

Regular Paper
Published: 06 July 2007

Knowledge and Information Systems, Volume 14, pages 299–326 (2008)

Kun Zhang¹ & Wei Fan²

Abstract

Most work on skewed, stochastic, high-dimensional, and biased datasets implicitly addresses each of these problems separately. Recently, we were approached by the Texas Commission on Environmental Quality (TCEQ) to help build highly accurate ozone level alarm forecasting models for the Houston area, where these technical difficulties come together in a single problem. The key characteristics that make this problem challenging and interesting are that the dataset (1) is sparse (72 features, and 2 or 5% positives depending on the criteria for “ozone days”), (2) evolves over time from year to year, (3) is limited in size (7 years, or around 2,500 data entries), (4) contains a large number of irrelevant features, and (5) is biased in terms of “sample selection bias”, and that (6) the true model is stochastic as a function of measurable factors. Besides solving a difficult application problem, this dataset offers a unique opportunity to explore new and existing data mining techniques, and to provide experience, guidance, and solutions for similar problems. Our main technical focus is on how to estimate reliable probabilities given both sample selection bias and a large number of irrelevant features, and on how to choose the most reliable decision threshold to predict an unknown future with a different distribution. On the application side, our chosen approach (bagging probabilistic decision trees and random decision trees) is 20% higher in recall (it correctly detects 1–3 more ozone days, depending on the year) and 10% higher in precision (15–30 fewer false alarm days per year) than state-of-the-art methods used by air quality control scientists, and these results are significant for TCEQ. On the technical side of data mining, extensive empirical results demonstrate that, at least for this problem and probably for other problems with similar characteristics, these two straightforward non-parametric methods can provide significantly more accurate and reliable solutions than a number of sophisticated and well-known algorithms, such as SVM and AdaBoost, among many others.
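The abstract's two technical questions, estimating probabilities with an ensemble and then choosing a decision threshold on skewed data, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the rule "best precision subject to a minimum recall", and the toy scores below are all assumptions made for illustration, and the sketch does not address the distribution shift between years that the paper studies.

```python
from typing import List, Sequence, Tuple


def average_probabilities(per_model_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-day probability estimates of several ensemble members."""
    n_models = len(per_model_scores)
    return [sum(day) / n_models for day in zip(*per_model_scores)]


def precision_recall(scores: Sequence[float], labels: Sequence[int],
                     threshold: float) -> Tuple[float, float]:
    """Precision and recall when every day with score >= threshold is flagged."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def choose_threshold(scores: Sequence[float], labels: Sequence[int],
                     min_recall: float) -> float:
    """Pick the threshold with the best precision among those reaching min_recall.

    Candidate thresholds are the observed scores; ties in precision are
    broken in favor of the higher threshold (fewer alarms).
    """
    best = None  # (precision, threshold)
    for t in sorted(set(scores)):
        p, r = precision_recall(scores, labels, t)
        if r >= min_recall and (best is None or (p, t) >= best):
            best = (p, t)
    if best is None:
        raise ValueError("no threshold reaches the requested recall")
    return best[1]


# Hypothetical validation data: 10 days, only 2 ozone days (label 1).
labels = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
# Made-up probability estimates from three hypothetical trees.
tree_a = [0.0, 0.1, 0.0, 0.2, 0.1, 0.3, 0.0, 0.4, 0.2, 0.9]
tree_b = [0.1, 0.0, 0.0, 0.1, 0.2, 0.2, 0.1, 0.3, 0.3, 0.8]
tree_c = [0.0, 0.1, 0.1, 0.0, 0.0, 0.4, 0.0, 0.5, 0.1, 0.7]

avg = average_probabilities([tree_a, tree_b, tree_c])
default_pr = precision_recall(avg, labels, 0.5)   # conventional 0.5 cut-off
tuned_t = choose_threshold(avg, labels, min_recall=1.0)
tuned_pr = precision_recall(avg, labels, tuned_t)
print("threshold 0.5 ->", default_pr)
print("tuned threshold", round(tuned_t, 3), "->", tuned_pr)
```

On this toy data the conventional 0.5 cut-off misses one of the two ozone days, while a lower threshold chosen from the validation scores catches both. With so few positives, averaged probability estimates for true ozone days can sit well below 0.5, which is one reason threshold choice matters on skewed data.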

Author information

Authors and Affiliations

Department of Computer Science, Xavier University, New Orleans, LA, USA
Kun Zhang
IBM T. J. Watson Research, Hawthorne, NY, USA
Wei Fan

Corresponding author

Correspondence to Wei Fan.

About this article

Cite this article

Zhang, K., Fan, W. Forecasting skewed biased stochastic ozone days: analyses, solutions and beyond. Knowl Inf Syst 14, 299–326 (2008). https://doi.org/10.1007/s10115-007-0095-1

Received: 22 March 2007
Accepted: 28 April 2007
Published: 06 July 2007
Issue Date: March 2008
DOI: https://doi.org/10.1007/s10115-007-0095-1
