A Scalable Adaptive Sampling Based Approach for Big Data Classification

Djouzi, Kheyreddine; Beghdad-Bey, Kadda; Amamra, Abdenour

doi:10.1007/978-3-031-12097-8_7


Part of the book series: Lecture Notes in Networks and Systems ((LNNS,volume 513))

Included in the following conference series:

International Conference on Computing Systems and Applications

Abstract

Big Data analysis faces two major issues. On one side is the adaptation and scaling of data analysis techniques to the Big Data level, for example through the implementation of distributed Machine Learning algorithms. On the other side is the reduction of data sets so that processing can be performed with existing Machine Learning techniques. The latter consists of determining the smallest sufficient training set size that achieves the same accuracy as the entire available dataset. The proposed approach deals with selecting the number of instances that need to be presented to data mining algorithms. GDAS is one of the adaptive sampling algorithms that can scale down the data. In this paper, an improved GDAS algorithm based on the BLB technique and the ScaSRS algorithm is presented, which offers substantially better scalability and performance in terms of processing time. As validated by experiments on various datasets, our approach achieves a marked improvement in efficiency, and also in the resulting accuracy, over previous works.
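
The idea of searching for a smallest sufficient training set can be illustrated with a generic progressive sampling loop. The sketch below is not the GDAS algorithm or the improved variant proposed in the paper, since the abstract does not give their details. It assumes a geometric growth schedule, a simple accuracy-plateau stopping test, a toy nearest-centroid classifier, and synthetic one-dimensional data; all function names and parameters (`make_blobs`, `adaptive_sample_size`, `start`, `growth`, `tol`) are hypothetical.

```python
import random


def make_blobs(n, seed=0):
    """Synthetic two-class data: 1-D Gaussians centred at -2 and +2 (illustrative only)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = 2.0 if label == 1 else -2.0
        data.append((rng.gauss(centre, 1.0), label))
    return data


def fit_centroids(train):
    """Toy nearest-centroid model: the mean feature value of each class seen in train."""
    sums = {0: 0.0, 1: 0.0}
    counts = {0: 0, 1: 0}
    for x, y in train:
        sums[y] += x
        counts[y] += 1
    return {c: sums[c] / counts[c] for c in sums if counts[c] > 0}


def accuracy(centroids, test):
    """Fraction of test points whose nearest centroid carries the true label."""
    correct = 0
    for x, y in test:
        pred = min(centroids, key=lambda c: abs(x - centroids[c]))
        correct += pred == y
    return correct / len(test)


def adaptive_sample_size(train, test, start=50, growth=2, tol=0.005, seed=0):
    """Grow the training sample geometrically until accuracy stops changing.

    Assumed stopping rule (a stand-in, not the paper's criterion): stop when
    the accuracy change between two consecutive sample sizes is below tol,
    or when the whole training set has been used.
    """
    rng = random.Random(seed)
    shuffled = train[:]
    rng.shuffle(shuffled)  # a prefix of a shuffled list is a simple random sample
    size = min(start, len(shuffled))
    prev_acc = None
    while True:
        acc = accuracy(fit_centroids(shuffled[:size]), test)
        if prev_acc is not None and abs(acc - prev_acc) < tol:
            return size, acc
        if size == len(shuffled):
            return size, acc
        prev_acc = acc
        size = min(size * growth, len(shuffled))
```

Calling `adaptive_sample_size(train, test)` on a train/test split of `make_blobs(6000)` returns the sample size at which the plateau test fired together with the accuracy at that size, which can then be compared against the accuracy obtained from the full training set.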

Author information

Authors and Affiliations

Polytechnic Military School, Algiers, Algeria
Kheyreddine Djouzi, Kadda Beghdad-Bey & Abdenour Amamra

Corresponding author

Correspondence to Kheyreddine Djouzi.

Editor information

Editors and Affiliations

Department of Computer Science, Ecole Militaire Polytechnique, Algiers, Algeria
Mustapha Reda Senouci
Department of Computer Science, Ecole Militaire Polytechnique, Algiers, Algeria
Said Yacine Boulahia
Department of Computer Science, Ecole Militaire Polytechnique, Algiers, Algeria
Mohamed Akrem Benatia

Cite this paper

Djouzi, K., Beghdad-Bey, K., Amamra, A. (2022). A Scalable Adaptive Sampling Based Approach for Big Data Classification. In: Senouci, M.R., Boulahia, S.Y., Benatia, M.A. (eds) Advances in Computing Systems and Applications. CSA 2022. Lecture Notes in Networks and Systems, vol 513. Springer, Cham. https://doi.org/10.1007/978-3-031-12097-8_7

DOI: https://doi.org/10.1007/978-3-031-12097-8_7
Published: 28 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-12096-1
Online ISBN: 978-3-031-12097-8
eBook Packages: Intelligent Technologies and Robotics (R0)
