Abstract
As the amount of multimedia data grows daily, thanks to cheaper storage devices and an increasing number of information sources, machine learning algorithms face ever larger datasets. When the original data is huge, small samples are preferred for many applications; this is typically the case for multimedia applications. But a simple random sample may not give satisfactory results, because random fluctuations in the sampling process can leave it unrepresentative of the entire data set. The difficulty is particularly apparent when small sample sizes are needed. Fortunately, a good sampling set for training can improve the final results significantly. In KDD’03 we proposed EASE, which outputs a sample based on its ‘closeness’ to the original sample. Reported results show that EASE outperforms simple random sampling (SRS). In this paper we propose EASIER, which extends EASE in two ways. (1) EASE is a halving algorithm, i.e., to achieve the required sample ratio it starts from a suitably large initial sample and iteratively halves it. EASIER, on the other hand, does away with the repeated halving by directly obtaining the required sample ratio in one iteration. (2) EASE was shown to work on the IBM QUEST dataset, which is a categorical count data set. EASIER, in addition, is shown to work on continuous data of image and audio features. We have successfully applied EASIER to image classification and audio event identification applications. Experimental results show that EASIER outperforms SRS significantly.
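The claim that small simple random samples suffer from random fluctuations can be illustrated with a minimal sketch. It is not EASE or EASIER: the synthetic data, the helper names, and the root-mean-square gap between item frequencies used as a 'closeness' measure are all assumptions made for illustration, since the abstract does not define the distance the authors use.

```python
import random

def item_frequencies(transactions, items):
    """Fraction of transactions that contain each item."""
    n = len(transactions)
    return {i: sum(1 for t in transactions if i in t) / n for i in items}

def discrepancy(sample, data, items):
    """Root-mean-square gap between item frequencies in sample and data.
    Illustrative closeness measure only (an assumption, not the paper's)."""
    fs = item_frequencies(sample, items)
    fd = item_frequencies(data, items)
    return (sum((fs[i] - fd[i]) ** 2 for i in items) / len(items)) ** 0.5

def mean_srs_discrepancy(data, items, ratio, trials, rng):
    """Average discrepancy of simple random samples drawn at a given ratio."""
    k = max(1, int(len(data) * ratio))
    total = sum(discrepancy(rng.sample(data, k), data, items) for _ in range(trials))
    return total / trials

rng = random.Random(0)  # fixed seed so the run is repeatable
items = list(range(20))
# Synthetic count data (hypothetical): item i appears with probability 0.1 + 0.02 * i.
data = [frozenset(i for i in items if rng.random() < 0.1 + 0.02 * i)
        for _ in range(2000)]

results = {r: mean_srs_discrepancy(data, items, r, trials=30, rng=rng)
           for r in (0.01, 0.05, 0.25)}
for r, d in results.items():
    print(f"ratio={r:.2f}  mean discrepancy={d:.4f}")
```

On this synthetic data the mean discrepancy shrinks as the sample ratio grows, which is the effect the abstract describes: the smaller the sample, the less reliably SRS reproduces the statistics of the full data set.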
References
Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of International Conference on Management of Data
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of International Conference on Very Large Databases
Angluin D (1988) Queries and concept learning. Mach Learn 2(4):319–342
Astashyn A (2004) Deterministic data reduction methods for transactional data sets. Master's thesis
Atlas L, Cohn D, Ladner R, El-Sharkawi MA, Marks RJ II (1990) Training connectionist networks with queries and selective sampling. In: Advances in neural information processing systems, vol 2, Morgan Kaufmann Publishers Inc., pp 566–573
Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
Brönnimann H, Chen B, Dash M, Haas P, Scheuermann P (2003) Efficient data reduction with EASE. In: Proceedings of 9th International Conference on Knowledge Discovery and Data Mining, pp 59–68
Chapelle O, Haffner P, Vapnik VN (1999) Support vector machine for histogram based image classification. IEEE Trans Neur Netw 10(5):1055–1064
Chawla N, Eschrich S, Hall LO (2001) Creating ensembles of classifiers. In: Proceedings of International Conference on Data Mining, pp 580–581
Chen B, Haas P, Scheuermann P (2002) A new two-phase sampling based algorithm for discovering association rules. In: Proceedings of International Conference on Knowledge Discovery and Data Mining
Cohn DA, Ghahramani Z, Jordan MI (1995) Active learning with statistical models. In: Advances in neural information processing systems, vol 7, The MIT Press, pp 705–712
Dougherty J, Kohavi R, Sahami M (1995) Supervised and unsupervised discretization of continuous features. In: International Conference on Machine Learning, pp 194–202
Duan LY, Xu M, Chua TS, Tian Q, Xu CS (2003) A mid-level representation framework for semantic sports video analysis. In: Proceedings of ACM Multimedia
Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: Proceedings of International Conference on Management of Data
ISO/IEC15938-8/FDIS3. Information Technology—Multimedia Content Description Interface—Part 8: Extraction and use of MPEG-7 descriptions
Iyengar VS, Apte C, Zhang T (2000) Active learning using adaptive resampling. In: Proceedings of International Conference on Knowledge Discovery and Data Mining, pp 92–98
Jin R, Yan R, Hauptmann A (2003) Image classification using a bigram model. In: AAAI Spring Symposium on Intelligent Multimedia Knowledge Management
Lewis DD, Catlett J (1994) Heterogeneous uncertainty sampling for supervised learning. In: Cohen WW, Hirsh H (eds) Proceedings of 11th International Conference on Machine Learning, New Brunswick, US. Morgan Kaufmann Publishers, San Francisco, US, pp 148–156
Lewis DD, Gale WA (1994) A sequential algorithm for training text classifiers. In: W. Bruce Croft and Cornelis J. van Rijsbergen (eds) Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval, Dublin, IE. Springer Verlag, Heidelberg, DE, pp 3–12
Manjunath BS, Salembier P, Sikora T (2002) Introduction to MPEG-7. John Wiley & Sons, Ltd
Meek C, Thiesson B, Heckerman D (2002) The learning-curve sampling method applied to model-based clustering. J Mach Learn Res 2(3):397–418
Nepal S, Srinivasan U, Reynolds G (2001) Automatic detection of goal segments in basketball videos. In: Proceedings of ACM Multimedia, Los Angeles, CA
Ojala T, Aittola M, Matinmikko E (2002) Empirical evaluation of MPEG-7 XM color descriptors in content-based retrieval of semantic image categories. In: Proceedings of 16th International Conference on Pattern Recognition, Quebec, Canada, pp 1021–1024
Plutowski M, White H (1993) Selecting concise training sets from clean data. IEEE Trans Neur Netw 4(2):305–318
Rui Y, Gupta A, Acero A (2000) Automatically extracting highlights for TV baseball programs. In: Proceedings of ACM Multimedia, pp 105–115
Saar-Tsechansky M, Provost F (2001) Active learning for class probability estimation and ranking. In: Proceedings of 17th International Joint Conference on Artificial Intelligence, pp 911–920
Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of 8th ACM International Conference on Knowledge Discovery and Data Mining
Scheffer T, Decomain C, Wrobel S (2001) Active hidden Markov models for information extraction. In: Proceedings of the International Symposium on Intelligent Data Analysis
Tong S, Koller D (2000) Support vector machine active learning with applications to text classification. In: Langley P (ed) Proceedings of 17th International Conference on Machine Learning, Stanford, US. Morgan Kaufmann Publishers, San Francisco, US, pp 999–1006
Vitter JS (1985) Random sampling with a reservoir. ACM Trans Mathem Softw 11(1):37–57
Wang S, Dash M, Chia L-T (2005) Efficient sampling for image application. In: Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining
Wang S, Xu M, Chia L-T, Dash M (2005) EASIER sampling for audio event identification. In: Proceedings of International Conference on Multimedia and Expo
Xu M, Duan L-Y, Cai J, Chia L-T, Xu C-S, Tian Q (2004) HMM-based audio keyword generation. In: Proceedings of Pacific Conference on Multimedia, vol 3, pp 566–574
Xu M, Duan L-Y, Chia L-T, Xu C-S (2004) Audio keywords generation for sports video analysis. In: Proceedings of ACM Multimedia
Young S et al (2002) The HTK Book (for HTK Version 3.1). Cambridge University Engineering Department
Author information
Surong Wang received the B.E. and M.E. degrees from the School of Information Engineering, University of Science and Technology Beijing, China, in 1999 and 2002 respectively. She is currently studying toward the Ph.D. degree at the School of Computer Engineering, Nanyang Technological University, Singapore. Her research interests include multimedia data processing, image processing and content-based image retrieval.
Manoranjan Dash obtained Ph.D. and M.Sc. (Computer Science) degrees from the School of Computing, National University of Singapore. He has worked extensively in academic and research institutes and has published more than 30 research papers (mostly refereed) in various reputable machine learning and data mining journals, conference proceedings, and books. His research interests include machine learning and data mining, and their applications in bioinformatics, image processing, and GPU programming. Before joining the School of Computer Engineering (SCE), Nanyang Technological University, Singapore, as Assistant Professor, he worked as a postdoctoral fellow at Northwestern University. He is a member of IEEE and ACM. He has served as a program committee member of many conferences and is on the editorial board of “International journal of Theoretical and Applied Computer Science.”
Liang-Tien Chia received the B.S. and Ph.D. degrees from Loughborough University, in 1990 and 1994, respectively. He is an Associate Professor in the School of Computer Engineering, Nanyang Technological University, Singapore. He has recently been appointed as Head, Division of Computer Communications, and he also holds the position of Director, Centre for Multimedia and Network Technology. His research interests include image/video processing and coding, multimodal data fusion, multimedia adaptation/transmission and multimedia over the Semantic Web. He has published over 80 research papers.
Cite this article
Wang, S., Dash, M., Chia, LT. et al. Efficient data reduction in multimedia data. Appl Intell 25, 359–374 (2006). https://doi.org/10.1007/s10489-006-0112-1
DOI: https://doi.org/10.1007/s10489-006-0112-1