Abstract
Event sequences often vary continuously at several levels at once: their properties and characteristics change concurrently at different rates. For example, the sales of a product may slowly become more frequent over a period of several weeks while also showing interesting variation within each week. To provide an accurate and robust “view” of such multi-level structural behavior, one needs to determine the appropriate levels of granularity for analyzing the underlying sequence. We introduce the novel problem of finding the best set of window lengths for analyzing discrete event sequences. We define suitable criteria for choosing window lengths and propose an efficient method to solve the problem. We give examples of tasks that demonstrate the applicability of the problem and present extensive experiments on both synthetic data and real data from two domains: text and DNA. We find that the optimal sets of window lengths can themselves provide new insight into the data; for example, the burstiness of events affects the optimal window lengths for measuring event frequencies.
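The sales example can be made concrete with a small sketch. The code below is not the window-selection method proposed in the paper; it only illustrates, under stated assumptions, why a single window length can miss part of the structure. The function names `sliding_frequencies` and `multi_scale_view`, the candidate window lengths, and the toy sales data are all hypothetical and chosen for illustration.

```python
from typing import Dict, List, Sequence


def sliding_frequencies(events: Sequence[int], window: int) -> List[float]:
    """Relative event frequency in every full sliding window of length `window`.

    `events` is a 0/1 sequence (1 = an event occurred at that position).
    """
    if window <= 0 or window > len(events):
        raise ValueError("window must be between 1 and len(events)")
    count = sum(events[:window])          # events in the first window
    freqs = [count / window]
    for i in range(window, len(events)):
        # Slide by one position: add the entering item, drop the leaving one.
        count += events[i] - events[i - window]
        freqs.append(count / window)
    return freqs


def multi_scale_view(events: Sequence[int],
                     windows: Sequence[int]) -> Dict[int, List[float]]:
    """Frequency profiles of the same sequence at several window lengths."""
    return {w: sliding_frequencies(events, w) for w in windows}


# Toy data (made up): 4 weeks of daily sale indicators. Sales cluster at the
# end of each week, and each week has one more sale day than the previous one.
events: List[int] = []
for week in range(4):
    events += [0] * (5 - week) + [1] * (2 + week)

# Hypothetical candidate window lengths: 3 days, 1 week, 2 weeks.
view = multi_scale_view(events, [3, 7, 14])

for w, freqs in view.items():
    print(f"window={w:2d}  min={min(freqs):.2f}  max={max(freqs):.2f}")
```

In this toy sequence, the short window swings between the extremes within each week, while the one-week and two-week windows smooth out the weekly pattern and mainly reflect the slow increase. Choosing which of such lengths to keep is the problem the paper addresses.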
This work was supported by the Finnish Centre of Excellence for Algorithmic Data Analysis Research (ALGODAN).
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lijffijt, J., Papapetrou, P., Puolamäki, K. (2012). Size Matters: Finding the Most Informative Set of Window Lengths. In: Flach, P.A., De Bie, T., Cristianini, N. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2012. Lecture Notes in Computer Science, vol 7524. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33486-3_29
DOI: https://doi.org/10.1007/978-3-642-33486-3_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33485-6
Online ISBN: 978-3-642-33486-3
eBook Packages: Computer Science, Computer Science (R0)