Size Matters: Finding the Most Informative Set of Window Lengths

Lijffijt, Jefrey; Papapetrou, Panagiotis; Puolamäki, Kai

doi:10.1007/978-3-642-33486-3_29

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7524)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

Event sequences often contain continuous variability at different levels. In other words, their properties and characteristics change at different rates, concurrently. For example, the sales of a product may slowly become more frequent over a period of several weeks, but there may be interesting variation within a week at the same time. To provide an accurate and robust “view” of such multi-level structural behavior, one needs to determine the appropriate levels of granularity for analyzing the underlying sequence. We introduce the novel problem of finding the best set of window lengths for analyzing discrete event sequences. We define suitable criteria for choosing window lengths and propose an efficient method to solve the problem. We give examples of tasks that demonstrate the applicability of the problem and present extensive experiments on both synthetic data and real data from two domains: text and DNA. We find that the optimal sets of window lengths themselves can provide new insight into the data, e.g., the burstiness of events affects the optimal window lengths for measuring the event frequencies.
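To make the setting concrete, the following is a minimal sketch of measuring event frequency in a discrete (binary) event sequence at several window lengths at once. It only illustrates the kind of multi-scale view the abstract describes; it is not the selection method proposed in the paper, and the function names and the toy sequence are hypothetical.

```python
from typing import Dict, List, Sequence


def windowed_frequencies(events: Sequence[int], window: int) -> List[float]:
    """Relative frequency of events (1s) in every full sliding window.

    `events` is assumed to be a 0/1 sequence; position i of the result
    covers events[i : i + window].
    """
    if window <= 0 or window > len(events):
        raise ValueError("window must be between 1 and len(events)")
    count = sum(events[:window])
    freqs = [count / window]
    # Slide the window one step at a time, updating the count incrementally.
    for i in range(window, len(events)):
        count += events[i] - events[i - window]
        freqs.append(count / window)
    return freqs


def frequencies_at_scales(
    events: Sequence[int], windows: Sequence[int]
) -> Dict[int, List[float]]:
    """Compute one frequency profile per window length."""
    return {w: windowed_frequencies(events, w) for w in windows}


# Hypothetical toy sequence: events become denser towards the end.
events = [1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0]
profiles = frequencies_at_scales(events, [2, 6])
```

A short window reacts to local bursts, while a long window tracks the slower trend; which lengths to use together is the question the paper addresses.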

This work was supported by the Finnish Centre of Excellence for Algorithmic Data Analysis Research (ALGODAN).

Author information

Authors and Affiliations

Department of Information and Computer Science, Aalto University, Finland
Jefrey Lijffijt, Panagiotis Papapetrou & Kai Puolamäki
Department of Computer Science and Information Systems, University of London, Birkbeck, UK
Panagiotis Papapetrou

Editor information

Editors and Affiliations

Intelligent Systems Laboratory, University of Bristol, Merchant Venturers Building, Woodland Road, BS8 1UB, Bristol, UK
Peter A. Flach
Intelligent Systems Laboratory, University of Bristol, Merchant Venturers Building, Woodland Road, BS8 1UB, Bristol, UK
Tijl De Bie & Nello Cristianini

Cite this paper

Lijffijt, J., Papapetrou, P., Puolamäki, K. (2012). Size Matters: Finding the Most Informative Set of Window Lengths. In: Flach, P.A., De Bie, T., Cristianini, N. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2012. Lecture Notes in Computer Science, vol 7524. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33486-3_29

DOI: https://doi.org/10.1007/978-3-642-33486-3_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33485-6
Online ISBN: 978-3-642-33486-3
eBook Packages: Computer Science, Computer Science (R0)
