Abstract
The suffix array is a powerful data structure used mainly for pattern detection in strings. The main disadvantage of a full suffix array is its quadratic O(n²) space requirement when the actual suffixes must be stored. In our previous work [39], we introduced the All Repeated Patterns Detection (ARPaD) algorithm and the Moving Longest Expected Repeated Pattern (MLERP) process. The former detects all repeated patterns in a string using a partition of the full suffix array, and the latter can analyze large strings regardless of their size. Furthermore, the notion of the Longest Expected Repeated Pattern (LERP), also introduced by the authors in previous work, reduces the space required for the full suffix array to linear O(n). So far, however, the LERP value has had to be specified in an ad hoc manner, based on experimental or empirical values. To overcome this problem, this paper proves the Probabilistic Existence of LERP theorem and introduces a formula for an accurate upper-bound estimate of the LERP value that uses only the length of the string and the size of the alphabet from which the string is constructed. The importance of this method is that it yields an optimum upper bound on the LERP value without any preprocessing or prior knowledge of the string's characteristics. Moreover, a new data structure, the LERP Reduced Suffix Array, is defined; it is a variation of the suffix array with the advantage that classification and parallelism can be implemented directly on the data structure. Alternative methodologies must deal with the very common problem of fitting the data structure into memory or onto disk before time-efficient pattern detection methods can be applied.
The proposed methodology restates the above-mentioned problem so that smaller classes of it can be distributed across different systems, where current state-of-the-art techniques such as parallelism and cloud computing can be applied using advanced DBMSs capable of storing and analyzing big data. The methodology can be implemented by invoking our ARPaD algorithm. Extensive experiments have been conducted on small, comparable strings of the Champernowne constant and DNA, as well as on extremely large strings of π with lengths of up to 68 billion digits. Furthermore, the novelty and superiority of our methodology have also been tested on a real-life application, namely a Distributed Denial of Service (DDoS) attack early warning system.
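To make the space argument in the abstract concrete, the following is a minimal Python sketch of a suffix array whose stored suffixes are cut to a fixed maximum length, together with a naive repeated-pattern scan over it. It is an illustration only: it is not the ARPaD algorithm, the function names are hypothetical, and the `lerp` argument is supplied by hand rather than computed from the paper's upper-bound formula, which the abstract does not state.

```python
from collections import defaultdict


def full_suffix_chars(n: int) -> int:
    """Characters needed to store every suffix of a length-n string explicitly."""
    return n * (n + 1) // 2


def reduced_suffix_chars(n: int, lerp: int) -> int:
    """Characters needed when each stored suffix is cut to at most `lerp` characters."""
    return sum(min(lerp, n - i) for i in range(n))


def lerp_reduced_suffix_array(text: str, lerp: int) -> list[tuple[str, int]]:
    """Lexicographically sorted (truncated suffix, start position) pairs.

    `lerp` is an assumed, hand-picked bound on the longest repeated pattern.
    """
    return sorted((text[i:i + lerp], i) for i in range(len(text)))


def repeated_patterns(text: str, lerp: int) -> dict[str, list[int]]:
    """Naive scan: every substring of length <= lerp that occurs at least twice.

    This brute-force pass only illustrates what the reduced structure can
    answer; it makes no attempt to match the efficiency of ARPaD.
    """
    positions = defaultdict(list)
    for suffix, start in lerp_reduced_suffix_array(text, lerp):
        # Every prefix of a truncated suffix is a substring starting at `start`.
        for k in range(1, len(suffix) + 1):
            positions[suffix[:k]].append(start)
    return {p: sorted(s) for p, s in positions.items() if len(s) > 1}
```

For a string of length n, storing all suffixes explicitly takes n(n+1)/2 characters, whereas the truncated version takes at most n times `lerp` characters, which is linear in n when `lerp` is treated as a fixed bound. The trade-off is that repeated patterns longer than `lerp` are not reported, which is why a reliable upper bound on the LERP value matters.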
References
Apostolico A, Preparata FP (1983) Optimal off-line detection of repetitions in a string. Theor Comput Sci 22:297–315
Apostolico A, Szpankowski W (1992) Self-alignment in words and their applications. J Algorithms 13(3):446–467
Borel E (1909) Les probabilités dénombrables et leurs applications arithmétiques. Rend Circ Mat Palermo 27:247–271
Bailey DH, Crandall RE (2001) On the random character of fundamental constant expansions. Exp Math 10(2):175–190
Bailey DH, Crandall RE (2002) Random generators and normal numbers. Exp Math 11(4):527–546
Bailey DH, Borwein JM, Calude CS, Dinneen MJ, Dumitrescu M, Yee A (2012) An empirical approach to the normality of π. Exp Math 21(4):375–384
Becher V (2012) Turing’s normal numbers: towards randomness. In: Cooper BS, Dawar A, Löwe B (eds) How the world computes: lecture notes in computer science, vol 7318. Springer, pp 35–45
Calude C (1994) Borel normality and algorithmic randomness. In: Rozenberg G, Salomaa A (eds) Development in language theory. World Scientific, Singapore, pp 113–129
Calude C (1995) What is a random string? J Univ Sci 1(1):48–66
Chaitin GJ (1988) Randomness in arithmetic. Sci Am 259(1):80–85
Champernowne D (1933) The construction of decimals normal in the scale of ten. J London Math Soc 8:254–260
Church A (1940) On the concept of a random sequence. Bull Amer Math Soc 46(2):130–135
Copeland AH, Erdős P (1946) Note on normal numbers. Bull Amer Math Soc 52:857–860
Dasgupta A (2011) Mathematical foundations of randomness. In: Gabbay DM, Thagard P, Woods J (eds) Philosophy of statistics. North Holland, Saint Louis, pp 641–710
Davenport H, Erdős P (1952) Note on normal decimals. Canad J Math 4:58–63
Devroye L, Szpankowski W, Rais B (1992) A note on the height of suffix trees. SIAM J Comput 21(1):48–53
Franek F, Smyth WF, Tang Y (2003) Computing all repeats using suffix arrays. J Autom Lang Comb 8(4):579–591
Gog S, Moffat A, Culpepper S, Turpin A, Wirth A (2013) Large-scale pattern search using reduced-space on-disk suffix arrays. arXiv:1303.6481v1
Guo D, Hu X, Xie F, Wu X (2013) Pattern matching with wildcards and gap-length constraints based on a centrality-degree graph. Appl Intell 39:57–74
Hardy GH, Wright EM (1960) An introduction to the theory of numbers, 4th edn. Oxford University Press
Kärkkäinen J, Sanders P, Burkhardt S (2006) Linear work suffix array construction. J ACM (JACM) 53(6):918–936
Karlin S, Ghandour G, Ost F, Tavaré S, Korn L (1983) New approaches for computer analysis of nucleic acid sequences. Proc Natl Acad Sci USA 80:5660–5664
Khoshnevisan D (2006) Normal numbers are normal. Clay Mathematics Institute Annual Report 15(2006):27–31
Ko P, Aluru S (2003) Space efficient linear time construction of suffix arrays. In: Proceedings of the 14th annual conference on Combinatorial pattern matching, pp 200–210
Long CT (1957) Note on normal numbers. Pac J Math 7(2):1163–1165
Manber U, Myers G (1990) Suffix arrays: a new method for on-line string searches. In: Proceedings of the first annual ACM-SIAM symposium on discrete algorithms, pp 319–327
Niven I, Zuckerman H (1951) On the definition of normal numbers. Pac J Math 1(1):103–109
Orlandi A, Venturini R (2011) Space-efficient substring occurrence estimation. In: Proceedings of the 30th principles of database systems PODS, pp 95–106
Phoophakdee B, Zaki M (2007) Genome-scale disk-based suffix tree indexing. In: Proceedings of international conference on management of data SIGMOD ’07, pp 833–844
Puglisi SJ, Smyth WF, Yusufu M (2008) Fast optimal algorithms for computing all the repeats in a string. In: Proceedings of PSC, pp 161–169
Schürmann KB, Stoye J (2005) An incomplex algorithm for fast suffix array construction. In: Proceedings of the 7th workshop on algorithm engineering and experiments and the 2nd workshop on analytic algorithmics and combinatorics (ALENEX/ANALCO 2005), pp 77–85
Sinha R, Moffat A, Puglisi S, Turpin A (2008) Improving suffix array locality for fast pattern matching on disk. In: Proceedings of international conference on management of data SIGMOD ’08, pp 661–672
Wagon S (1985) Is Pi normal? Math Intell 7(3):65–67
Weiner P (1973) Linear pattern matching algorithms. In: SWAT ’73 proceedings of the 14th annual symposium on switching and automata theory (swat 1973), pp 1–11
Wu Y, Wang L, Ren J, Ding W, Wu X (2014) Mining sequential patterns with periodic wildcards. Appl Intell 41:99–116
Xylogiannopoulos K, Karampelas P, Alhajj R (2012) Periodicity data mining in time series using suffix arrays. In: Proceedings of IEEE intelligent systems IS’12, pp 172–181
Xylogiannopoulos K, Karampelas P, Alhajj R (2012) Minimization of suffix array’s storage capacity for periodicity detection in time series. In: Proceedings of IEEE international conference in tools with artificial intelligence
Xylogiannopoulos K, Karampelas P, Alhajj R (2014) Early DDoS detection based on data mining techniques. In: Proceedings of 8th workshop in information security theory and practice (WISTP), pp 190–199
Xylogiannopoulos K, Karampelas P, Alhajj R (2014) Analyzing very large time series using suffix arrays. Appl Intell 41(3):941–955
Xylogiannopoulos K, Karampelas P, Alhajj R (2014) Experimental analysis on the normality of π, e, φ, sqrt(2) using advanced data-mining techniques. Exp Math 23(2):105–128
Yee A (2013) Y-cruncher – a multi-threaded Pi-program [Online]. Available: http://www.numberworld.org/y-cruncher/
UCLA, (2006, Feb 26). http://www.lasr.cs.ucla.edu/ddos/traces/public/attacktrace2/udp/
Cite this article
Xylogiannopoulos, K.F., Karampelas, P. & Alhajj, R. Repeated patterns detection in big data using classification and parallelism on LERP Reduced Suffix Arrays. Appl Intell 45, 567–597 (2016). https://doi.org/10.1007/s10489-016-0766-2
DOI: https://doi.org/10.1007/s10489-016-0766-2