
Institute of Informatics & Telecommunications, National Center for Scientific Research “Demokritos”, Greece
alevizos.elias@iit.demokritos.gr
https://orcid.org/0000-0002-9260-0024

Department of Maritime Studies, University of Piraeus, Greece and Institute of Informatics & Telecommunications, National Center for Scientific Research “Demokritos”, Greece
a.artikis@unipi.gr
https://orcid.org/0000-0001-6899-4599

Institute of Informatics & Telecommunications, National Center for Scientific Research “Demokritos”, Greece
paliourg@iit.demokritos.gr
https://orcid.org/0000-0001-9629-2367

Copyright: Elias Alevizos, Alexander Artikis and Georgios Paliouras

CCS Concepts: Theory of computation → Formal languages and automata theory; Theory of computation → Pattern matching; Information systems → Data streaming

Complex Event Recognition with Symbolic Register Transducers: Extended Technical Report

Footnote 1: This is the extended technical report for the paper Complex Event Recognition with Symbolic Register Transducers, to be presented at VLDB 2024. Please use the VLDB version, once published, if you need to cite the paper.

Elias Alevizos    Alexander Artikis    Georgios Paliouras
Abstract

We present a system for Complex Event Recognition (CER) based on automata. While multiple such systems have been described in the literature, they typically suffer from a lack of clear and denotational semantics, a limitation which often leads to confusion with respect to their expressive power. In order to address this issue, our system is based on an automaton model which is a combination of symbolic and register automata. We extend previous work on these types of automata, in order to construct a formalism with clear semantics and a corresponding automaton model whose properties can be formally investigated. We call such automata Symbolic Register Transducers ($\mathit{SRT}$). The distinctive feature of $\mathit{SRT}$, compared to previous automaton models used in CER, is that they can encode patterns relating multiple input events from an event stream, without sacrificing rigor and clarity. We study the closure properties of $\mathit{SRT}$ under union, intersection, concatenation, Kleene closure, complement and determinization by extending previous relevant results from the field of languages and automata theory. We show that $\mathit{SRT}$ are closed under various operators, but are not in general closed under complement and they are not determinizable. However, they are closed under these operations when a window operator, quintessential in Complex Event Recognition, is used. We show how $\mathit{SRT}$ can be used in CER in order to detect patterns upon streams of events, using our framework that provides declarative and compositional semantics, and that allows for a systematic treatment of such automata. For $\mathit{SRT}$ to work in pattern detection, we allow them to mark events from the input stream as belonging to a complex event or not, hence the name “transducers”. We also present an implementation of $\mathit{SRT}$ which can perform CER. We compare our $\mathit{SRT}$-based CER engine against other state-of-the-art CER systems and show that it is both more expressive and more efficient.

keywords:
Finite Automata, Regular Expressions, Complex Event Processing, Symbolic Automata

1 Introduction

A Complex Event Recognition (CER) system takes as input a stream of “simple events”, along with a set of patterns, defining relations among the input events, and detects instances of pattern satisfaction, thus producing an output stream of “complex events” [DBLP:journals/vldb/GiatrakosAADG20, DBLP:books/daglib/0017658, DBLP:journals/csur/CugolaM12]. Typically, an event has the structure of a tuple of values which might be numerical or categorical. Time is of critical importance for CER and thus a temporal formalism is used in order to define the patterns to be detected. Such a pattern imposes temporal (and possibly atemporal) constraints on the input events, which, if satisfied, lead to the detection of a complex event. Atemporal constraints may be “local”, applying only to the last event read from the input stream. For example, in streams from temperature sensors, the constraint that the temperature of the last event is higher than some constant threshold would constitute such a local constraint. More commonly, these constraints involve multiple events of the pattern, e.g., the constraint that the temperature of the last event is higher than that of the previous event. Complex events must often be detected with very low latency, which, in certain cases, may even be in the order of a few milliseconds [DBLP:books/daglib/0017658, DBLP:books/daglib/0024062, hedtstuck_complex_2017].
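The distinction between “local” constraints and constraints over multiple events can be illustrated with a minimal Python sketch. The threshold value, the sample readings and the function names below are hypothetical and serve only as an illustration; they are not part of any CER system discussed here.

```python
from typing import List, Sequence

THRESHOLD = 30.0  # hypothetical constant threshold, chosen for illustration


def local_matches(temps: Sequence[float], threshold: float = THRESHOLD) -> List[int]:
    """Positions satisfying a local (unary) constraint: temperature > threshold.

    Only the event at the current position is inspected.
    """
    return [i for i, t in enumerate(temps) if t > threshold]


def relational_matches(temps: Sequence[float]) -> List[int]:
    """Positions satisfying a binary constraint: temperature higher than
    that of the previous event. Evaluating it requires remembering one
    earlier event, which a memoryless automaton cannot do."""
    return [i for i in range(1, len(temps)) if temps[i] > temps[i - 1]]


readings = [28.5, 31.0, 30.5, 33.2]  # made-up sensor readings
print(local_matches(readings))       # [1, 2, 3]
print(relational_matches(readings))  # [1, 3]
```

The second function is the kind of condition that motivates a memory structure: it cannot be decided by looking at the last event alone.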

Automata are of particular interest for the field of CER, because they provide a natural way of handling sequences. As a result, the usual operators of regular expressions, like concatenation, union and Kleene-star, have often been given an implicit temporal interpretation in CER. For example, the concatenation of two events is said to occur whenever the second event is read by an automaton after the first one, i.e., whenever the timestamp of the second event is greater than the timestamp of the first (assuming the input events are temporally ordered). On the other hand, atemporal constraints are not easy to define using classical automata, since they either work without memory or, even if they do include a memory structure, e.g., as with push-down automata, they can only work with a finite alphabet of input symbols. For this reason, the CER community has proposed several extensions of classical automata. These extended automata have the ability to store input events and later retrieve them in order to evaluate whether a constraint is satisfied [DBLP:conf/cidr/DemersGPRSW07, DBLP:conf/sigmod/AgrawalDGI08, DBLP:journals/csur/CugolaM12]. They resemble both register automata [DBLP:journals/tcs/KaminskiF94], through their ability to store events, and symbolic automata [DBLP:conf/cav/DAntoniV17], through the use of predicates on their transitions. They differ from symbolic automata in that predicates apply to multiple events, retrieved from the memory structure that holds previous events. They differ from register automata in that predicates may be more complex than that of (in)equality.

One issue with these CER-specific automata is that their properties have not been systematically investigated, in contrast to models derived directly from the field of languages and automata; see [DBLP:conf/icdt/GrezRU19] for a discussion about the weaknesses of automaton models in CER. Moreover, they sometimes need to impose restrictions on the use of regular expression operators in a pattern, e.g., nesting of Kleene-star operators is not allowed. A recently proposed formal framework for CER attempts to address these issues [DBLP:conf/icdt/GrezRU19]. Its advantage is that it provides a logic for CER patterns, with denotational and compositional semantics, but without imposing severe restrictions on the use of operators. An automaton model is also proposed which may be conceived as a variation of symbolic transducers [DBLP:conf/cav/DAntoniV17]. However, this automaton model can only handle “local” constraints, i.e., the formulas on their transitions are unary and thus are applied only to the last event read. A model which combines symbolic and register automata (called symbolic register automata) has recently been proposed in [DBLP:conf/cav/DAntoniFS019]. However, this work focuses on the more theoretical aspects of the proposed automaton model, without investigating how this model may be applied to CER (e.g., by providing a language appropriate for CER or by examining the effects of windows).

We propose a system for CER, based on an automaton model that is a combination of symbolic and register automata. It has the ability to store events and its transitions have guards in the form of $n$-ary conditions. These conditions may be applied both to the last event and to past events that have been stored. Conditions on multiple events are crucial in CER because they allow us to express many patterns of interest, e.g., an increasing trend in the speed of a vehicle. We call such automata Symbolic Register Transducers ($\mathit{SRT}$). $\mathit{SRT}$ extend the expressive power of symbolic and register automata, by allowing for more complex patterns to be defined and detected on a stream of events. They also extend the power of symbolic register automata, by allowing events in a stream to be marked as belonging to a pattern match or not. This feature is crucial in cases where we need to enumerate all complex events detected at any given timepoint (i.e., exactly report all simple events which compose the complex ones) instead of simply reporting that a complex event has been detected. We also present a language with which we can define patterns for complex events that can then be translated to $\mathit{SRT}$. We call such patterns Symbolic Regular Expressions with Memory and Output ($\mathit{SREMO}$), as an extension of the work presented in [DBLP:journals/jcss/LibkinTV15], where Regular Expressions with Memory ($\mathit{REM}$) are defined and investigated. $\mathit{REM}$ are extensions of classical regular expressions with which some of the terminal symbols of an expression can be stored and later be compared for (in)equality. $\mathit{SREMO}$ allow for more complex conditions to be used, besides those of (in)equality. They additionally allow each terminal sub-expression to mark an element as belonging or not to the string/match that is to be recognized, thus acting as transducers.

Our contributions may then be summarized as follows:

  • We present a CER system based on a formal framework with denotational and compositional semantics, where patterns may be written as Symbolic Regular Expressions with Memory and Output ($\mathit{SREMO}$).

  • We show how this framework subsumes, in terms of expressive power, previous similar attempts. It allows for nesting operators and selection strategies. It also allows $n$-ary expressions to be used as conditions in patterns, thus opening the way for the detection of relational patterns.

  • We extend previous work on automata and present a computational model for patterns written in $\mathit{SREMO}$, Symbolic Register Transducers ($\mathit{SRT}$), whose main feature is that it supports relations between multiple events in a pattern. Constraints with multiple events are essential in CER, since they are required in order to capture many patterns of interest, e.g., an increasing or decreasing trend in stock prices. $\mathit{SRT}$ also have the ability to mark exactly those simple events comprising a complex one.

  • We study the closure properties of $\mathit{SRT}$. By extending previous results from automata theory, we show that, in the general case, $\mathit{SRT}$ are closed under the most common operators (union, intersection, concatenation and Kleene-star), but not under complement and determinization. Failure of closure under complement implies that negation cannot be arbitrarily (i.e., in a compositional manner) used in CER patterns. The negative result about determinization implies that certain techniques (like forecasting) requiring deterministic automata are not applicable.

  • We show that, by using windows, $\mathit{SRT}$ are able to retain nice closure properties, i.e., windowed $\mathit{SRT}$ are closed under complement and are determinizable. Windows are an indispensable operator in CER because, among other things, they limit the search space for pattern matching.

  • We describe the implementation of a CER engine with $\mathit{SRT}$ at its core and present relevant experimental results. Our engine is both more efficient than other engines and supports a language that is more expressive than that of other systems.

Example 1.1.
Table 1: Example of a stream.
type   | B   | B   | B    | S   | S    | B
id     | 1   | 1   | 2    | 1   | 1    | 2
price  | 22  | 24  | 32   | 70  | 68   | 33
volume | 300 | 225 | 1210 | 760 | 2000 | 95
index  | 1   | 2   | 3    | 4   | 5    | 6

We now introduce an example to provide intuition. The example is that of a set of stock market ticks. A stream is a sequence of input events, where each such event is a tuple of the form $(\mathit{type},\mathit{id},\mathit{price},\mathit{volume})$. The first attribute ($\mathit{type}$) is the type of transaction: $S$ for SELL and $B$ for BUY. The second one ($\mathit{id}$) is an integer identifier, unique for each company. It has a finite set of possible values. The third one ($\mathit{price}$) is a real-valued number for the price of a given stock. Finally, the fourth one ($\mathit{volume}$) is a natural number referring to the volume of the transaction. Table 1 shows an example of such a stream. We assume that events are temporally ordered and their order is implicitly provided through the index. We also assume that concurrent events cannot occur, i.e., each index is unique to a single event.
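As an illustration, the stream of Table 1 could be represented in Python as a list of typed tuples, with the index left implicit in the list position. The names `Event`, `STREAM` and `stream_slice` are hypothetical and chosen only for this sketch; the slice helper assumes the 1-based, inclusive indexing used for stream slices in this report.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    type: str     # "B" for BUY, "S" for SELL
    id: int       # company identifier
    price: float  # stock price
    volume: int   # transaction volume


# The six events of Table 1, in temporal order (index 1 to 6).
STREAM: List[Event] = [
    Event("B", 1, 22.0, 300),
    Event("B", 1, 24.0, 225),
    Event("B", 2, 32.0, 1210),
    Event("S", 1, 70.0, 760),
    Event("S", 1, 68.0, 2000),
    Event("B", 2, 33.0, 95),
]


def stream_slice(stream: List[Event], i: int, j: int) -> List[Event]:
    """Return the events from index i to index j (1-based, both inclusive)."""
    if not (1 <= i <= j <= len(stream)):
        raise IndexError("slice indices out of range")
    return stream[i - 1 : j]


# The two SELL events of company 1 occupy indices 4 and 5.
print(stream_slice(STREAM, 4, 5))
```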

In Table 2 we have gathered the notation that we use throughout the paper, along with a brief description of every symbol.

Table 2: Notation used throughout the paper.
Symbol | Meaning
$\mathcal{V}$, $\mathcal{U}$ | vocabulary, universe
$\mathcal{L}$ ($\mathcal{L}\subseteq\mathcal{U}^{*}$) | a language over $\mathcal{U}$
$t_{i}\in\mathcal{U}$ | term / character
$S=t_{1},t_{2},\cdots$, $S_{i..j}=t_{i},\cdots,t_{j}$ | stream / stream “slice” from index $i$ to $j$
$f(t_{1},\cdots,t_{m})$ | function
$P$, $\top$ | relation, unary TRUE relation
$\phi$ | formula
$\mathcal{M}$ | $\mathcal{V}$-structure
$\mathcal{M}\models\phi$ | $\mathcal{M}$ models $\phi$
$R=\{r_{1},\cdots,r_{k}\}$ | register variables
$v:R\hookrightarrow\mathcal{U}$ | valuation
$F(r_{1},\cdots,r_{k})$ | set of all valuations on $R$
$\sharp$, $\sim$ | contents of empty register, automaton head
$(u,v)\models\phi$ | condition $\phi$ satisfied by element $u$ and valuation $v$
$\epsilon$ | the “empty” symbol
$\bullet$, $\otimes$ | outputs
$e_{1}+e_{2}$, $e_{1}\cdot e_{2}$, $e^{*}$, $!e$ | regular disjunction / concatenation / iteration / negation
$\circlearrowleft e$, $@e$ | skip-till-any-match, skip-till-next-match operators
$e^{[1..w]}$ | windowed expression with window size $w$
$(e,S,M,v)\vdash v^{\prime}$ | string $S$ and match $M$ on expression $e$ with initial valuation $v$ induce valuation $v^{\prime}$
$\mathit{Lang}(e)$ | language accepted by expression $e$
$\mathit{Match}(e,S)$ | matches detected by $e$ on $S$
$T$ | automaton / transducer
$Q$, $q^{s}$, $Q^{f}$ | automaton states / start state / final states
$\Delta$, $\delta$ | automaton transition function / transition
$W$ | write registers of a transition
$c=[j,q,v]$ | automaton configuration ($j$ current position, $q$ current state, $v$ current valuation)
$[j,q,v]\overset{\delta}{\rightarrow}[j^{\prime},q^{\prime},v^{\prime}]$ | configuration succession
$\varrho=[1,q_{1},v_{1}]\overset{\delta_{1}}{\rightarrow}\cdots\overset{\delta_{k}}{\rightarrow}[k+1,q_{k+1},v_{k+1}]$ | run of automaton $T$ over stream $S_{1..k}$
$\mathit{Lang}(T)$ | language accepted by automaton $T$
$\mathit{Match}(T,S)$ | matches detected by $T$ on $S$

2 Related Work

Due to their ability to naturally handle sequences of characters, automata have been extensively adopted in CER, where they are adapted in order to handle streams composed of tuples. Typical cases of CER systems that employ automata are the Chronicle Recognition System [DBLP:conf/kr/Ghallab96, DBLP:conf/ijcai/DoussonM07], Cayuga [DBLP:conf/edbt/DemersGHRW06, DBLP:conf/cidr/DemersGPRSW07], TESLA [DBLP:conf/debs/CugolaM10], SASE [DBLP:conf/sigmod/AgrawalDGI08, DBLP:conf/sigmod/ZhangDI14], CORE [DBLP:conf/icdt/GrezRU19, DBLP:journals/pvldb/BucchiGQRV22] and Wayeb [DBLP:journals/vldb/AlevizosAP22, DBLP:conf/lpar/AlevizosAP18]. There also exist systems that do not employ automata as their computational model, e.g., there are logic-based systems [DBLP:journals/jair/TsilionisAP22, DBLP:conf/kr/MantenoglouKA23] or systems that use trees [DBLP:conf/sigmod/MeiM09], but the standard operators of concatenation, union and Kleene-star are quite common and they may be considered as a reasonable set of core operators for CER. The abundance of different CER systems, employing various computational models and using various formalisms has recently led to some attempts to provide a unifying framework [DBLP:conf/icdt/GrezRU19, DBLP:journals/corr/Halle17]. Specifically, in [DBLP:conf/icdt/GrezRU19], a set of core CER operators is identified, a formal framework is proposed that provides denotational semantics for CER patterns, and a computational model is described for capturing such patterns. For an overview of CER languages, see [DBLP:journals/vldb/GiatrakosAADG20], and for a general review of CER systems, see [DBLP:journals/csur/CugolaM12]. In this Section, we present previous related work along three axes. First, we discuss previous theoretical work on automata that is related to CER. We subsequently present previous automata-based CER systems. 
Finally, we briefly discuss some solutions which are beyond the scope of CER in the strict sense of the term, but have characteristics that are of interest to CER. Table 3 summarizes our discussion and provides a compact way to compare our proposal against previous solutions.

System $\sigma_{1}$ $\sigma_{n}$ $\vee$ $\wedge$ $\neg$ ; * D E S.P. Remarks
Theory
Register automata Sc Selection only for unary (in-)equality.
Symbolic automata Sc
Symbolic register automata Sc
Automata-based CER solutions
SASE all Iteration and selection strategies cannot be nested. $\vee$, $\wedge$ and $\neg$ possible in principle but not available in source code. Soundness issues with selection strategies.
Cayuga ? Stam Re-subscription with multiple automata for nested expressions.
FlinkCEP ? ? ? Soundness issues with selection strategies and iteration.
Esper ? ? all Mixture of trees, automata and Allen’s interval algebra.
CORE ? ? all
Wayeb (symbolic automata) all
Beyond CER
AFA ? ? ? ? Sc Partial support of negation. $\sigma_{n}$ with a single register.
MATCH_RECOGNIZE ? ? all Supported features depend on the implementation.
Our proposal
Wayeb (SRT) all $\neg$ and determinization supported only for windowed expressions.
Table 3: Comparing state-of-the-art with our proposal.
$\sigma_{1}$: unary selection, $\sigma_{n}$: $n$-ary selection, $\wedge$: intersection, $\vee$: union, $\neg$: negation, ;: sequence, *: iteration, D: determinizability, E: enumeration, S.P.: selection policies, Stam: skip-till-any-match, Stnm: skip-till-next-match, Sc: strict-contiguity.

2.1 Extended automaton models: theory

Outside the field of CER, research on automata has evolved towards various directions. Besides the well-known push-down automata that can store elements from a finite set to a stack, there have appeared other automaton models with memory, such as register automata, pebble automata and data automata [DBLP:journals/tcs/KaminskiF94, DBLP:journals/tocl/NevenSV04, DBLP:journals/tocl/BojanczykDMSS11]. For a review, see [DBLP:conf/csl/Segoufin06]. Such models are especially useful when the input alphabet cannot be assumed to be finite, as is often the case with CER. Register automata (initially called finite-memory automata) constitute one of the earliest such proposals [DBLP:journals/tcs/KaminskiF94]. At each transition, a register automaton may choose to store its current input (more precisely, the current input’s data payload) to one of a finite set of registers. A transition is followed if the current input is equal to the contents of some register. With register automata, it is possible to recognize strings constructed from an infinite alphabet, through the use of (in)equality comparisons among the data carried by the current input and the data stored in the registers. However, register automata do not always have nice closure properties, e.g., they are not closed under determinization. For an extensive study of register automata, see [DBLP:journals/jcss/LibkinTV15, DBLP:conf/lpar/LibkinV12]. We build on the framework presented in [DBLP:journals/jcss/LibkinTV15, DBLP:conf/lpar/LibkinV12] in order to construct register automata with the ability to handle “arbitrary” structures, besides those containing only (in)equality relations.
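To make the register mechanism concrete, the following Python sketch hand-codes a one-register automaton for a classic language over an unbounded alphabet: words of length at least two whose last symbol equals the first. It is a simplified illustration under our own naming (`first_equals_last`, states `q0`, `q1`, `q2`), not the formal definition from the cited works.

```python
from typing import Hashable, Iterable, Optional


def first_equals_last(word: Iterable[Hashable]) -> bool:
    """Simulate a deterministic automaton with a single register.

    q0: initial state; the first symbol is written to the register.
    q1: the last symbol read differs from the register contents.
    q2: the last symbol read equals the register contents (accepting).
    """
    state = "q0"
    register: Optional[Hashable] = None  # stands for the empty register
    for symbol in word:
        if state == "q0":
            register = symbol   # write transition: store the current input
            state = "q1"
        elif symbol == register:
            state = "q2"        # equality test against the register succeeds
        else:
            state = "q1"        # inequality: fall back to the rejecting state
    return state == "q2"


print(first_equals_last([5, 9, 5]))          # True
print(first_equals_last([5, 9, 7]))          # False
print(first_equals_last([10**12, 3, 10**12]))  # True, no finite alphabet needed
```

No finite-state automaton over a fixed finite alphabet can express this check for arbitrary data values, since it would need one state per possible first symbol.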

Another model that is of interest for CER is the symbolic automaton, which allows CER patterns to apply constraints on the attributes of events. Automata that have predicates on their transitions were already proposed in [DBLP:journals/grammars/NoordG01]. This initial idea has recently been expanded and more fully investigated in symbolic automata [DBLP:conf/lpar/VeanesBM10, DBLP:conf/wia/Veanes13, DBLP:conf/cav/DAntoniV17]. In symbolic automata, transitions are equipped with formulas constructed from a Boolean algebra. A transition is followed if its formula, applied to the current input, evaluates to TRUE. Contrary to register automata, symbolic automata have nice closure properties, but their formulas are unary and thus can only be applied to a single element from the input string.
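A minimal sketch of a symbolic automaton is given below, assuming that events are Python dictionaries and that guards are arbitrary unary Python predicates. The class name `SymbolicAutomaton` and the example pattern are hypothetical; a real symbolic automaton is defined over an effective Boolean algebra, which this sketch does not model.

```python
from typing import Callable, Dict, List, Set, Tuple

Predicate = Callable[[Dict], bool]          # unary guard on the current element
Transition = Tuple[int, Predicate, int]     # (source, guard, target)


class SymbolicAutomaton:
    def __init__(self, start: int, finals: Set[int], transitions: List[Transition]):
        self.start = start
        self.finals = finals
        self.transitions = transitions

    def accepts(self, word: List[Dict]) -> bool:
        """Nondeterministic simulation: track the set of reachable states."""
        current = {self.start}
        for element in word:
            current = {
                dst
                for (src, guard, dst) in self.transitions
                if src in current and guard(element)  # guard sees one element only
            }
        return bool(current & self.finals)


# Hypothetical pattern: a BUY event immediately followed by an event priced above 50.
automaton = SymbolicAutomaton(
    start=0,
    finals={2},
    transitions=[
        (0, lambda e: e["type"] == "B", 1),
        (1, lambda e: e["price"] > 50, 2),
    ],
)

print(automaton.accepts([{"type": "B", "price": 22}, {"type": "S", "price": 70}]))  # True
```

Each guard receives only the current element, which is why a condition such as “priced above the previous event” is out of reach for this model.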

This is one limitation that we address here. We use Symbolic Regular Expressions with Memory and Output ($\mathit{SREMO}$) and Symbolic Register Transducers ($\mathit{SRT}$), a language and an automaton model respectively, that can handle $n$-ary formulas and be applied for the purposes of CER. With $\mathit{SREMO}$ we can designate which elements of a pattern need to be stored for later evaluation and which must be marked as being part of a match. $\mathit{SREMO}$ can be compiled into $\mathit{SRT}$ whose transitions can apply $n$-ary formulas/conditions (with $n>1$) on multiple elements. As a result, $\mathit{SRT}$ are more expressive than symbolic and register automata, thus being suitable for practical CER applications, while, at the same time, their properties can be systematically investigated, as in standard automata theory. In fact, our model subsumes these two automaton models as special cases. It is also an extension of Symbolic Register Automata [DBLP:conf/cav/DAntoniFS019], which do not have any output on their transitions and thus cannot enumerate the detected complex events, since they do not have the ability to mark input events as being part of a match. Moreover, the applicability of $\mathit{SRT}$ for CER is studied here for the first time. We show precisely how $\mathit{SRT}$ can be used for CER and how the use of $\mathit{SRT}$ provides expressive power without sacrificing clarity and rigor.
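The combination of registers, $n$-ary guards and output marks can be sketched in Python for one hard-coded pattern over the stock stream of Table 1: a BUY event is stored in a register and marked, and a later SELL event for the same company with a higher price is marked as well. This is only an informal illustration; the function `detect`, its state names and the choice to explore every run (in the spirit of skip-till-any-match) are assumptions of this sketch, and it is neither the formal $\mathit{SRT}$ construction nor the API of any implementation.

```python
from typing import List, NamedTuple, Tuple


class Event(NamedTuple):
    type: str
    id: int
    price: float


def detect(stream: List[Event]) -> List[Tuple[int, ...]]:
    """Return, for every match, the 1-based indices of the marked events."""
    # Each partial run waiting in state q1: (contents of register r1, marked indices).
    waiting: List[Tuple[Event, Tuple[int, ...]]] = []
    matches: List[Tuple[int, ...]] = []
    for j, event in enumerate(stream, start=1):
        # q1 -> q2 (final): binary guard over the current event and register r1;
        # the current event is marked as part of the match.
        for r1, marked in waiting:
            if event.type == "S" and event.id == r1.id and event.price > r1.price:
                matches.append(marked + (j,))
        # q1 -> q1: every waiting run may also skip the event without marking it,
        # so all runs stay in `waiting`.
        # q0 -> q1: unary guard type == "B"; write the event to r1 and mark it.
        if event.type == "B":
            waiting.append((event, (j,)))
    return matches


stream = [
    Event("B", 1, 22.0), Event("B", 1, 24.0), Event("B", 2, 32.0),
    Event("S", 1, 70.0), Event("S", 1, 68.0), Event("B", 2, 33.0),
]
print(detect(stream))  # [(1, 4), (2, 4), (1, 5), (2, 5)]
```

Returning the marked indices, rather than a plain accept/reject answer, is what the transducer output adds over an automaton that only recognizes the pattern.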

We initially presented the results regarding $\mathit{SRT}$ in [DBLP:journals/corr/abs-1804-09999] (we called them Register Match Automata in that report). The difference between that report and the present paper is that now we use a different formalism for expressing patterns at the language level. However, the automaton model remains essentially the same. Automaton models similar to $\mathit{SRT}$ have been independently presented in [DBLP:conf/cav/DAntoniFS019] and [DBLP:journals/corr/abs-2110-04032]. In both cases, the focus was on Symbolic Register Automata, i.e., on automata without any output on their transitions. The former work focused on an extensive theoretical analysis, while the latter on the theoretical applicability of this type of automata for CER, without presenting an implementation.

2.2 Extended automaton models as applied in CER

Automata with registers have been proposed in the past for CER, e.g., in SASE and Cayuga. However, previous systems typically provide operational semantics and it is not always clear a) what operators are allowed, b) in which combinations, and c) what the properties of their automaton models are. For example, SASE’s language seems to support nested Kleene operators. However, this is not the case. SASE constructs automata whose states are linearly ordered. Therefore, Kleene operators can only be applied to single states. They cannot be nested and they cannot contain other expressions, except for single events. As a result, disjunction is also not allowed. Cayuga attempts to address these constraints on its expressive power through the method of resubscription, i.e., expressions which cannot be captured by a single automaton are compiled into multiple automata [demers2005general]. Each sub-automaton can then subscribe to the output of other automata, thus creating a hierarchy of automata. Although this is an interesting solution, the resulting semantics remains ambiguous, since the correctness and limits of this approach have not been thoroughly investigated. Our system does not suffer from these limitations. Its novelty is that it provides formal, compositional semantics which allows us to address all of the above issues. We show that negation is the only problematic operator. The other operators may be arbitrarily combined in a completely compositional manner and each pattern can be compiled into a single automaton, something which has not been previously achieved. CORE [DBLP:conf/icdt/GrezRU19, DBLP:conf/icdt/GrezRUV20] and Wayeb [DBLP:journals/vldb/AlevizosAP22, DBLP:conf/lpar/AlevizosAP18] constitute two more recent automata-based CER systems. CORE automata may be categorized under the class of “unary” symbolic automata (or transducers, to be more precise), i.e., they do not support patterns relating multiple events. The same is true for Wayeb, which also employs “unary” symbolic automata.

2.3 Extended automaton models beyond CER

An adaptation of finite automata in the context of Data Stream Management Systems (which have strong similarities to CER systems) has also been proposed in [DBLP:journals/pvldb/ChandramouliGM10]. These automata are called augmented finite automata (AFA) and are enriched with registers, in order to capture trends. With respect to compositionality, AFA are similar to $\mathit{SRT}$: like $\mathit{SRT}$, they support arbitrary edges and are compositional. On the other hand, AFA have different limitations. Each AFA has a single register (one per active state), whereas there is no such restriction for $\mathit{SRT}$. AFA are thus less expressive than $\mathit{SRT}$. Additionally, AFA are not transducers and cannot enumerate the input events of a complex event. They can report event lifetimes, i.e., the duration of a complex event, whereas $\mathit{SRT}$ can also report individual input events. With AFA, the input events can be reconstructed from the lifetime in a post-processing step, if needed, but this seems to hold only for contiguous patterns. It is unclear whether this is feasible for non-contiguous patterns. Finally, the properties of AFA have not been theoretically studied, for example with respect to determinization and negation. AFA can handle certain instances of negation, but there are strong reasons to suspect that they are not in general closed under complement, as is the case with register automata. In summary, $\mathit{SRT}$ are more expressive than AFA.

Another way to implement CER patterns, in relational databases, is through SQL’s MATCH_RECOGNIZE, a proposed clause that can perform pattern recognition on rows [MRISO, DBLP:journals/dbsk/Petkovic22]. MATCH_RECOGNIZE is very expressive and can in principle capture almost any pattern expressed in a CER language. However, it is uncertain whether it would work in a streaming setting as efficiently as CER systems. Recent work has proposed implementations of MATCH_RECOGNIZE that are more efficient than the one already available in Flink [DBLP:journals/pvldb/ZhuHC23, DBLP:conf/sigmod/KorberGS21]. The proposed optimizations rely on the use of prefiltering and clever indices so that the automaton responsible for pattern recognition is fed only with a small subset of the initial rows. They target the scenario of historical analysis and their extension to a streaming setting is not considered. It still remains an open issue whether and to what extent the proposed optimizations would work for patterns processing events in real time.

3 Symbolic Regular Expressions with Memory and Output

The field of CER has been growing strong for the past 20 years. It is thus no surprise that there is no lack of languages, formalisms and systems from which one may choose according to their needs. As a result, there is considerable variability concerning the most relevant and useful operators of CER patterns, their semantics and the corresponding computational models to be used for the actual detection of complex events. On the one hand, this variability may be viewed as a sign of vigor for the field. On the other hand, the fact that operators and their semantics are sometimes defined informally makes it hard to compare different systems in terms of their expressive capabilities. It also makes it hard to study a single system in itself in a more systematic manner, other than actually running it and observing its behavior.

As an attempt to mitigate these problems, we present and describe a framework for CER which has formal, denotational semantics. We first present a language for CER and discuss its semantics. The main feature of this language is that it allows for most of the common CER operators (such as selection, sequence, disjunction and iteration), without imposing restrictions on how they may be used and nested. Our proposed language can also accommodate n-ary conditions, i.e., we can impose constraints on the patterns which relate multiple events of a stream, e.g., that the number of cells in a simulated tumor at the current timepoint is higher than their number at the previous timepoint. We also discuss the semantics of patterns written in our proposed language and show that these are well-defined. As a result, in order to know whether a given stream contains any complex events corresponding to a given pattern, we do not need to resort to a procedural computational model. The semantics of the language may be studied independently of the chosen computational model. Not only is this feature critical in itself, allowing for a systematic understanding of the use of operators, but it could also be of importance for optimization, which often relies on pattern re-writing, assuming that we can know when two patterns are equivalent without actually having to run their computational models. Previous work on CER has produced systems which are highly expressive (e.g., FlinkCEP [FlinkCEP]), but lack a proper, formal description. Some more recent work ([DBLP:journals/pvldb/BucchiGQRV22]) has attempted to construct a system which is both formal and efficient. However, it does not support n-ary expressions, allowing (non-temporal) constraints which are applied only to the last event read from a stream.
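
The difference between a unary and an n-ary condition can be sketched in a few lines of Python. This is only an informal illustration of the tumor example above, not the language or automaton model proposed in this paper; the function names, the threshold and the sample stream are assumptions made for the example. The unary condition inspects only the current event, whereas the binary one needs a register holding the previous event.

```python
def unary_matches(stream, threshold):
    """Unary condition: depends only on the last event read."""
    return [event for event in stream if event[1] > threshold]


def growth_matches(stream):
    """Binary condition: the current cell count must exceed the count
    stored in a register holding the previous event."""
    register = None  # previous (timepoint, cell_count) event, if any
    matches = []
    for event in stream:
        if register is not None and event[1] > register[1]:
            matches.append((register, event))
        register = event  # overwrite the register with the current event
    return matches


# Hypothetical stream of (timepoint, cell_count) events.
stream = [(0, 100), (1, 120), (2, 110), (3, 150)]
above_threshold = unary_matches(stream, 105)
growth = growth_matches(stream)
```

Here the unary condition selects the events at timepoints 1, 2 and 3, while the binary condition reports the pairs of timepoints (0, 1) and (2, 3), a result that cannot be expressed by looking at each event in isolation.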

Before presenting $\mathit{SRT}$, we first present a high-level formalism for defining CER patterns. We extend the work presented in [DBLP:journals/jcss/LibkinTV15], where the notion of regular expressions with memory ($\mathit{REM}$) was introduced. These regular expressions can store some terminal symbols in order to compare them later against a new input element for (in)equality. One important limitation of