Institute of Informatics & Telecommunications, National Center for Scientific Research “Demokritos”, Greece; alevizos.elias@iit.demokritos.gr; https://orcid.org/0000-0002-9260-0024

Department of Maritime Studies, University of Piraeus, Greece and Institute of Informatics & Telecommunications, National Center for Scientific Research “Demokritos”, Greece; a.artikis@unipi.gr; https://orcid.org/0000-0001-6899-4599

Institute of Informatics & Telecommunications, National Center for Scientific Research “Demokritos”, Greece; paliourg@iit.demokritos.gr; https://orcid.org/0000-0001-9629-2367

Copyright: Elias Alevizos, Alexander Artikis and Georgios Paliouras

CCS concepts: Theory of computation → Formal languages and automata theory; Theory of computation → Pattern matching; Information systems → Data streaming

Complex Event Recognition with Symbolic Register Transducers: Extended Technical Report¹

¹ This is the extended technical report for the paper Complex Event Recognition with Symbolic Register Transducers, to be presented at VLDB 2024. Please use the VLDB version, once published, if you need to cite the paper.

Elias Alevizos Alexander Artikis Georgios Paliouras

Abstract

We present a system for Complex Event Recognition (CER) based on automata. While multiple such systems have been described in the literature, they typically suffer from a lack of clear and denotational semantics, a limitation which often leads to confusion with respect to their expressive power. In order to address this issue, our system is based on an automaton model which is a combination of symbolic and register automata. We extend previous work on these types of automata, in order to construct a formalism with clear semantics and a corresponding automaton model whose properties can be formally investigated. We call such automata Symbolic Register Transducers ( $\mathit{SRT}$ ). The distinctive feature of $\mathit{SRT}$ , compared to previous automaton models used in CER, is that they can encode patterns relating multiple input events from an event stream, without sacrificing rigor and clarity. We study the closure properties of $\mathit{SRT}$ under union, intersection, concatenation, Kleene closure, complement and determinization by extending previous relevant results from the field of languages and automata theory. We show that $\mathit{SRT}$ are closed under various operators, but are not in general closed under complement and they are not determinizable. However, they are closed under these operations when a window operator, quintessential in Complex Event Recognition, is used. We show how $\mathit{SRT}$ can be used in CER in order to detect patterns upon streams of events, using our framework that provides declarative and compositional semantics, and that allows for a systematic treatment of such automata. For $\mathit{SRT}$ to work in pattern detection, we allow them to mark events from the input stream as belonging to a complex event or not, hence the name “transducers”. We also present an implementation of $\mathit{SRT}$ which can perform CER. 
We compare our $\mathit{SRT}$ -based CER engine against other state-of-the-art CER systems and show that it is both more expressive and more efficient.

keywords: Finite Automata, Regular Expressions, Complex Event Processing, Symbolic Automata

1 Introduction

A Complex Event Recognition (CER) system takes as input a stream of “simple events”, along with a set of patterns, defining relations among the input events, and detects instances of pattern satisfaction, thus producing an output stream of “complex events” [DBLP:journals/vldb/GiatrakosAADG20, DBLP:books/daglib/0017658, DBLP:journals/csur/CugolaM12]. Typically, an event has the structure of a tuple of values which might be numerical or categorical. Time is of critical importance for CER and thus a temporal formalism is used in order to define the patterns to be detected. Such a pattern imposes temporal (and possibly atemporal) constraints on the input events, which, if satisfied, lead to the detection of a complex event. Atemporal constraints may be “local”, applying only to the last event read from the input stream. For example, in streams from temperature sensors, the constraint that the temperature of the last event is higher than some constant threshold would constitute such a local constraint. More commonly, these constraints involve multiple events of the pattern, e.g., the constraint that the temperature of the last event is higher than that of the previous event. Complex events must often be detected with very low latency, which, in certain cases, may even be in the order of a few milliseconds [DBLP:books/daglib/0017658, DBLP:books/daglib/0024062, hedtstuck_complex_2017].

Automata are of particular interest for the field of CER, because they provide a natural way of handling sequences. As a result, the usual operators of regular expressions, like concatenation, union and Kleene-star, have often been given an implicit temporal interpretation in CER. For example, the concatenation of two events is said to occur whenever the second event is read by an automaton after the first one, i.e., whenever the timestamp of the second event is greater than the timestamp of the first (assuming the input events are temporally ordered). On the other hand, atemporal constraints are not easy to define using classical automata, since they either work without memory or, even if they do include a memory structure, e.g., as with push-down automata, they can only work with a finite alphabet of input symbols. For this reason, the CER community has proposed several extensions of classical automata. These extended automata have the ability to store input events and later retrieve them in order to evaluate whether a constraint is satisfied [DBLP:conf/cidr/DemersGPRSW07, DBLP:conf/sigmod/AgrawalDGI08, DBLP:journals/csur/CugolaM12]. They resemble both register automata [DBLP:journals/tcs/KaminskiF94], through their ability to store events, and symbolic automata [DBLP:conf/cav/DAntoniV17], through the use of predicates on their transitions. They differ from symbolic automata in that predicates apply to multiple events, retrieved from the memory structure that holds previous events. They differ from register automata in that predicates may be more complex than that of (in)equality.

One issue with these CER-specific automata is that their properties have not been systematically investigated, in contrast to models derived directly from the field of languages and automata; see [DBLP:conf/icdt/GrezRU19] for a discussion about the weaknesses of automaton models in CER. Moreover, they sometimes need to impose restrictions on the use of regular expression operators in a pattern, e.g., nesting of Kleene-star operators is not allowed. A recently proposed formal framework for CER attempts to address these issues [DBLP:conf/icdt/GrezRU19]. Its advantage is that it provides a logic for CER patterns, with denotational and compositional semantics, but without imposing severe restrictions on the use of operators. An automaton model is also proposed which may be conceived as a variation of symbolic transducers [DBLP:conf/cav/DAntoniV17]. However, this automaton model can only handle “local” constraints, i.e., the formulas on their transitions are unary and thus are applied only to the last event read. A model which combines symbolic and register automata (called symbolic register automata) has recently been proposed in [DBLP:conf/cav/DAntoniFS019]. However, this work focuses on the more theoretical aspects of the proposed automaton model, without investigating how this model may be applied to CER (e.g., by providing a language appropriate for CER or by examining the effects of windows).

We propose a system for CER, based on an automaton model that is a combination of symbolic and register automata. It has the ability to store events and its transitions have guards in the form of $n$ -ary conditions. These conditions may be applied both to the last event and to past events that have been stored. Conditions on multiple events are crucial in CER because they allow us to express many patterns of interest, e.g., an increasing trend in the speed of a vehicle. We call such automata Symbolic Register Transducers ( $\mathit{SRT}$ ). $\mathit{SRT}$ extend the expressive power of symbolic and register automata, by allowing for more complex patterns to be defined and detected on a stream of events. They also extend the power of symbolic register automata, by allowing events in a stream to be marked as belonging to a pattern match or not. This feature is crucial in cases where we need to enumerate all complex events detected at any given timepoint (i.e., exactly report all simple events which compose the complex ones) instead of simply reporting that a complex event has been detected. We also present a language with which we can define patterns for complex events that can then be translated to $\mathit{SRT}$ . We call such patterns Symbolic Regular Expressions with Memory and Output ( $\mathit{SREMO}$ ), as an extension of the work presented in [DBLP:journals/jcss/LibkinTV15], where Regular Expressions with Memory ( $\mathit{REM}$ ) are defined and investigated. $\mathit{REM}$ are extensions of classical regular expressions with which some of the terminal symbols of an expression can be stored and later be compared for (in)equality. $\mathit{SREMO}$ allow for more complex conditions to be used, besides those of (in)equality. They additionally allow each terminal sub-expression to mark an element as belonging or not to the string/match that is to be recognized, thus acting as transducers.

Our contributions may then be summarized as follows:

• We present a CER system based on a formal framework with denotational and compositional semantics, where patterns may be written as Symbolic Regular Expressions with Memory and Output ( $\mathit{SREMO}$ ).

• We show how this framework subsumes, in terms of expressive power, previous similar attempts. It allows for nesting operators and selection strategies. It also allows $n$ -ary expressions to be used as conditions in patterns, thus opening the way for the detection of relational patterns.

• We extend previous work on automata and present a computational model for patterns written in $\mathit{SREMO}$ , Symbolic Register Transducers ( $\mathit{SRT}$ ), whose main feature is that it supports relations between multiple events in a pattern. Constraints with multiple events are essential in CER, since they are required in order to capture many patterns of interest, e.g., an increasing or decreasing trend in stock prices. $\mathit{SRT}$ also have the ability to mark exactly those simple events comprising a complex one.

• We study the closure properties of $\mathit{SRT}$ . By extending previous results from automata theory, we show that, in the general case, $\mathit{SRT}$ are closed under the most common operators (union, intersection, concatenation and Kleene-star), but not under complement and determinization. Failure of closure under complement implies that negation cannot be arbitrarily (i.e., in a compositional manner) used in CER patterns. The negative result about determinization implies that certain techniques (like forecasting) requiring deterministic automata are not applicable.

• We show that, by using windows, $\mathit{SRT}$ are able to retain their nice closure properties, i.e., they remain closed under complement and determinization. Windows are an indispensable operator in CER because, among others, they limit the search space for pattern matching.

• We describe the implementation of a CER engine with $\mathit{SRT}$ at its core and present relevant experimental results. Our engine is both more efficient than other engines and supports a language that is more expressive than that of other systems.

Example 1.1.

Table 1: Example of a stream.

type	B	B	B	S	S	B	…
id	1	1	2	1	1	2	…
price	22	24	32	70	68	33	…
volume	300	225	1210	760	2000	95	…
index	1	2	3	4	5	6	…

We now introduce an example to provide intuition. The example is that of a set of stock market ticks. A stream is a sequence of input events, where each such event is a tuple of the form $(\mathit{type},\mathit{id},\mathit{price},\mathit{volume})$ . The first attribute ( $\mathit{type}$ ) is the type of transaction: $S$ for SELL and $B$ for BUY. The second one ( $\mathit{id}$ ) is an integer identifier, unique for each company. It has a finite set of possible values. The third one ( $\mathit{price}$ ) is a real-valued number for the price of a given stock. Finally, the fourth one ( $\mathit{volume}$ ) is a natural number referring to the volume of the transaction. Table 1 shows an example of such a stream. We assume that events are temporally ordered and their order is implicitly provided through the index. We also assume that concurrent events cannot occur, i.e., each index is unique to a single event.
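To make the distinction between local and multi-event constraints concrete, the following minimal Python sketch encodes the stream of Table 1 and evaluates one constraint of each kind. The names `Event`, `is_sell`, `price_increased` and `increasing_pairs`, as well as the single register holding the previous event, are illustrative assumptions and are not part of the formalism or of the engine presented in this paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    type: str      # "B" for BUY, "S" for SELL
    id: int        # company identifier
    price: float
    volume: int
    index: int     # position in the stream (implicit temporal order)


# The six events shown in Table 1.
STREAM: List[Event] = [
    Event("B", 1, 22.0, 300, 1),
    Event("B", 1, 24.0, 225, 2),
    Event("B", 2, 32.0, 1210, 3),
    Event("S", 1, 70.0, 760, 4),
    Event("S", 1, 68.0, 2000, 5),
    Event("B", 2, 33.0, 95, 6),
]


def is_sell(e: Event) -> bool:
    """Local (unary) constraint: inspects only the last event read."""
    return e.type == "S"


def price_increased(e: Event, stored: Event) -> bool:
    """Binary constraint: relates the last event to a stored one."""
    return e.id == stored.id and e.price > stored.price


def increasing_pairs(stream: List[Event]) -> List[Tuple[int, int]]:
    """Index pairs of consecutive events of the same company whose
    price went up. One hypothetical register holds the previous event."""
    register: Optional[Event] = None
    pairs: List[Tuple[int, int]] = []
    for e in stream:
        if register is not None and price_increased(e, register):
            pairs.append((register.index, e.index))
        register = e  # overwrite the register with the last event read
    return pairs
```

On the six events of Table 1, the local constraint holds at indices 4 and 5, while the only consecutive pair of the same company with an increasing price is the one at indices 1 and 2.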

In Table 2 we have gathered the notation that we use throughout the paper, along with a brief description of every symbol.

Table 2: Notation used throughout the paper.

Symbol	Meaning
$\mathcal{V}$ , $\mathcal{U}$	vocabulary, universe
$\mathcal{L}$ ( $\mathcal{L}\subseteq\mathcal{U}^{*}$ )	a language over $\mathcal{U}$
$t_{i}\in\mathcal{U}$	term / character
$S=t_{1},t_{2},\cdots$ , $S_{i..j}=t_{i},\cdots,t_{j}$	stream / stream “slice” from index $i$ to $j$
$f(t_{1},\cdots,t_{m})$	function
$P$ , $\top$	relation, unary TRUE relation
$\phi$	formula
$\mathcal{M}$	$\mathcal{V}$ -structure
$\mathcal{M}\models\phi$	$\mathcal{M}$ models $\phi$
$R=\{r_{1},\cdots,r_{k}\}$	register variables
$v:R\hookrightarrow\mathcal{U}$	valuation
$F(r_{1},\cdots,r_{k})$	set of all valuations on $R$
$\sharp$ , $\sim$	contents of empty register, automaton head
$(u,v)\models\phi$	condition $\phi$ satisfied by element $u$ and valuation $v$
$\epsilon$	the “empty” symbol
$\bullet$ , $\otimes$	outputs
$e_{1}+e_{2}$ , $e_{1}\cdot e_{2}$ , $e^{*}$ , $!e$	regular disjunction / concatenation / iteration / negation
$\circlearrowleft e$ , $@e$	skip-till-any-match, skip-till-next-match operators
$e^{[1..w]}$	windowed expression with window size $w$
$(e,S,M,v)\vdash v^{\prime}$	string $S$ and match $M$ on expression $e$ with initial valuation $v$ induce valuation $v^{\prime}$
$\mathit{Lang}(e)$	language accepted by expression $e$
$\mathit{Match}(e,S)$	matches detected by $e$ on $S$
$T$	automaton / transducer
$Q$ , $q^{s}$ , $Q^{f}$	automaton states / start state / final states
$\Delta$ , $\delta$	automaton transition function / transition
$W$	write registers of a transition
$c=[j,q,v]$	automaton configuration ( $j$ current position, $q$ current state, $v$ current valuation)
$[j,q,v]\overset{\delta}{\rightarrow}[j^{\prime},q^{\prime},v^{\prime}]$	configuration succession
$\varrho=[1,q_{1},v_{1}]\overset{\delta_{1}}{\rightarrow}\cdots\overset{\delta_{k}}{\rightarrow}[k+1,q_{k+1},v_{k+1}]$	run of automaton $T$ over stream $S_{1..k}$
$\mathit{Lang}(T)$	language accepted by automaton $T$
$\mathit{Match}(T,S)$	matches detected by $T$ on $S$

2 Related Work

Due to their ability to naturally handle sequences of characters, automata have been extensively adopted in CER, where they are adapted in order to handle streams composed of tuples. Typical cases of CER systems that employ automata are the Chronicle Recognition System [DBLP:conf/kr/Ghallab96, DBLP:conf/ijcai/DoussonM07], Cayuga [DBLP:conf/edbt/DemersGHRW06, DBLP:conf/cidr/DemersGPRSW07], TESLA [DBLP:conf/debs/CugolaM10], SASE [DBLP:conf/sigmod/AgrawalDGI08, DBLP:conf/sigmod/ZhangDI14], CORE [DBLP:conf/icdt/GrezRU19, DBLP:journals/pvldb/BucchiGQRV22] and Wayeb [DBLP:journals/vldb/AlevizosAP22, DBLP:conf/lpar/AlevizosAP18]. There also exist systems that do not employ automata as their computational model, e.g., there are logic-based systems [DBLP:journals/jair/TsilionisAP22, DBLP:conf/kr/MantenoglouKA23] or systems that use trees [DBLP:conf/sigmod/MeiM09], but the standard operators of concatenation, union and Kleene-star are quite common and they may be considered as a reasonable set of core operators for CER. The abundance of different CER systems, employing various computational models and using various formalisms has recently led to some attempts to provide a unifying framework [DBLP:conf/icdt/GrezRU19, DBLP:journals/corr/Halle17]. Specifically, in [DBLP:conf/icdt/GrezRU19], a set of core CER operators is identified, a formal framework is proposed that provides denotational semantics for CER patterns, and a computational model is described for capturing such patterns. For an overview of CER languages, see [DBLP:journals/vldb/GiatrakosAADG20], and for a general review of CER systems, see [DBLP:journals/csur/CugolaM12]. In this Section, we present previous related work along three axes. First, we discuss previous theoretical work on automata that is related to CER. We subsequently present previous automata-based CER systems. 
Finally, we briefly discuss some solutions which are beyond the scope of CER in the strict sense of the term, but have characteristics that are of interest to CER. Table 3 summarizes our discussion and provides a compact way to compare our proposal against previous solutions.

System	$\sigma_{1}$	$\sigma_{n}$	$\vee$	$\wedge$	$\neg$	;	*	D	E	S.P.	Remarks
Theory
Register automata	✘	✘	✔	✔	✘	✔	✔	✘	✘	Sc	Selection only for unary (in-)equality.
Symbolic automata	✔	✘	✔	✔	✔	✔	✔	✔	✘	Sc
Symbolic register automata	✔	✔	✔	✔	✘	✔	✔	✘	✘	Sc
Automata-based CER solutions
SASE	✔	✔	✘	✘	✔	✔	✔	✘	✔	all	Iteration and selection strategies cannot be nested. $\vee$ , $\wedge$ and $\neg$ possible in principle but not available in source code. Soundness issues with selection strategies
Cayuga	✔	✔	✔	?	✘	✔	✔	✘	✘	Stam	Re-subscription with multiple automata for nested expressions.
FlinkCEP	✔	✔	✔	?	✔	✔	?	✘	✔	?	Soundness issues with selection strategies and iteration.
Esper	✔	✔	✔	?	✔	✔	✔	?	✔	all	Mixture of trees, automata and Allen’s interval algebra.
CORE	✔	✘	✔	?	?	✔	✔	✔	✔	all
Wayeb (symbolic automata)	✔	✘	✔	✔	✔	✔	✔	✔	✘	all
Beyond CER
AFA	✔	?	✔	?	?	✔	✔	?	✘	Sc	Partial support of negation. $\sigma_{n}$ with a single register.
MATCH_RECOGNIZE	✔	✔	✘	?	✔	✔	✘	?	✘	all	Supported features depend on the implementation.
Our proposal
Wayeb (SRT)	✔	✔	✔	✔	✔	✔	✔	✔	✔	all	$\neg$ and determinization supported only for windowed expressions.

Table 3: Comparing state-of-the-art with our proposal.

$\sigma_{1}$ : unary selection, $\sigma_{n}$ : $n$ -ary selection, $\wedge$ : intersection, $\vee$ : union, $\neg$ : negation, ;: sequence, *: iteration, D: determinizability, E: enumeration, S.P.: selection policies, Stam: skip-till-any-match, Stnm: skip-till-next-match, Sc: strict-contiguity.

2.1 Extended automaton models: theory

Outside the field of CER, research on automata has evolved towards various directions. Besides the well-known push-down automata that can store elements from a finite set to a stack, there have appeared other automaton models with memory, such as register automata, pebble automata and data automata [DBLP:journals/tcs/KaminskiF94, DBLP:journals/tocl/NevenSV04, DBLP:journals/tocl/BojanczykDMSS11]. For a review, see [DBLP:conf/csl/Segoufin06]. Such models are especially useful when the input alphabet cannot be assumed to be finite, as is often the case with CER. Register automata (initially called finite-memory automata) constitute one of the earliest such proposals [DBLP:journals/tcs/KaminskiF94]. At each transition, a register automaton may choose to store its current input (more precisely, the current input’s data payload) to one of a finite set of registers. A transition is followed if the current input is equal to the contents of some register. With register automata, it is possible to recognize strings constructed from an infinite alphabet, through the use of (in)equality comparisons among the data carried by the current input and the data stored in the registers. However, register automata do not always have nice closure properties, e.g., they are not closed under determinization. For an extensive study of register automata, see [DBLP:journals/jcss/LibkinTV15, DBLP:conf/lpar/LibkinV12]. We build on the framework presented in [DBLP:journals/jcss/LibkinTV15, DBLP:conf/lpar/LibkinV12] in order to construct register automata with the ability to handle “arbitrary” structures, besides those containing only (in)equality relations.
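The register mechanism described above can be illustrated with a small Python sketch. It is a simplified, deterministic toy with a single register, and not the formal definition of [DBLP:journals/tcs/KaminskiF94]. The function name `first_equals_last` and the state labels are hypothetical. The recognized language, non-empty words whose last symbol equals the first one, is chosen only because it cannot be expressed with a finite alphabet of symbols.

```python
from typing import Hashable, Iterable, Optional


def first_equals_last(word: Iterable[Hashable]) -> bool:
    """Accept non-empty words whose last symbol equals the first one.

    The alphabet may be infinite: the automaton never enumerates it and
    only compares the current input with the contents of its register.
    """
    register: Optional[Hashable] = None  # single register, initially empty
    state = "start"
    for symbol in word:
        if state == "start":
            register = symbol        # store the current input
            state = "match"          # a one-symbol word is accepted
        elif symbol == register:     # equality test against the register
            state = "match"
        else:                        # inequality test against the register
            state = "no_match"
    return state == "match"
```

The sketch uses only (in)equality tests between the current input and the register, which is exactly the kind of comparison available to register automata.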

Another model that is of interest for CER is the symbolic automaton, which allows CER patterns to apply constraints on the attributes of events. Automata that have predicates on their transitions were already proposed in [DBLP:journals/grammars/NoordG01]. This initial idea has recently been expanded and more fully investigated in symbolic automata [DBLP:conf/lpar/VeanesBM10, DBLP:conf/wia/Veanes13, DBLP:conf/cav/DAntoniV17]. In symbolic automata, transitions are equipped with formulas constructed from a Boolean algebra. A transition is followed if its formula, applied to the current input, evaluates to TRUE. Contrary to register automata, symbolic automata have nice closure properties, but their formulas are unary and thus can only be applied to a single element from the input string.
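As a rough illustration of transitions guarded by predicates, the following Python sketch simulates a small non-deterministic automaton over real-valued inputs. The guards are plain Python callables rather than formulas of a Boolean algebra, and the pattern itself (a reading above 30.0 immediately followed by one below 10.0) together with the names `TRANSITIONS` and `accepts` are assumptions made purely for illustration.

```python
from typing import Callable, Iterable, List, Set, Tuple

Predicate = Callable[[float], bool]
Transition = Tuple[int, Predicate, int]  # (source state, guard, target state)

# Hypothetical pattern: some reading above 30.0 immediately followed by
# a reading below 10.0, anywhere in the input.
TRANSITIONS: List[Transition] = [
    (0, lambda x: True, 0),       # skip any prefix
    (0, lambda x: x > 30.0, 1),   # high reading
    (1, lambda x: x < 10.0, 2),   # low reading right after it
    (2, lambda x: True, 2),       # skip any suffix
]
START, FINAL = 0, {2}


def accepts(word: Iterable[float]) -> bool:
    """Simulate the automaton by tracking the set of reachable states."""
    current: Set[int] = {START}
    for x in word:
        # Follow every transition whose guard holds on the current input.
        current = {dst for (src, guard, dst) in TRANSITIONS
                   if src in current and guard(x)}
    return bool(current & FINAL)
```

Every guard in this sketch inspects only the current input element, which is the unary restriction discussed above.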

This is one limitation that we address here. We use Symbolic Regular Expressions with Memory and Output ( $\mathit{SREMO}$ ) and Symbolic Register Transducers ( $\mathit{SRT}$ ), a language and an automaton model respectively, that can handle $n$ -ary formulas and be applied for the purposes of CER. With $\mathit{SREMO}$ we can designate which elements of a pattern need to be stored for later evaluation and which must be marked as being part of a match. $\mathit{SREMO}$ can be compiled into $\mathit{SRT}$ whose transitions can apply $n$ -ary formulas/conditions (with $n{>}1$ ) on multiple elements. As a result, $\mathit{SRT}$ are more expressive than symbolic and register automata, and thus suitable for practical CER applications, while, at the same time, their properties can be systematically investigated, as in standard automata theory. In fact, our model subsumes these two automaton models as special cases. It is also an extension of Symbolic Register Automata [DBLP:conf/cav/DAntoniFS019], which do not have any output on their transitions and thus cannot enumerate the detected complex events, since they do not have the ability to mark input events as being part of a match. Moreover, the applicability of $\mathit{SRT}$ for CER is studied here for the first time. We show precisely how $\mathit{SRT}$ can be used for CER and how the use of $\mathit{SRT}$ provides expressive power without sacrificing clarity and rigor.

We initially presented the results regarding $\mathit{SRT}$ in [DBLP:journals/corr/abs-1804-09999] (we called them Register Match Automata in that report). The difference between that report and the present paper is that we now use a different formalism for expressing patterns at the language level. However, the automaton model remains essentially the same. Automaton models similar to $\mathit{SRT}$ have been independently presented in [DBLP:conf/cav/DAntoniFS019] and [DBLP:journals/corr/abs-2110-04032]. In both cases, the focus was on Symbolic Register Automata, i.e., on automata without any output on their transitions. The former work focused on an extensive theoretical analysis, while the latter focused on the theoretical applicability of this type of automata for CER, without presenting an implementation.

2.2 Extended automaton models as applied in CER

Automata with registers have been proposed in the past for CER, e.g., in SASE and Cayuga. However, previous systems typically provide operational semantics and it is not always clear a) which operators are allowed, b) in which combinations, and c) what the properties of their automaton models are. For example, SASE’s language seems to support nested Kleene operators. However, this is not the case. SASE constructs automata whose states are linearly ordered. Therefore, Kleene operators can only be applied to single states. They cannot be nested and they cannot contain other expressions, except for single events. As a result, disjunction is also not allowed. Cayuga attempts to address these constraints on its expressive power through the method of resubscription, i.e., expressions which cannot be captured by a single automaton are compiled into multiple automata [demers2005general]. Each sub-automaton can then subscribe to the output of other automata, thus creating a hierarchy of automata. Although this is an interesting solution, the resulting semantics remains ambiguous, since the correctness and limits of this approach have not been thoroughly investigated. Our system does not suffer from these limitations. Its novelty is that it provides formal, compositional semantics which allows us to address all of the above issues. We show that negation is the only problematic operator. The other operators may be arbitrarily combined in a completely compositional manner and each pattern can be compiled into a single automaton, something which has not been previously achieved. CORE [DBLP:conf/icdt/GrezRU19, DBLP:conf/icdt/GrezRUV20] and Wayeb [DBLP:journals/vldb/AlevizosAP22, DBLP:conf/lpar/AlevizosAP18] constitute two more recent automata-based CER systems. CORE automata may be categorized under the class of “unary” symbolic automata (or transducers, to be more precise), i.e., they do not support patterns relating multiple events. 
The same is true for Wayeb, which also employs “unary” symbolic automata.

2.3 Extended automaton models beyond CER

An adaptation of finite automata in the context of Data Stream Management Systems (which have strong similarities to CER systems) has also been proposed in [DBLP:journals/pvldb/ChandramouliGM10]. These automata are called augmented finite automata (AFA) and are enriched with registers, in order to capture trends. With respect to compositionality, AFA are similar to $\mathit{SRT}$ : like $\mathit{SRT}$ , they support arbitrary edges and are compositional. On the other hand, AFA have different limitations. Each AFA has a single register (one per active state), whereas there is no such restriction for $\mathit{SRT}$ . AFA are thus less expressive than $\mathit{SRT}$ . Additionally, AFA are not transducers and cannot enumerate the input events of a complex event. They can report event lifetimes, i.e., the duration of a complex event, whereas $\mathit{SRT}$ can also report individual input events. With AFA, the input events can be reconstructed from the lifetime in a post-processing step, if needed, but this seems to hold only for contiguous patterns. It is unclear whether this is feasible for non-contiguous patterns. Finally, the properties of AFA have not been theoretically studied, for example with respect to determinization and negation. AFA can handle certain instances of negation, but there are strong reasons to suspect that they are not in general closed under complement, as is the case with register automata. In summary, $\mathit{SRT}$ are more expressive than AFA.

Another way to implement CER patterns, in relational databases, is through SQL’s MATCH_RECOGNIZE, a proposed clause that can perform pattern recognition on rows [MRISO, DBLP:journals/dbsk/Petkovic22]. MATCH_RECOGNIZE is very expressive and can in principle capture almost any pattern expressed in a CER language. However, it is uncertain whether it would work in a streaming setting as efficiently as CER systems. Recent work has proposed implementations of MATCH_RECOGNIZE that are more efficient than the one already available in Flink [DBLP:journals/pvldb/ZhuHC23, DBLP:conf/sigmod/KorberGS21]. The proposed optimizations rely on the use of prefiltering and clever indices so that the automaton responsible for pattern recognition is fed only with a small subset of the initial rows. They target the scenario of historical analysis and their extension to a streaming setting is not considered. It still remains an open issue whether and to what extent the proposed optimizations would work for patterns processing events in real time.

3 Symbolic Regular Expressions with Memory and Output

The field of CER has been growing strong for the past 20 years. It is thus no surprise that there is no lack of languages, formalisms and systems from which one may choose according to their needs. As a result, there is considerable variability concerning the most relevant and useful operators of CER patterns, their semantics and the corresponding computational models to be used for the actual detecting of complex events. On the one hand, this variability may be viewed as a sign of vigor for the field. On the other hand, the fact that operators and their semantics are sometimes defined informally makes it hard to compare different systems in terms of their expressive capabilities. It also makes it hard to study a single system in itself in a more systematic manner, other than actually running it and observing its behavior.

As an attempt to mitigate these problems, we present and describe a framework for CER which has formal, denotational semantics. We first present a language for CER and discuss its semantics. The main feature of this language is that it allows for most of the common CER operators (such as selection, sequence, disjunction and iteration), without imposing restrictions on how they may be used and nested. Our proposed language can also accommodate $n$ -ary conditions, i.e., we can impose constraints on the patterns which relate multiple events of a stream, e.g., that the number of cells in a simulated tumor at the current timepoint is higher than their number at the previous timepoint. We also discuss the semantics of patterns written in our proposed language and show that these are well-defined. As a result, in order to know whether a given stream contains any complex events corresponding to a given pattern, we do not need to resort to a procedural computational model. The semantics of the language may be studied independently of the chosen computational model. Not only is this feature critical in itself, allowing for a systematic understanding of the use of operators, but it could also be of importance for optimization, which often relies on pattern re-writing and thus assumes that we can know when two patterns are equivalent without actually having to run their computational models. Previous work on CER has produced systems which are highly expressive (e.g., FlinkCEP [FlinkCEP]), but lack a proper, formal description. Some more recent work ([DBLP:journals/pvldb/BucchiGQRV22]) has attempted to construct a system which is both formal and efficient. However, it does not support $n$ -ary expressions, allowing only (non-temporal) constraints that are applied to the last event read from a stream.

Before presenting $\mathit{SRT}$ , we first present a high-level formalism for defining CER patterns. We extend the work presented in [DBLP:journals/jcss/LibkinTV15], where the notion of regular expressions with memory ( $\mathit{REM}$ ) was introduced. These regular expressions can store some terminal symbols in order to compare them later against a new input element for (in)equality. One important limitation of