Timely Reporting of Heavy Hitters Using External Memory

Published: 15 November 2021

Abstract

    Given an input stream S of size N, a ɸ-heavy hitter is an item that occurs at least ɸN times in S. The problem of finding heavy-hitters is extensively studied in the database literature.
    We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = ɸ N-th occurrence (and hence it becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams with a low reporting threshold (high sensitivity).
    Like the classic heavy-hitters problem, solving the TED problem without false-positives requires large space (Ω(N) words). Thus, in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes).
    We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O-bandwidth (not latency) and support a tunable tradeoff between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead.
    We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures can scale to process 11M observations per second before becoming CPU bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device’s random I/O throughput, i.e., ≈100K observations per second.

    1 Introduction

    Real-time monitoring of high-rate data streams, with the goal of detecting and preventing malicious events, is a critical component of defense systems for cybersecurity [46, 58, 61] as well as for physical systems, e.g., for water or power distribution [8, 44, 47]. In such a monitoring system, the stream elements represent the changes to the state of the system. Each detected/reported event could trigger an intervention. Analysts use more specialized tools to gauge the actual threat level. Newer systems are even beginning to take defensive actions, such as blocking a remote host automatically based on detected events [39, 50]. Accuracy (i.e., few false-positives and no false-negatives) and timeliness of event detection are essential to these systems.
    Central to these applications is the problem of timely reporting of heavy hitters. In the heavy-hitters problem, we are given a stream S of N elements and a reporting threshold T = ɸN, and we must report all elements that occur at least T times in S. In the preliminary version of this article [57], we introduced the real-time version of the heavy-hitters problem, called the Timely Event Detection (TED) problem. In the TED problem, each heavy hitter must be reported soon after its Tth occurrence, where the acceptable reporting delay is defined by the application.
    In network-security monitoring applications, N is huge and T can be very small. This is because anomalies in network streams are often small-sized events that develop slowly, appearing normal in the midst of large amounts of legitimate traffic [48, 60]. As an example of the demands placed on event-detection systems, the U.S. Department of Defense (DoD) and Sandia National Laboratories developed the Firehose streaming benchmark suite [4, 5] to measure the performance of TED algorithms. In the Firehose benchmark, the reporting threshold is preset to the representative value of T = 24, i.e., ɸ = 24/N.
    The classic streaming algorithms for reporting heavy-hitters were designed assuming that only an in-RAM data structure can keep up with high-speed streams. The challenge of detecting events entirely within RAM has inspired a deep and beautiful literature on streaming algorithms and database systems [3, 16, 18, 19, 21, 22, 29, 34, 35, 36, 37, 45, 49].
    However, streaming algorithms sacrifice accuracy to get solutions that can fit in RAM. First, most streaming heavy-hitter algorithms only work for high reporting thresholds, e.g., when T is a constant fraction of N. Second, they allow false positives. Third, many streaming algorithms perform some kind of sampling, which leads to false negatives. These inaccuracies are not the fault of the streaming algorithms. They are an inherent limitation when the stream is much larger than RAM. See Section 9 for a motivating application, where any of these three limitations would lead to failure.
    Combining streaming and external memory. This work challenges the assumption that only in-RAM data structures can keep up with real-world streams and shows that by using modern storage devices and building upon recent advances in external-memory dictionaries, we can design on-disk data structures that can process millions of stream events per second.
    In particular, we present algorithms in the external-memory model that support both exact and approximate reporting of heavy hitters. In the external-memory model [2], RAM has fixed size M, and accessing it is free. The disk has unbounded size and accessing it costs an input/output (I/O) transaction. An I/O transfers data between RAM and disk in blocks of size B. The algorithmic advantage of external memory is that there is unbounded storage. The algorithmic challenge is that I/Os are expensive.
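    To make the model's accounting concrete (a standard illustration, not specific to our data structures): scanning n contiguous items costs ⌈n/B⌉ I/Os; binary search over an on-disk sorted array costs O(log₂(n/B)) I/Os; and a B-tree search costs O(log_B n) I/Os. An algorithm that performs one random search per stream item therefore pays roughly one I/O per item once its state exceeds RAM, whereas an algorithm that only appends and merges can amortize each I/O over B items.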
    External-memory enables us to overcome longstanding limitations in accuracy (i.e., no false-positives or negatives) and sensitivity (i.e., small ɸ) while maintaining timeliness in event reporting, but necessitates developing new heavy-hitters algorithms that use I/Os efficiently.
    Our contributions. In this article, we present I/O-efficient external-memory algorithms that support both exact and approximate reporting of heavy hitters. Specifically, our TED algorithms can be generalized to solve the (ɸ, δ)-heavy hitters problem: every item that occurs at least ɸN times must be reported, and no item that occurs fewer than (ɸ − δ)N times should be reported. Items with counts in between (ɸ − δ)N and ɸN may be reported; these are false positives.
    The present article serves as the journal version of [57], and it also contains technical improvements. Our first contribution is theoretical. We include proofs for all lemmas and theorems, unlike [57], which, for space reasons, omitted essentially all proofs. Explaining and proving these results has more than doubled the length of the article and makes the results reproducible. Furthermore, we generalize the results for power-law streams presented in Reference [57] by specifying the precise relationship between the reporting threshold ɸ and the power-law exponent θ; for details, see Section 5.
    Our second contribution is experimental. We include all of the experiments from Reference [57] along with additional ones. In particular, we give an empirical analysis of the birthtime versus lifetime of items in the active-set generator of the Firehose streaming benchmark [4, 5]. Also, in the interest of reproducibility, we have included pseudocode for all data structures and algorithms.
    Finally, we provide detailed explanation of how the constraints of the TED problem are motivated from practice. In particular, in Section 9, we discuss the national-security application that motivates the Firehose benchmark [4, 5], and how the TED problem captures the main computational bottleneck of this application.
    Timeliness, not ingestion, is the challenge in external memory. Stream ingestion is not the bottleneck for on-disk data structures. Optimal external-memory (EM) dictionaries (including write-optimized dictionaries such as Bε-trees [11, 13, 25], COLAs [12], xDicts [24], buffered repository trees [26], write-optimized skip lists [15], log-structured merge trees [55], and optimal external-memory hash tables [31, 42]) can ingest new observations at a significant fraction of disk bandwidth. The fastest can index using o(1) amortized I/Os per stream item, which is far less than one I/O per item. In practice, this means that even a system with just a single disk can ingest hundreds of thousands to millions of items per second.
    For example, prior work at SuperComputing 2017 showed that a single computer can easily maintain an on-disk Bε-tree [25] index of all connections on a 600 gigabit/second network [10]. The system could efficiently answer offline queries. What the system could not do was detect events online.
    Existing external-memory data structures do not solve the TED problem, because queries are too slow. For example, consider a straw-man solution in which we use an external-memory dictionary to implement the standard heavy-hitters algorithm, Misra-Gries [53]. Since Misra-Gries performs a query for each stream observation, this approach is bottlenecked on the dictionary searches. Once the dictionary is larger than RAM, for a random stream, most queries will miss the cache and require an I/O, and hence the approach is bottlenecked on the latency of the storage device.
    In this article, we show how to perform timely event detection for essentially the same cost as simply inserting the data into a Bε-tree or other optimal external-memory dictionary. Even so, we manage to answer the standing heavy-hitter query for each new stream element.

    1.1 Results

    In this article, we present external-memory algorithms for the TED problem. We evaluate these algorithms theoretically and empirically. In both cases, we show that these algorithms perform much less than one I/O per query and are limited only by I/O bandwidth (not latency). Furthermore, we show how to provide a tradeoff between reporting delay and I/O cost. We call these data structures leveled external-memory reporting tables (LERTs).
    We begin by formally defining an event that must be reported in the TED problem. Given a stream S = s_1, s_2, …, s_N, a ɸ-heavy hitter is an element that occurs at least ɸN times in S. The heavy-hitters problem is to report all ɸ-heavy-hitters in S.
    In the TED problem, we say that there is a ɸ-event at timestep t if stream element s_t occurs exactly ɸN times in s_1, …, s_t. Thus, for each ɸ-heavy hitter there is a single ɸ-event, which occurs when the element's count reaches the reporting threshold T = ɸN. In the TED problem, the goal is to report ɸ-events as soon as they occur.
    Our first data structure, the Misra-Gries LERT, adapts the Misra-Gries heavy-hitter algorithm to solve the TED problem in external memory with immediate reporting. In particular, the Misra-Gries LERT reports each ɸ-event as soon as it occurs (no delay) at an amortized I/O cost matching that of ingestion, for sufficiently large ɸ. The guarantees of the Misra-Gries LERT hold for any input distribution; see Corollary 1.
    The Misra-Gries LERT serves as the basis of our main algorithms that support much smaller ɸ, but permit some delay in reporting. We define two types of delay: time stretch and count stretch. We say an event-detection algorithm has time stretch 1 + α if each item s is reported at most αF_s timesteps after s's Tth occurrence, where F_s is the number of timesteps between s's first and Tth occurrences. We say that an event-detection algorithm has count stretch β if each item is reported before the item's count reaches βT.
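    For concreteness, consider a hypothetical item with T = ɸN = 1,000 whose first occurrence is at timestep 10⁶ and whose 1,000th occurrence is at timestep 3 × 10⁶, so F_s = 2 × 10⁶. Under time stretch 1.5 (α = 0.5), the item must be reported by timestep 4 × 10⁶, i.e., within αF_s = 10⁶ timesteps of its 1,000th occurrence. Under count stretch 1.5, it must be reported before its count reaches 1,500.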
    We design a data structure, the time-stretch LERT, that solves the TED problem for any input stream and any ɸ with time stretch 1 + α, at an amortized I/O cost per stream item that, for constant α, asymptotically matches the cost of simply ingesting and indexing the data [12, 25, 26]. The time-stretch LERT guarantees hold for any input distribution; see Corollary 2.
    In our evaluations, the time-stretch LERT with stretch 2 can ingest at K insertions/second using a single thread. We also observed that the average empirical time stretch is 43% smaller than the theoretical upper bound.
    Our count-stretch LERT is tailored to guarantee count stretch on input stream distributions where the count for each item is drawn from a power-law distribution. In particular, given an input stream with item counts distributed according to a power law with an exponent θ in the typical range [1, 16, 23, 30, 54], and parameters T and Ω satisfying the condition of Theorem 10, we show that the count-stretch LERT solves the TED problem with count stretch 1 + Ω at a low amortized I/O cost per stream item with high probability (w.h.p.). Thus, the count-stretch LERT avoids expensive point queries, matching the ingestion rate of write-optimized data structures. In our evaluations, we find that the count-stretch LERT with stretch 1.583 sustains a high single-threaded ingestion rate. With multi-threading and de-amortization, the count-stretch LERT scales to more than 11M insertions/second, and the variance of the instantaneous throughput goes down by several orders of magnitude relative to the amortized, single-threaded version; see Figure 9. Moreover, the average empirical count stretch is 21% smaller than the theoretical upper bound.
    Finally, we show how to modify the count-stretch LERT to support immediate reporting. We call the resulting data structure the Immediate-report LERT and show that it solves the TED problem much faster than the Misra-Gries LERT for input streams with element counts drawn from power-law distributions; see Theorem 9 for the formal I/O cost. In our evaluation, we find that the Immediate-report LERT can ingest at ≈500K insertions/second using a single thread.

    Additional Related Work

    Heavy-hitter algorithms. The heavy-hitter problem has been extensively studied in the database literature; we refer readers to the survey by Cormode and Hadjieleftheriou [32].
    Two main strategies have been used: deterministic counter-based approaches [19, 36, 43, 49, 51, 53] and randomized sketch-based approaches [29, 33]. The first is based on the classic Misra-Gries (MG) algorithm [53], which generalizes the Boyer-Moore majority finding algorithm [20].
    Randomized sketch-based algorithms such as count-min sketch [33] maintain a small sketch of the frequency vectors using compact hash functions.
    More recent work has focused on generalizations of the heavy-hitters problem. Ting [59] considers aggregating subset sums, rather than counts, and Ben-Basat et al. [9] generalize the heavy hitter problem to sliding windows. Multiple researchers [52, 62, 63] have designed heavy-hitter algorithms for detecting top flows in networking applications.
    Database iceberg queries. The TED problem is related to the problem of answering iceberg queries in databases [17, 38, 40, 41]. An iceberg query computes an aggregate function over some database attribute and reports the values that are above some predetermined threshold. The main distinctions between the two problems are as follows: (a) iceberg queries are offline, i.e., performed on a static dataset; and (b) the number of reported results in iceberg queries is usually small, while the number of reported events can be large in the TED problem.
    Database continuous queries. The TED problem is an instance of a continuous or standing query over a database [6, 7, 28]. A continuous query, once issued, runs as the database is updated through inserts and deletes. The system reports new query matches as the database is updated. In TED, the database D consists of the items from the stream seen so far, and the continuous query over D is whether there is an item with count exactly T = ɸN.

    2 Preliminaries

    We formalize our model and review several building blocks of our data structures: the Misra-Gries heavy-hitters algorithm [53], counting quotient filters (CQF) [56], and cascade filters (CF) [14].
    TED problem and model. The TED problem is as follows: given a stream S = s_1, …, s_N, for each i, if there is a ɸ-event at time i, then report s_i before time j, such that the reporting delay j − i is within an acceptable degree of tolerance. In the Misra-Gries LERT in Section 3.2, there is no reporting delay. In the time-stretch LERT in Section 4, the reporting delay depends on the flow time of the item (the time it takes for the item's count to go from zero to ɸN), and in the count-stretch LERT in Section 5.2, the reporting delay is count-dependent.
    We measure time in terms of the number of stream observations. That is, in each timestep, the algorithm reads one stream observation, performs an arbitrary amount of computation and I/O, and generates an arbitrary number of reports. We say all reports generated during the ith timestep occur at time i.
    The Misra-Gries frequency estimator. The MG algorithm estimates the frequency of items in a stream. Given an estimation error ε and a stream S of N items from a universe U, the MG algorithm uses a single pass over S to construct a table C with at most 1/ε entries. Each table entry is an item s with a count, denoted C[s]. For each s not in table C, let C[s] = 0. Let f_s be the number of occurrences of item s in stream S. The MG algorithm guarantees that f_s − εN ≤ C[s] ≤ f_s for all s ∈ U.
    MG initializes C to an empty table and processes items in the stream as described below. For each s_i in S,
    If s_i ∈ C, then increment counter C[s_i].
    If s_i ∉ C and |C| < 1/ε, then insert s_i into C. Set C[s_i] = 1.
    If s_i ∉ C and |C| = 1/ε, then for each s ∈ C, decrement C[s] and delete its entry if C[s] becomes 0.
    To see why this algorithm ensures that f_s − εN ≤ C[s] ≤ f_s for all s, note that C[s] is incremented only for an occurrence of s in S. Thus C[s] ≤ f_s. For the lower bound, whenever we decrement C[s], the counts of 1/ε other items are decremented as well, so each decrement round consumes more than 1/ε stream occurrences. This can happen at most εN times. Thus, f_s − εN ≤ C[s].
    The MG algorithm can be used to solve the (ɸ, δ)-heavy hitters problem as follows. Run the MG algorithm on the stream with error parameter ε = δ. Then iterate over the set C and report any item s with C[s] ≥ (ɸ − δ)N.
    For a frequency estimation error of εN, Misra-Gries uses O(1/ε) words of storage, assuming each stream item and each count occupy O(1) words.
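    For concreteness, here is a minimal sketch of the MG estimator in C++ (our own illustration, not the paper's implementation; the table is an in-memory hash map with k = ⌈1/ε⌉ counters):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Minimal Misra-Gries estimator with k = ceil(1/eps) counters. After N
// observations, every estimate satisfies f_s - eps*N <= C[s] <= f_s.
class MisraGries {
    std::size_t k_;                                    // max number of counters
    std::unordered_map<std::string, long> table_;      // item -> counter C[s]

public:
    explicit MisraGries(std::size_t k) : k_(k) {}

    void observe(const std::string& s) {
        auto it = table_.find(s);
        if (it != table_.end()) {                      // case 1: counter exists
            ++it->second;
        } else if (table_.size() < k_) {               // case 2: spare counter
            table_.emplace(s, 1);
        } else {                                       // case 3: decrement all
            for (auto jt = table_.begin(); jt != table_.end();) {
                if (--jt->second == 0) jt = table_.erase(jt);
                else ++jt;
            }
        }
    }

    long estimate(const std::string& s) const {        // C[s]; 0 if untracked
        auto it = table_.find(s);
        return it == table_.end() ? 0 : it->second;
    }
};
```

    Note how each decrement round (case 3) consumes 1/ε tracked occurrences together with the arriving one, which is what bounds the total undercount by εN.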
    Analogous to the (ɸ, δ)-heavy hitters problem, we define the approximate TED problem as follows: Report all ɸ-events soon after they occur and do not report any item with count less than (ɸ − δ)N. Reported items with count in between (ɸ − δ)N and ɸN are false positives.
    Counting Quotient Filter. The CQF [56] can be viewed as a hash table based on Robin-Hood hashing [27]. The CQF consists of an array Q of 2^q slots and a hash function h mapping stream elements to p-bit integers, where p = q + r. Robin-Hood hashing is a variant of linear probing in which we try to place an element a in the slot determined by h(a), but shift elements down when there are collisions. Furthermore, Robin-Hood hashing maintains the invariant that, if h(a) < h(b), then a will be in an earlier slot than b.
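    The following sketch conveys the Robin-Hood insertion order that the CQF relies on (a simplified illustration with raw 64-bit keys; quotienting, counts, metadata bits, and wrap-around subtleties of the real CQF are omitted, and the table is assumed to never fill):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Simplified Robin-Hood insertion: occupied slots stay sorted by hash, so
// if h(a) < h(b), then a sits in an earlier slot than b.
struct RobinHoodTable {
    std::vector<std::uint64_t> slots;                  // 0 marks an empty slot
    explicit RobinHoodTable(std::size_t n) : slots(n, 0) {}

    static std::uint64_t h(std::uint64_t x) { return x * 0x9E3779B97F4A7C15ULL; }

    void insert(std::uint64_t key) {
        std::size_t i = h(key) % slots.size();
        std::uint64_t cur = key;
        while (slots[i] != 0) {
            // If the incumbent hashes later than the incoming element, the
            // incumbent is displaced and continues probing ("Robin Hood").
            if (h(slots[i]) > h(cur)) std::swap(slots[i], cur);
            i = (i + 1) % slots.size();
        }
        slots[i] = cur;
    }
};
```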
    The CQF supports efficient insertions, queries, updates, and deletions, just like any Robin-Hood hash table. Thus, it is straightforward to implement the Misra-Gries algorithm on top of a CQF, by using the CQF to store the table C.
    Cascade Filter. The CF [14] is a write-optimized data structure based on the CQF [56] and the COLA [12]. The CF consists of multiple levels with exponentially increasing sizes where each level is a CQF. The first level is in RAM and the rest are on SSD. There are L = log_r(N/M) levels, where M is the size of RAM, N is the size of the dataset, and r is the factor by which levels grow in size.
    Since the cascade filter is also a map, we can use it as the basis for an EM Misra-Gries algorithm. The total table size is O(N). The amortized I/O cost to update the table for each stream element is O((r/B) log_r(N/M)). However, if we want to support immediate reporting in a CF, then a query is triggered after each insert, and it costs O(log_r(N/M)) I/Os. Thus the overall algorithm is bottlenecked on the queries performed for each stream element.

    3 Immediate Reporting

    In this section, we first design an efficient external-memory version of the core Misra-Gries frequency estimator and then extend our external-memory Misra-Gries algorithm to solve the TED problem with immediate reporting.
    When 1/ε > M, simply running the standard Misra-Gries algorithm can result in a cache miss for every stream element, incurring an amortized cost of Ω(1) I/Os per element. Our construction reduces this to O((r/B) log_r(N/M)) I/Os, which is o(1) when B = ω(r log_r(N/M)).

    3.1 External-memory Misra-Gries

    Our external-memory Misra-Gries data structure is a sequence of Misra-Gries tables, C_0, C_1, …, C_L, where L = ⌈log_r(1/(δM))⌉ and r is a growth parameter we set later. The size of the table at level i is r^i M, so the size of the last level is at least 1/δ.
    Each level acts as a Misra-Gries data structure. Level 0 receives its input from the stream. Level i + 1 receives its input from level i, the level above. Whenever the standard Misra-Gries algorithm for the table at level i would decrement an item count, the external-memory MG data structure decrements that item's count by one on level i and sends one instance of that item to the level below (level i + 1). The decrements from the last level L are deleted.
    The external-memory MG algorithm processes the input stream by inserting each item in the stream into C_0. To insert an item x into level i, do the following:
    If x ∈ C_i, then increment C_i[x].
    If x ∉ C_i and |C_i| < r^i M, then insert x into C_i. Set C_i[x] = 1.
    If x ∉ C_i and |C_i| = r^i M, then, for each y in C_i, decrement C_i[y]; remove y from C_i if C_i[y] becomes 0. If i < L, then recursively insert y into C_{i+1}.
    We call the process of decrementing the counts of all the items at level i and incrementing all the corresponding item counts at level i + 1 a flush; a sketch of this cascade appears below.
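    The following sketch captures the cascade (our illustration: levels are plain in-memory hash maps rather than cascade-filter levels, and the capacity of level i is r^i·M entries):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of the external-memory Misra-Gries cascade. Level i holds at most
// r^i * M counters; a decrement round at level i sends one instance of each
// decremented item (and of the arriving item) down to level i + 1, and
// decrements falling off the last level vanish.
class CascadeMG {
    std::vector<std::unordered_map<std::string, long>> levels_;
    std::vector<std::size_t> cap_;

public:
    CascadeMG(std::size_t M, std::size_t r, std::size_t L) {
        for (std::size_t i = 0, c = M; i <= L; ++i, c *= r) {
            levels_.emplace_back();
            cap_.push_back(c);
        }
    }

    void observe(const std::string& x) { insert(x, 0); }

private:
    void insert(const std::string& x, std::size_t i) {
        if (i >= levels_.size()) return;               // fell off the last level
        auto& lvl = levels_[i];
        auto it = lvl.find(x);
        if (it != lvl.end()) { ++it->second; return; } // case 1: increment
        if (lvl.size() < cap_[i]) { lvl.emplace(x, 1); return; }  // case 2
        // case 3: level is full -- decrement every counter and flush one
        // instance of each decremented item one level down.
        for (auto jt = lvl.begin(); jt != lvl.end();) {
            insert(jt->first, i + 1);
            if (--jt->second == 0) jt = lvl.erase(jt);
            else ++jt;
        }
        insert(x, i + 1);  // the arriving occurrence is decremented away too
    }
};
```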
    Lemma 1 shows that every prefix of levels in the external-memory MG data structure is an MG frequency estimator, with the accuracy of the estimates increasing with j.
    Lemma 1.
    Let f̂_j(x) = Σ_{i=0}^{j} C_i[x] (where C_i[x] = 0 if x ∉ C_i). Then, the following holds:
    f̂_j(x) ≤ f_x, and,
    f_x − N/(r^j M) ≤ f̂_j(x).
    Proof.
    Decrementing the count for an element x on a level i < j and inserting it on the next level does not change f̂_j(x). This means that f̂_j(x) changes only when we insert an item x from the input stream into C_0 or when we decrement the count of an element on level j. Thus, as in the MG algorithm, f̂_j(x) is only incremented when x occurs in the stream, and is decremented only when the counts for other elements are also decremented. The first inequality follows from this and the MG analysis. The second inequality follows from the first and the fact that the combined size of levels 0, …, j is at least r^j M, so such a decrement can happen at most N/(r^j M) times.□
    Thus, to report (ɸ, δ)-heavy hitters (at the end of the stream), we can iterate over the sets C_0, …, C_L and report any element x with counter f̂_L(x) ≥ (ɸ − δ)N.
    For the I/O analysis, we assume that each level of the external-memory MG structure is implemented as a cascade filter [14].
    Lemma 2.
    Given δ, the amortized I/O cost of insertion in the external-memory MG data structure is O((r/B) log_r(1/(δM))).
    Proof.
    A flush from level i to level i + 1 in a cascade filter is implemented by scanning both levels, which can be done in O(r^{i+1} M/B) I/Os. Each such flush moves at least r^i M stream elements down one level, so the amortized cost to move one stream element down one level is O(r/B) I/Os. Each stream element can be moved down at most L levels. Thus, the overall amortized I/O cost is O((r/B) log_r(1/(δM))), which is minimized at r = e.□
    When no false positives are allowed, that is, δ = 1/N, the I/O complexity is O((r/B) log_r(N/M)).

    3.2 Misra-Gries LERT

    We extend our external-memory MG data structure to support immediate reporting. That is, we show that for a threshold ɸ that is sufficiently large, it can report ɸ-events as soon as they occur.
    A first attempt to add immediate reporting is to compute f̂_L(s_i) for each stream event s_i and report s_i as soon as f̂_L(s_i) = ɸN. However, this requires querying for f̂_L(s_i) for every stream item and can cost up to O(L) I/Os per stream item.
    We avoid these expensive queries by using the properties of the in-memory MG estimates C_0. If C_0[s_i] + ε_0 N < ɸN, where ε_0 is the frequency error of the in-memory level, then we know that f_{s_i} < ɸN and we therefore do not have to report s_i, regardless of the count for s_i in the lower levels of the external-memory data structure.
    We describe the new data structure, the Misra-Gries LERT. Whenever we increment C_0[s_i] from a value that is at most ɸN − ε_0 N to a value that is greater than ɸN − ε_0 N, we compute f̂_L(s_i) and report s_i if f̂_L(s_i) = ɸN. For each entry C_0[x], we store a bit indicating whether we have performed a query for x, along with a second count that stores the number of occurrences of x needed to hit reporting threshold ɸN. We set this second count appropriately whenever we compute f̂_L(x) without reporting x. When an instance of x arrives, C_0[x] is incremented as in external-memory MG, and if the search bit is set, then we also decrement the second count; if a decrement causes it to become zero, then we report x. As in our external-memory MG structure, if the count for an entry becomes 0, then we delete that entry (along with its metadata). This means we might query for the same item more than once; as we see below, this has no effect on the overall I/O cost of the algorithm.1
    To avoid reporting the same item more than once, we can maintain, with each entry, a bit indicating whether that item has already been reported.
    Whenever we report an item x, we set the "reported" bit in C_0[x]. Whenever we flush an item from level i to level i + 1, we set the bit for that item on level i + 1 if it is set on level i. When we delete the entry for an item that has the bit set on level L, we add an entry for that item on a new level L + 1. This new level contains only items that have already been reported. When we are checking whether to report an item during a query, we stop checking further and omit reporting as soon as we reach a level where the bit is set.
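    The RAM-level bookkeeping can be sketched as follows (our illustration, not the paper's code: query_threshold stands for the trigger value discussed above, and the full_count callback stands for the point query that sweeps the on-disk levels; flush interactions are omitted):

```cpp
#include <functional>
#include <string>
#include <unordered_map>

// Sketch of the Misra-Gries LERT reporting logic at the RAM level.
class MgLertRam {
    struct Entry {
        long count = 0;      // Misra-Gries counter in RAM
        long remaining = 0;  // occurrences still needed to hit phi*N (0 = unset)
        bool reported = false;
    };
    std::unordered_map<std::string, Entry> table_;
    long query_threshold_, report_threshold_;             // report_threshold_ = phi*N
    std::function<long(const std::string&)> full_count_;  // sweep of all levels

public:
    MgLertRam(long q, long t, std::function<long(const std::string&)> fc)
        : query_threshold_(q), report_threshold_(t), full_count_(std::move(fc)) {}

    // Returns true if x should be reported at this observation.
    bool observe(const std::string& x) {
        Entry& e = table_[x];
        long before = e.count++;
        if (e.reported) return false;
        if (e.remaining > 0) {                   // search bit set: count down
            if (--e.remaining == 0) return e.reported = true;
            return false;
        }
        if (before < query_threshold_ && e.count >= query_threshold_) {
            long f = full_count_(x);             // expensive sweep, rarely taken
            if (f >= report_threshold_) return e.reported = true;
            e.remaining = report_threshold_ - f; // occurrences left to threshold
        }
        return false;
    }
};
```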
    I/O complexity. For the analysis, we assume that the levels of the data structure are implemented as sorted arrays with fractional cascading, and thus computing f̂_L(x) requires O(L) I/Os.
    Theorem 3.
    Given a stream of size N and parameters δ and ɸ, where δ ≤ ɸ and ɸ is sufficiently large, the approximate TED problem can be solved with immediate reporting at an amortized I/O cost of O((r/B) log_r(1/(δM))) per stream item.
    Proof.
    The amortized cost of performing insertions is O((r/B) log_r(1/(δM))) by Lemma 2.
    To analyze the query costs, let ε_0 be the frequency error of the in-memory level. Since we perform at most one query each time an item's count in C_0 climbs from 0 past the query threshold ɸN − ε_0 N, the total number of queries is at most N/(ɸN − ε_0 N). Since each query costs O(L) I/Os, the overall amortized I/O complexity of the queries is O(L/(ɸN − ε_0 N)) per stream item, which is dominated by the insertion cost for sufficiently large ɸ.□
    Exact reporting. To solve the problem exactly, that is, with no false positives, we set δ = 1/N in Theorem 3, and get the following corollary.
    Corollary 1.
    Given a stream of size N and a sufficiently large ɸ, the TED problem can be solved with immediate reporting at an amortized I/O cost of O((r/B) log_r(N/M)) per stream item.
    Remark 1.
    The following example shows that the analysis of the Misra-Gries LERT is asymptotically tight. In particular, when the RAM threshold is reached for an item, then the item's counts are spread across all L levels of the data structure, requiring a full sweep to consolidate its count and report it; moreover, the total number of such queries can be large.
    Let and , so the Misra-Gries LERT has levels. Let the threshold , which satisfies the condition that , and let . Consider the stream S defined below.
    where the x_i's and y_i's are all distinct and not equal to item a. For this stream, every sufficiently long run of unique elements causes a decrement to the count of all items at the ith level, pushing instances of item a down to level i + 1. When item a reaches the reporting threshold of ɸN during the last phase, its instances occur all the way down to the last level in the Misra-Gries LERT.
    Thus, when the instance of a that triggers a report enters the system, we must collect at least one instance of a from every single level to recognize the need to report. Furthermore, every unique element of the stream triggers a query in RAM (since the RAM threshold is met), and there can be many such queries.
    Summary. The Misra-Gries LERT supports throughput at least as high as that of optimal write-optimized dictionaries [12, 13, 15, 24, 25, 26], while estimating the counts as well as if it had an enormous RAM. It maintains count estimates at different granularities across the levels. Not all estimates are actually needed, but given a small number of levels, we can refine the estimates by looking in only a few additional locations.
    The external-memory MG algorithm helps us solve the TED problem. The smallest MG sketch (which fits in memory) is the most important estimator here, because it serves to sparsify queries to the rest of the structure. When such a query gets triggered, we need the total counts from all remaining levels for the (exact) online event-detection problem, but only a prefix of the levels when approximate thresholds are permitted. In the next two sections, we exploit other advantages of this cascading technique to support much lower ɸ without sacrificing I/O efficiency.

    4 Time Stretch

    The MG LERT described in Section 3.2 reports events immediately, albeit at a high amortized I/O cost to perform queries to recognize the need for reporting. In this section, we show that if we allow a bounded reporting delay proportional to the time it takes an item to become a ɸ-event, then we can significantly improve the I/O performance—in particular, we can perform timely event detection asymptotically as cheaply as if we reported all events only at the end of the stream.
    Our data structure guarantees a time stretch of 1 + α. That is, it reports an item x no later than time t_T + αF_x, where t_1 is the time of the first occurrence of x, t_T is the time of the ɸNth occurrence of x, and F_x = t_T − t_1 is the flow time of x.

    4.1 Time-stretch LERT

    We design a data structure to guarantee time stretch, the time-stretch LERT. Similarly to the Misra-Gries LERT, the time-stretch LERT consists of levels C_0, …, C_L. The ith level has size r^i M. Items are flushed from lower to higher levels.
    Unlike the Misra-Gries LERT, all events are detected during the flush operations. Thus, we never need to perform point queries. This means (1) we can use simple sorted arrays to represent each level and (2) we do not need to maintain the invariant that level 0 is a MG data structure on its own.
    Data structure layout. We split the table at each level i into c equal-sized bins B_1^i, …, B_c^i, each of size r^i M/c, where the number of bins c depends on the stretch parameter α. The capacity of a bin is defined by the sum of the counts of the items in that bin, i.e., a bin at level i can become full because it contains r^i M/c items, each with count 1, or 1 item with count r^i M/c, or any other such combination. See Figure 1.
    Flushing schedule. We maintain a strict flushing schedule to obtain the time-stretch guarantee. The flushes are performed at the granularity of bins (rather than entire levels). The scheduling algorithm is described below.
    Fig. 1. A depiction of bins at each level of the time-stretch LERT; α is the time-stretch parameter. EM stands for external memory. All bins are equal sized.
    Let B_1^i, …, B_c^i be the bins (in order) on level i, where level 0 is RAM.
    Each stream item is inserted into B_1^0, the first bin in RAM.
    Whenever a bin becomes full, we shift all the bins on level i over by one; that is, we move the contents of bin B_j^i to the adjacent bin B_{j+1}^i. The elements of the last bin at level i, B_c^i, are moved to B_1^{i+1}, the first bin on the next level.
    Since the bins in level i + 1 are r times larger than the bins in level i, bin B_1^{i+1} becomes full after exactly r flushes from level i. When this happens, we perform a shift and flush the last bin on level i + 1, and so on.
    Count consolidation. Finally, during a flush involving levels 0 through j, we scan these levels and, for each item k, we sum its counts. If the total count is at least (ɸ − δ)N (and we have not reported it before2), then we report k. A sketch of this schedule appears below.
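    The flushing schedule can be sketched as follows (our illustration: each bin is a map from item to count, reporting and last-level handling are simplified away, and bin capacities follow the r^i·M/c sizing described above):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Bin = std::unordered_map<std::string, long>;     // item -> count

// Sketch of the time-stretch LERT flushing schedule: each level holds c
// equal-sized bins; when the first RAM bin fills, bins shift over by one
// and the last bin of a level spills into the first bin of the next level.
class TimeStretchLert {
    std::vector<std::vector<Bin>> levels_;             // levels_[i]: c bins
    std::vector<std::size_t> bin_cap_;                 // per-level bin capacity

public:
    TimeStretchLert(std::size_t c, std::size_t L, std::size_t ram_bin_cap,
                    std::size_t r) {
        for (std::size_t i = 0, cap = ram_bin_cap; i <= L; ++i, cap *= r) {
            levels_.emplace_back(c);
            bin_cap_.push_back(cap);
        }
    }

    void observe(const std::string& x) {
        ++levels_[0][0][x];                            // items enter bin 1 in RAM
        if (load(levels_[0][0]) >= bin_cap_[0]) shift(0);
    }

private:
    static std::size_t load(const Bin& b) {            // capacity = sum of counts
        std::size_t s = 0;
        for (const auto& kv : b) s += static_cast<std::size_t>(kv.second);
        return s;
    }

    void shift(std::size_t i) {
        Bin last = std::move(levels_[i].back());       // bin leaving level i
        for (std::size_t j = levels_[i].size() - 1; j > 0; --j)
            levels_[i][j] = std::move(levels_[i][j - 1]);
        levels_[i][0] = Bin{};
        if (i + 1 == levels_.size()) return;           // simplification: no last-level handling
        for (const auto& kv : last)                    // spill into next level
            levels_[i + 1][0][kv.first] += kv.second;
        // count consolidation and reporting would happen here (see text)
        if (load(levels_[i + 1][0]) >= bin_cap_[i + 1]) shift(i + 1);
    }
};
```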

    4.2 Analysis of Time-stretch LERT

    Correctness. We show that our data structure guarantees time stretch.
    Lemma 4.
    The time-stretch LERT with stretch parameter α reports each ɸ-event s_t occurring at time t by time t + αF_t, where F_t is the flow time of s_t.
    Proof.
    Consider an item s_t with flow time F_t. Let ℓ be the largest level containing an instance of s_t at time t, when it hits the threshold count of ɸN. The flushing schedule guarantees that, for each level i < ℓ, the item s_t must have waited through the bins on that level before being inserted into level i + 1. This is dominated by the waiting time on level ℓ − 1. That is,
    F_t is at least the number of observations needed to fill a bin on level ℓ − 1. (1)
    Level ℓ participates in a flush again after the number of inserts that fill up a bin on level ℓ. Using Equation (1), we get that this number of inserts is at most αF_t. Thus, s_t is reported at most αF_t timesteps after t.□
    I/O complexity. For the analysis, we treat each level as a sorted array.
    Theorem 5.
    Given a stream of size N and parameters α, δ, and ɸ, where δ ≤ ɸ, the approximate TED problem can be solved with time stretch 1 + α at an amortized I/O cost of O((1 + 1/α)(r/B) log_r(1/(δM))) per stream item.
    Proof.
    A flush from level i to i + 1 costs O(r^{i+1} M/B) I/Os, and moves at least r^i M/c stream items down one level, where c = O(1 + 1/α) is the number of bins per level. Thus, the amortized cost to move one stream item down one level is O((1 + 1/α) r/B) I/Os.
    Each stream item can be moved down at most L levels; thus the overall amortized I/O cost of an insert is O((1 + 1/α)(r/B) log_r(1/(δM))), which is minimized at r = e.□
    For exact reporting (no false positives), we set δ = 1/N.
    Corollary 2.
    Given a stream of size N, α, and ɸ, the TED problem can be solved with time stretch 1 + α at an amortized I/O cost of O((1 + 1/α)(r/B) log_r(N/M)) per stream item.

    4.3 Implementation of Time-stretch LERT

    We implement each level in the time-stretch LERT as an exact counting quotient filter [56]. In addition to the count, we store a few additional bits with each item to keep track of its age.
    In the time-stretch LERT, each level is split into equal-sized bins. In our implementation, instead of actually splitting levels into physical bins, we assign a value (i.e., the age of the item) of ⌈log₂ c⌉ bits to each item that determines its bin. The age of the item on a level determines whether the item is ready to be flushed down from that level during a flush.
    We also assign an age to each level, initialized to 0. Before a flush, the age of each level involved in the flush is incremented. The age of a level wraps back to 0 after c increments. The age of the level during the flush determines which items are eligible to be flushed down: if an item's age is the same as the level's age, then the item has survived c flushes on that level, and is therefore eligible to flush. When an item is inserted in a level, its age is set to the level's age. However, if the level already has an instance of the item, then we just increment the count of the existing instance, whatever its age.
    We follow a fixed schedule for flushes. We trigger a flush after every group of M/c stream observations. Every rth flush, level i flushes to level i + 1. That is, after every r flushes to level i, level i is involved in the next (rth) flush. To determine the number of levels involved in a flush, we maintain a counter per level for the number of times the level has been involved in a flush from above since its last flush down. Algorithm 1 shows the pseudocode for the flush schedule of the levels in a time-stretch LERT.
    Note that only “eligible items” are flushed down a level during these flush operations, in particular, items that have aged enough to be in the last bin at a level—or equivalently, items whose age is equal to the level’s age.
    Consolidating item counts during a flush is implemented as a k-way merge sort. We first aggregate the count of an item across all k levels involved in the flush. We then decide, based on the age of the instance of the item in the last level, whether to move it to the next level. If the instance of the item in the last level is aged, then we insert the item with the aggregate count in the next level. Otherwise, we update the count of the instance in the last level to the aggregate count. Algorithm 2 shows the pseudocode for flushing items in a time-stretch LERT. We use T to denote the reporting threshold in the implementation.
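    The age bookkeeping itself reduces to a few lines (a sketch of the logic described above, not the paper's code):

```cpp
#include <cstdint>

// Ages replace physical bins: a level's age advances (mod c) before each of
// its flushes, and an item is eligible to move down once its age comes back
// around to the level's age, i.e., it has survived c flushes on that level.
struct AgeLogic {
    std::uint8_t c;                          // number of virtual bins per level

    std::uint8_t advance(std::uint8_t level_age) const {
        return static_cast<std::uint8_t>((level_age + 1) % c);
    }
    std::uint8_t stamp_new_item(std::uint8_t level_age) const {
        return level_age;                    // new items adopt the level's age
    }
    bool eligible(std::uint8_t item_age, std::uint8_t level_age) const {
        return item_age == level_age;        // survived c flushes at this level
    }
};
```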
    Summary. By allowing a little delay, we can solve the timely event-detection problem at the same asymptotic cost as simply indexing our data [12, 13, 15, 24, 25, 26].
    Recall that in the online solution, the increments and decrements of the MG algorithm determined the flushes from one level to the other. In contrast, the flushing decisions in the time-stretch solution are based entirely on the ages of the items. The MG-style count estimates came essentially for free from the size and cascading nature of the levels. Thus, we get different reporting guarantees depending on whether we flush based on age or count.
    Our experimental results for the TED problem with immediate reporting and with time stretch show that there is a spectrum between completely online and completely offline, and it is tunable with little I/O cost.

    5 Power-law Distributions

    Our results in Sections 3 and 4 hold for worst-case input streams. In this section, we design TED algorithms tailored to perform well on practical input streams, in particular where the item-counts follow a power-law distribution. Note that the order of arrivals can still be adversarial.
    The item counts in the stream follow a power-law distribution with exponent θ if the probability that an item has count c is equal to Z c^{−θ}, where Z is the normalization constant.
    Berinde et al. [16] consider streams where the item counts follow a Zipfian distribution. A stream follows a Zipfian distribution with exponent α if and only if it follows a power-law distribution with exponent θ = 1 + 1/α [1]. They show that for Zipfian distributions with α ≥ 1 (power-law distributions with θ ≤ 2), the MG algorithm can solve the approximate heavy-hitter problem with error εN using only O(ε^{−1/α}) words. Alternatively, on such Zipfian distributions, the MG algorithm achieves an improved error bound using O(k) words. The error bound is in fact stronger, as it applies to the tail frequency of the stream, rather than the whole stream. In particular, if c_i is the true count of item i, and ĉ_i is the estimate, then on Zipfian distributions with space O(k), |c_i − ĉ_i| ≤ F^{res(k)}/k, where F^{res(k)} is the sum of counts of all keys except the top-k most frequent keys [16]. Our Misra-Gries LERT data structure based on the MG algorithm automatically inherits these improved bounds.
    We give improved results for power-law streams with exponents in a range that is representative of power-law distributions observed in practice [54]. In Section 5.1, we study the exact TED problem and design algorithms tailored for such a distribution. In Section 5.2, we present a data structure that has improved I/O performance and guarantees a count-dependent bounded delay.
    Preliminaries. We use the continuous power-law definition [54]: the count of an item with a power-law distribution has a probability p(x)dx of taking a value in the interval from x to x + dx, where p(x) = Z x^{−θ} for x ≥ 1, and Z is the normalization constant.3
    Thus, Z = θ − 1.4
    We use the cumulative distribution of a power law:
    Pr[count > x] = x^{−(θ−1)}. (2)
    For our analysis, we assume that the input stream S is constructed offline as follows. Let U denote the number of distinct keys in the stream S. The count for each key is drawn independently from a power-law distribution. Then the instances of the keys in S are ordered arbitrarily. That is, we do not make any assumptions on the arrival order of keys. Next, we analyze some properties of the input stream.
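    For intuition, counts with this distribution can be generated by inverse-transform sampling; a minimal sketch, assuming a minimum count of 1 and θ > 1 (our own illustration, not part of the Firehose benchmark):

```cpp
#include <cmath>
#include <random>

// Inverse-transform sampling of the continuous power law
// p(x) = (theta - 1) x^{-theta}, x >= 1: since Pr[X > x] = x^{1 - theta},
// setting u = Pr[X > x] and solving gives x = u^{-1/(theta - 1)}
// for u uniform in (0, 1].
double sample_power_law_count(double theta, std::mt19937_64& rng) {
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    double u = 1.0 - unif(rng);              // u in (0, 1], avoids u == 0
    return std::pow(u, -1.0 / (theta - 1.0));
}
```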
    Lemma 6.
    In the input stream with U distinct keys, where the count of each key is drawn independently from a power-law distribution with θ > 2, the following holds with high probability with respect to U:
    (1)
    the number of keys with count greater than c is O(U c^{1−θ});
    (2)
    the size of the stream is N = Θ(U).
    Proof.
    Let X_i denote the indicator random variable that is 1 if key i has count greater than c and 0 otherwise. Let X = Σ_{i=1}^{U} X_i. Then E[X] = U · Pr[count > c] = U c^{1−θ} by Equation (2). This also holds with high probability with respect to U using Chernoff bounds. This proves (1) in the above lemma.
    Next, let Y_i be the random variable denoting the count of key i. Let N = Σ_{i=1}^{U} Y_i. Then E[Y_i] = ∫_1^∞ x (θ − 1) x^{−θ} dx = (θ − 1)/(θ − 2), so E[N] = U (θ − 1)/(θ − 2) = Θ(U).
    The result holds with high probability with respect to U using a Chernoff bound argument.□

    5.1 Immediate-report LERT

    First, we present the layout of our data structure, the Immediate-report LERT, and then we present its main algorithms, shuffle merge and immediate-reporting query. Finally we analyze its correctness and I/O performance.
    Data structure layout. Similarly to the data structures in the previous sections, the Immediate-report LERT consists of a cascade of tables, where M is the size of the table in RAM. There are L = log_r(N/M) levels on disk, where N is the size of the stream. The size of level i is r^i M.
    Each level on disk has an explicit upper bound on the number of instances of an item that can be stored on that level. This is different from the MG algorithm, where this upper bound is implicit and is based on the level's size. In particular, each level i in the Immediate-report LERT has a level threshold τ_i, for 1 ≤ i ≤ L (τ_1 ≤ τ_2 ≤ ⋯ ≤ τ_L), where τ_i indicates the maximum count of a key that can be stored on level i.
    Threshold invariant. We maintain the invariant that at most τ_i instances of an item can be stored on level i. Later, we show how to set the τ_i's based on the input stream's power-law exponent θ.
    Shuffle merge. The Misra-Gries LERT and time-stretch LERT use two different flushing strategies. Here we present a third strategy called the shuffle merge.
    The level in RAM receives inputs from the stream one at a time.
    When attempting to insert into a level that is at capacity, we find the smallest level j that has enough empty space to hold all items from levels 0, …, j − 1.
    We aggregate the count of each item k on levels 0, …, j, resulting in a consolidated count c_k.
    If c_k ≥ ɸN (and we have not reported it before5), then we report k. Otherwise, we distribute the c_k instances of k in a bottom-up fashion on levels j to 0, while maintaining the threshold invariant. In particular, we place min(c_k, τ_j) instances of k on level j, and min(τ_y, c_k − Σ_{z=y+1}^{j} τ_z) instances of k on level y for each y < j, with any remainder staying in RAM.
    In the above algorithm, notice that items can end up in higher levels (compared to the level they were on before), which is why we call this operation a shuffle-merge instead of a merge. Also, observe that the threshold invariant prevents us from flushing too many counts of an item down. Thus, items can get packed at a level and cannot be flushed down. Specifically, we say an item is packed at level i if its count exceeds Σ_{z=i}^{L} τ_z, the combined thresholds of levels i through L. A sketch of the distribution step appears below.
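    The bottom-up distribution step for a single item can be sketched as follows (our illustration; tau[i] is the level threshold τ_i, and place(level, n) is a hypothetical stand-in for writing n instances of the item into that level):

```cpp
#include <algorithm>
#include <vector>

// Bottom-up "smearing" of a consolidated count c_k over levels j..0 during a
// shuffle merge: each level i on disk takes at most tau[i] instances; whatever
// cannot be pushed down stays packed higher up (level 0 = RAM has no
// threshold here and absorbs the remainder).
void distribute(long long c_k, int j, const std::vector<long long>& tau,
                void (*place)(int level, long long n)) {
    for (int i = j; i >= 1; --i) {           // deepest level first
        long long n = std::min(c_k, tau[i]); // threshold invariant at level i
        if (n > 0) place(i, n);
        c_k -= n;
    }
    if (c_k > 0) place(0, c_k);              // remainder stays in RAM
}
```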
    To maintain efficient shuffle merges, the number of packed items at a level should not occupy more than a constant fraction of the size of the level. In Lemma 7, we show that given a power-law stream with exponent θ, we can set the thresholds based on θ so as to satisfy this requirement.
    Lemma 7.
    Let the counts of U distinct items in the stream of size N follow a power-law distribution with exponent θ > 2. Let the level thresholds τ_i, for 1 ≤ i ≤ L, be set as a function of θ. The number of keys packed at level i is at most a constant fraction of the size of level i.
    Proof.
    We prove the claim by induction on the levels, starting at level L. An item is packed at level L if its count is greater than τ_L. By Lemma 6 (1), there are O(U τ_L^{1−θ}) such items. By Lemma 6 (2), the size of the stream is N = Θ(U). The size of level L is r^L M. Thus, the number of items packed at level L is at most a constant fraction of the size of level L.
    Suppose the lemma holds for levels greater than i. We show that it holds for level i. Using Lemma 6 (1) and the induction hypothesis, the expected number of items whose counts exceed the combined thresholds of the levels below level i is bounded as
    (3)
    Finally, an item is packed at level i if its count is greater than Σ_{z=i}^{L} τ_z. Using Lemma 6 (1) and Inequality (3), the expected number of items packed at level i is at most a constant fraction of the size of level i.□
    Immediate reporting. As soon as the count of an item k in RAM reaches a search threshold of ɸN − Σ_{i=1}^{L} τ_i, the data structure triggers an immediate-reporting query, which sweeps all L levels, consolidates the counts of k at all levels into RAM, and reports k if the consolidated count reaches threshold ɸN. Reported items are remembered, so that each event is reported exactly once.
    Analysis. Next, we prove correctness of the Immediate-report LERT and analyze its I/O complexity. We set the growth factor r to the value that minimizes the insertion cost (in Theorem 9).
    First, we prove that the Immediate-report LERT reports all ɸ-events as soon as they occur.
    Lemma 8.
    Let S be a stream of size N where the item counts follow a power-law distribution with exponent θ > 2. The Immediate-report LERT solves the TED problem with immediate reporting on S with high probability.
    Proof.
    Let ĉ_i denote the count estimate of key i in RAM in the Immediate-report LERT. Because of the threshold invariant, at most Σ_{j=1}^{L} τ_j instances of a key can be stored on disk at any time.
    Suppose ĉ_i, the count in RAM for key i, is incremented to the search threshold ɸN − Σ_{j=1}^{L} τ_j at time t. This triggers an immediate-report query. The counts from all levels of the disk are added to ĉ_i to give an accurate count c_i. Because of the threshold invariants, we have c_i ≤ ɸN. If c_i = ɸN, then we report item i at time t, exactly as its count reaches the reporting threshold. Otherwise, the system sets a bit to indicate that the count includes all occurrences of the key in the data structure; when this (accurate) count c_i reaches ɸN, it is reported immediately.□
    Next, we analyze the I/O complexity of the Immediate-report LERT. Similarly to Section 3.2, we assume the levels of the Immediate-report LERT are implemented as a cascade filter [14].
    Theorem 9.
    Let S be a stream of size N where the item counts follow a power-law distribution with exponent θ > 2. Then the Immediate-report LERT can solve the TED problem on S w.h.p. for sufficiently large thresholds ɸN, where the precise lower bound on ɸN depends on θ, M, and N. The amortized I/O complexity of the data structure is O((1/B) log(N/M)) per stream item.
    Proof.
    During a shuffle-merge at level i, the items that are not packed are flushed down to level i + 1, incurring an I/O cost of O(r^{i+1} M/B). We can charge this cost to the unpacked items that get flushed down, so the amortized cost is O(r/B) I/Os per item moved down one level. An item can move down at most L times, and thus the amortized insert cost is O((r/B) log_r(N/M)). This cost is minimized at r = e.
    We perform at most one query each time an item's count in RAM reaches the search threshold. We upper bound the total number of items in the stream that have count at least c, for c at least the search threshold.
    The total number of items in the stream with count at least c is O(U c^{1−θ}) by Lemma 6 (1). Using the lower bound on c, we can bound the total number of such queries.
    A query in a cascade filter costs O(L) I/Os, as it requires one I/O per level. Thus, the overall amortized I/O complexity of the queries over N elements is dominated by the insertion cost for thresholds ɸN in the range stated in the theorem.
    Putting it all together, substituting r = e, and ignoring multiplicative factors of θ, we conclude that the amortized I/O complexity of the data structure is O((1/B) log(N/M)).□
    Remark 2.
    The relationship between the reporting threshold ɸN and the power-law exponent θ identified in Theorem 9 for the Immediate-report LERT is a generalization of the relationship presented in Reference [57], which provides a weaker bound.
    Supporting smaller reporting thresholds for power-law streams. The Immediate-report LERT, tailored for power-law streams, lets us support smaller reporting thresholds ɸN than the Misra-Gries LERT supports for immediate reporting (in particular, the reporting threshold in Corollary 1). To see why this is true, notice that the lower bound on ɸN in Theorem 9 consists of two terms multiplied together: the first term depends only on θ, and for the range of interest it is a small constant, while the second term decreases exponentially as θ increases. Conversely, the reporting threshold ɸN that the Misra-Gries LERT can support must be substantially larger.
    Thus, under reasonable conditions on N, M, and θ, the Immediate-report LERT can support smaller reporting thresholds for immediate reporting than previous data structures.

    5.2 Count-stretch LERT

    In this section, we show that if we eliminate expensive immediate-reporting queries from the Immediate-report LERT, then the data structure still supports bounded-delay reporting with a count-dependent delay. We say that a TED algorithm has count stretch β if it reports each key by the time its count hits βɸN. In particular, the notion of count stretch relaxes the reporting threshold, which leads to reduced random disk accesses.
    The count-stretch LERT is the following modification of the Immediate-report LERT: We eliminate immediate-reporting queries and report an item when its count in RAM hits ɸN. The data structure layout, thresholds and shuffle-merges (including reporting during shuffle-merges) are the same as in the Immediate-report LERT.
    A count-stretch guarantee does not imply any time-stretch guarantee. This is because the item's arrival distribution may be irregular: a sudden burst may give a key a count of ɸN quickly, with unfortunate shuffle-merge timing moving the maximum number of occurrences to disk before the RAM count hits ɸN. It could then take much longer to get from the ɸNth occurrence to the occurrence that finally triggers the report.
    Theorem 10.
    Let S be a stream of size N where the item counts follow a power-law distribution with exponent θ > 2, and let parameters ɸ, Ω be such that Σ_{i=1}^{L} τ_i ≤ ΩɸN. Then the count-stretch LERT solves the TED problem on S w.h.p. with count stretch 1 + Ω at an amortized I/O cost of O((1/B) log(N/M)) per stream item.
    Proof.
    The amortized I/O complexity of the count-stretch LERT follows from the insertion cost in Theorem 9, without the expensive immediate-reporting queries. Recall that the insertion cost is minimized by setting r = e.
    For a count stretch of 1 + Ω, it is sufficient to show that when an item hits a count of ɸN in RAM, there are at most ΩɸN occurrences of that item stored in the lower levels of the data structure on disk.
    By the threshold invariant of the count-stretch LERT, we can bound the total occurrences of an item in levels 1, …, L on disk by Σ_{i=1}^{L} τ_i. Below, we show that this quantity is at most ΩɸN.
    For the thresholds as set above, we can upper bound this sum directly. Since Σ_{i=1}^{L} τ_i ≤ ΩɸN by assumption, it follows that there can be at most ΩɸN occurrences of an item stored on disk at any time. Thus, when the count estimate of an item in RAM reaches ɸN, its true count is at most (1 + Ω)ɸN.□
    Remark 3.
    For power-law exponents in the range typically observed in practice [1, 16, 23, 30, 54], the relevant term in Theorem 10 is a small constant, and thus the count-stretch LERT can support parameters ɸ and Ω with small Ω.
    Remark on dynamically setting thresholds. If the power-law exponent θ is not known ahead of time, but a feasible setting of level thresholds exists, then we can dynamically update the thresholds to ensure that no level of the data structure has too many packed items. In particular, to satisfy Lemma 7, it is sufficient to ensure that the number of items packed at any level i does not exceed its size.
    We incrementally update the level thresholds to satisfy this condition as follows. Initially, τ_i = 0 for each level i. During a shuffle merge involving the first j levels on disk, we set τ_j to the minimum value such that the number of keys packed at level j is no more than half its size. Thus, we increment the τ_i's monotonically from 0 to their feasible settings, without relying on the exponent θ.
    Summary. With a power-law distribution, we can support a much lower threshold ɸ for the TED problem. In the Misra-Gries LERT (Section 3.1), the upper bounds on the counts at each level are implicit. We show that for power-law distributions, we can achieve better performance by explicitly setting these bounds in the form of thresholds.

    5.3 Implementation of Count-stretch and Immediate-report LERT

    We describe the implementation details of the count-stretch and Immediate-report LERT, including further optimizations. Similarly to the time-stretch LERT, each level is an exact counting quotient filter [56]. In the count-stretch LERT, in addition to the count of each key, we store a few additional bits to mark whether an item has its absolute count at a level (i.e., its aggregate count across all the levels).
    Similarly to the flush schedule in the time-stretch LERT, we follow a fixed shuffle-merge schedule. A shuffle-merge is invoked from RAM after every M observations. The level thresholds determine how many instances of an item can be stored at that level. To satisfy threshold constraints, during a shuffle merge, we first aggregate the count of each item and then smear it across all levels involved in the shuffle-merge in a bottom-up fashion without violating the thresholds. Algorithm 3 shows the pseudocode for the shuffle-merge in a count-stretch LERT.
    Optimization. We also implement an optimization in the count-stretch LERT that further reduces I/O costs by following a “greedy” flushing schedule instead of a fixed schedule. This is based on the observation that unlike time stretch, the count stretch does not depend on the number of observations in the stream. Therefore, we do not need to perform shuffle merges at regular intervals. We only invoke a shuffle-merge if it is needed, i.e., when the RAM is at capacity. The greedy flushing optimization is implemented as an additional input flag that can be turned on or off.
    The CQF uses a variable-length encoding for storing counts and uses much less space compared to a unary-encoding. Therefore, the actual number of slots needed for storing M observations can be much smaller than M slots, if there are duplicates in the stream. This is the case for streams such as the one from Firehose, where counts have a power-law distribution. The greedy shuffle-merge schedule avoids unnecessary I/Os that a fixed schedule would incur during shuffle-merges.
    As explained in Section 5.1, in the Immediate-report LERT we perform an immediate-reporting query when the count in RAM reaches the search threshold. To compute the aggregate count, we perform point queries to each level on disk and aggregate the counts. If the aggregate count in RAM and on disk is T, then we report the item. Otherwise, we insert the aggregate count in RAM and set a bit, the absolute bit, that indicates that all the counts for the item have been found. This avoids unnecessary point queries to disk later on. We use a lazy policy to delete the instances of items from disk: they are garbage collected during the next shuffle merge.

    6 Deamortization to Support Consistent Ingestion Rates

    The LERTs consider observation t to occur exactly one timestep before observation t + 1. In practice, however, observation t might trigger a significant rebuild of the data structure, delaying observation t + 1. In a high-speed streaming context, that observation, and potentially millions after it, would be dropped while a rebuild is going on.
    To mitigate this problem, we now describe how to deamortize LERTs. Our deamortization strategy works in serial, and also provides the foundation of the multithreading strategy we introduce in Section 7.
    To deamortize, we decompose the data structure into C independent parts called cones that partition the space of hashed items. Each stream item is mapped to exactly one of these cones using a uniform-random hash function. A cone is an independent instance of the LERT with the same expansion factor r and the same number of levels, each of which is 1/C-th the size of the corresponding complete level.
    Each cone is independent, following its own merge schedule. Incoming items are routed to the appropriate cone for independent insertion and potential reporting. Thus, given uniform-random hashing, each cone accounts for roughly a 1/C fraction of the aggregate I/O. A sketch of the routing appears below.
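    Routing costs one hash evaluation per observation; a sketch (kNumCones is an assumed power of two here, so that the low bits of the hash select the cone):

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Map each item to one of C cones using a uniform hash; all occurrences of
// an item land in the same cone, so each cone is an independent LERT over
// roughly a 1/C fraction of the stream.
constexpr std::uint64_t kNumCones = 64;      // must be a power of two here

std::uint64_t cone_of(const std::string& item) {
    std::uint64_t h = std::hash<std::string>{}(item);
    return h & (kNumCones - 1);              // low bits select the cone
}
```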
    Deamortization timeliness guarantees. We consider the timeliness guarantees for the deamortized serial versions of the count-stretch and time-stretch LERTs. When streams are split into substreams based on hash values, we must revisit these guarantees. We note that count stretch is unaffected:
    Lemma 11.
    A deamortized count-stretch LERT provides the same count stretch guarantee as the original count-stretch LERT when run on the same input stream.
    Proof.
    The count stretch of an item in a count-stretch LERT depends only upon the item’s final count when it is reported. This final count is independent of the rest of stream. In the deamortized count-stretch LERT, all observations of an item go to a single cone, and each cone independently provides the same count stretch as the amortized count-stretch LERT for items mapped to that cone.□
    Lemma 12.
    There exists an input stream for which the deamortized time-stretch LERT provides no global time stretch guarantee.
    Proof.
    We construct an arrival distribution that causes an arbitrarily long time stretch for an item in a deamortized time-stretch LERT. It begins with some observations of an item I, followed by enough distinct items, all mapping to item I's cone C, to cause a flush in cone C. The sequence then has one more observation of item I, followed by an arbitrarily long sequence of observations, none of which go to cone C. Thus, cone C has an arbitrary delay before its next merge, and item I has an unbounded reporting delay.□
    Theorem 13.
    Consider a random stream where each arriving item maps to a cone via a fixed probability distribution. If each cone runs a time-stretch LERT guaranteeing a time stretch of 1 + α, then the deamortized time-stretch LERT will have a time stretch of 1 + α in expectation with respect to the full stream.
    Proof.
    Suppose each item maps to cone i with probability p_i. Consider a key k that maps to cone i, with its first appearance at index I_1 and its Tth occurrence at index I_T. Let D = I_T − I_1. The time-stretch LERT without cones will report k by time (index) T_D = I_T + αD. In the deamortized version, cone i receives p_i D items between indices I_1 and I_T in expectation. So it will report k when another αp_i D items arrive at cone i. But cone i should receive that many items in the αD items after I_T, in expectation. Thus we expect cone i to report k by time T_D. A similar argument holds when the stream is a random permutation of a finite stream with elements from cone i.□

    7 Multi-threading

    We now describe thread-safe versions of the deamortized count-stretch and time-stretch LERT. A thread-safe implementation enables ingesting observations using multiple threads. This is crucial for two reasons: (1) We can scale the ingestion throughput to support high-speed streams, and (2) multiple threads performing I/Os simultaneously can utilize the full SSD bandwidth, which would be wasted otherwise.
We use two types of locks in our design: a cone-level lock and a CQF-level lock. The cone-level lock is a distributed readers-writer lock implemented using a partitioned counter (i.e., a per-CPU counter). This ensures that readers do not thrash on the cache line containing the count of readers holding the lock. The CQF-level lock is a spin lock, as described by Pandey et al. [56].
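A minimal sketch of such a distributed readers-writer lock appears below (C++). The partition count, the try-only acquisition style, and all names are our assumptions for illustration; the sketch simplifies away details of the actual lock:

    #include <atomic>

    // Each reader count sits on its own cache line, so concurrent readers on
    // different CPUs do not invalidate each other's caches.
    struct alignas(64) PaddedCounter { std::atomic<int> n{0}; };

    struct PartitionedRWLock {
      static constexpr int kPartitions = 64;  // e.g., one per CPU (assumption)
      PaddedCounter readers[kPartitions];
      std::atomic<bool> writer{false};

      bool try_read_lock(int cpu) {
        PaddedCounter &slot = readers[cpu % kPartitions];
        slot.n.fetch_add(1, std::memory_order_acquire);
        if (writer.load(std::memory_order_acquire)) {
          slot.n.fetch_sub(1, std::memory_order_release);  // writer active
          return false;
        }
        return true;
      }
      void read_unlock(int cpu) {
        readers[cpu % kPartitions].n.fetch_sub(1, std::memory_order_release);
      }
      bool try_write_lock() {
        bool expected = false;
        if (!writer.compare_exchange_strong(expected, true)) return false;
        for (auto &slot : readers)          // wait out in-flight readers
          while (slot.n.load(std::memory_order_acquire) > 0) { /* spin */ }
        return true;
      }
      void write_unlock() { writer.store(false, std::memory_order_release); }
    };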
    We assign a small local insertion buffer to each thread. See Figure 2. Each insertion thread performs the same set of operations. It starts by first receiving a packet of observations over a network port or reading a small chunk (usually 1,024) of observations from an input file. It then processes each observation in the packet one-by-one.
    Fig. 2.
    Fig. 2. A depiction of multi-threading with cones in a LERT.
Each thread must acquire two locks to perform an insertion: a read lock on the item’s cone and a lock on the region of the CQF (i.e., the RAM level of the cone) to which the item hashes. It tries once to acquire each lock, neither spinning nor sleeping on failure. If it does not get both locks on the first attempt, then it releases any lock it did acquire, inserts the observation into its local insertion buffer, and continues to the next observation. When the local buffer is full, the thread dumps the buffered items into their respective cones; only when dumping a buffer does a thread wait for the locks.
If a thread acquires both locks on the first attempt, then it performs the insertion and releases the lock on the relevant region of the CQF. It then checks whether the cone needs to perform a flush or shuffle-merge. If so, then it first releases the read lock and tries to acquire a write lock on the cone. If it gets the write lock on the first attempt, then it performs the flush/shuffle-merge; if it fails, then some other thread is already performing a flush/shuffle-merge, and this thread can continue.
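The sketch below summarizes this try-once control flow (C++17, with standard mutexes standing in for the custom locks and stub types for the cone and buffer; all names are hypothetical, not the authors' code):

    #include <cstdint>
    #include <mutex>
    #include <shared_mutex>
    #include <vector>

    struct Cone {
      std::shared_mutex lock;  // stand-in for the distributed readers-writer lock
      std::mutex region;       // stand-in for the CQF-level spin lock
      bool insert(uint64_t) { return false; }  // true if a flush is now needed
      void flush_or_shuffle_merge() {}
    };

    void process(uint64_t key, Cone *cones, size_t num_cones,
                 std::vector<uint64_t> &local_buffer) {
      Cone &c = cones[key % num_cones];
      if (!c.lock.try_lock_shared()) {        // one attempt, no spinning
        local_buffer.push_back(key);
        return;
      }
      if (!c.region.try_lock()) {
        c.lock.unlock_shared();
        local_buffer.push_back(key);
        return;
      }
      bool needs_flush = c.insert(key);
      c.region.unlock();
      c.lock.unlock_shared();
      if (needs_flush && c.lock.try_lock()) { // upgrade to write lock, once;
        c.flush_or_shuffle_merge();           // on failure, another thread is
        c.lock.unlock();                      // already flushing this cone
      }
    }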
The local buffers let us avoid heavy contention among threads, even when every thread tries to lock the same cone, because threads never wait to acquire a cone lock for an individual insertion and thus continue to make progress. Also, item counts are consolidated in local buffers, so during a buffer dump only one insertion per distinct item is required, instead of one insertion per occurrence of the same item. Our method scales well with an increasing number of insertion threads, even for streams with skewed distributions. We show this empirically in Section 8.8.
Using readers-writer locks at the cone level allows multiple threads to simultaneously insert into different regions of the RAM CQF of a cone by acquiring a read lock. A thread upgrades to a write lock when it needs to do a flush/shuffle-merge. Readers-writer locks also allow us to use more threads than cones: even if all cones flush simultaneously, there would still be threads processing incoming observations.

    7.1 Timeliness with Multi-threading

    We now discuss the effect of multithreading on the timeliness guarantees of the count-stretch and time-stretch LERT.
    Measuring time. One issue that immediately arises when trying to analyze time- and count-stretch in the multi-threaded case is: How do we measure time? In the single-threaded case, we measure time in terms of the number of stream observations that the process has ingested, i.e., in each timestep, the algorithm gets to read one stream observation, perform an arbitrary amount of computation and I/O, and generate an arbitrary number of reports. We say all reports generated during the ith timestep occur at time i.
We generalize this in the multi-threaded model: when a thread reports items, it uses the index of the last observation pulled by any thread as the reporting time. This can cause the reporting index of an item to be much higher than in the single-threaded case, because multiple threads each pull a chunk (usually 1,024 observations) simultaneously. Therefore, multi-threading adds an extra delay to the timeliness guarantees of the time-stretch LERT and extra counts to the guarantees of the count-stretch LERT. We analyze this empirically in Section 8.4.
Count stretch. The multi-threaded count-stretch LERT has only one new source of delay: the time that an item might spend sitting in a thread’s local buffer. In the worst case, an item could accumulate up to B occurrences in each thread’s local buffer (where B is the buffer capacity), in addition to ΩT occurrences in the main data structure, so that it does not get reported until it reaches a count of ΩT + PB.
To limit this pathological case, we implement a policy that upper bounds the total count an item can have in a thread’s local buffer: no thread can hold more than T/P instances of an item in its local buffer. Whenever the count of an item in the local buffer reaches T/P, the thread must move that item from its local buffer to the main data structure. This way we can bound the maximum count of an item when it is reported.
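A sketch of this cap (C++; the map-based buffer and all names are illustrative assumptions): whenever a key’s buffered count reaches T/P, the accumulated count is handed off to the key’s cone.

    #include <cstdint>
    #include <unordered_map>

    struct LocalBuffer {
      std::unordered_map<uint64_t, uint32_t> counts;
      uint32_t cap;  // cap = T / P; T = reporting threshold, P = thread count

      // Returns 0 to keep buffering, or the accumulated count to push into
      // the key's cone once the per-thread cap is reached.
      uint32_t add(uint64_t key) {
        uint32_t c = ++counts[key];
        if (c < cap) return 0;
        counts.erase(key);
        return c;
      }
    };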
    Lemma 14.
Given Ω and T, and a cap of T/P on the number of instances of an item in each local buffer, where P is the number of threads, a multi-threaded count-stretch LERT guarantees a count stretch of Ω + 1.
    Proof.
Because the maximum count of an item in a thread’s local buffer is T/P, for P threads the maximum buffered count for any item is T. An individual cone with count-stretch guarantee Ω will report an item when it holds at most ΩT instances of that item. Thus the maximum number of instances in the system at the time of the report is ΩT + T = (Ω + 1)T.□
    Time stretch. It is harder to provide a time-stretch guarantee with multiple threads compared to the count-stretch guarantee. This is because time stretch depends on the arrival distribution of other items in the stream, while count stretch is independent of that.
    When multiple threads are simultaneously performing ingestion, each thread can pick a chunk of observations from the stream. These observations can be inserted in the data structure out-of-order based on the contention among threads. To guarantee a time stretch with multiple threads we need a global ordering on the observations.
Model. In each timestep, a thread gets to read one observation from the stream and perform all the work for that observation. The work includes taking a lock and inserting the observation into the cone, inserting the observation into the local buffer, dumping the contents of the local buffer into cones, and performing a flush/shuffle-merge on the cone. As above, we constrain how long a thread can go before dumping its local buffer: every thread must dump its local buffer every t timesteps.
    Based on the above model and constraints, we can now guarantee that the time stretch in the multi-threading case will not be much worse than the single-threaded case.
    Observation 1.
In a multi-threaded time-stretch LERT in which each thread dumps its local buffer every t timesteps, we guarantee that an item s is reported in at most ε F_s + Pt additional timesteps (after the item count reaches T), where F_s is the flow time of s and P is the number of threads.

    8 Evaluation

    In this section, we evaluate our implementations of the time-stretch LERT (TSL), count-stretch LERT (CSL), and Immediate-report LERT (IRL) for timeliness, robustness to input distributions, I/O performance, insertion throughput, and scalability with multiple threads. Our implementation is publicly available at https://github.com/splatlab/lerts.
    We compare our implementations against Bender et al.’s cascade filter [14] as a baseline for timeliness. This baseline is an external-memory data structure with no timeliness guarantee. We show that reporting delays can be quite large when data structures take no special steps to ensure timeliness.
    We also evaluate an implementation of the Misra-Gries data structure as a baseline for in-memory insertion throughput. We implement the Misra-Gries data structure with an exact counting data structure (counting quotient filter) to forbid false positives. This gives an upper bound on the insertion throughput one can achieve in-memory while performing immediate event-detection. The objective of this baseline is to evaluate the effect of disk accesses during flushes/shuffle-merges in our implementations of the TSL, CSL, and IRL.
We address the following performance questions for the time-stretch, count-stretch, and immediate-report LERTs:
(1) How does the empirical timeliness of reported items compare to the theoretical bounds?
(2) How robust is the time-stretch LERT to different input distributions?
(3) How do deamortization and multi-threading affect the empirical timeliness of reported items?
(4) How does the buffering strategy affect count stretch and throughput?
(5) How does LERT total I/O compare to the theoretical bounds?
(6) What is the insertion throughput of the time-stretch, count-stretch, and immediate-report LERTs?
(7) How do deamortization and multiple threads affect instantaneous throughput?
(8) How does insertion throughput scale with the number of threads?

    8.1 Experimental Setup

In this section, we describe how we designed experiments to answer the questions above and describe our workloads.
Our experiments fall into two categories: validation experiments and scalability experiments. The validation experiments require an offline analysis of the dataset to compute the lifetime and measure the stretch of every key. We use smaller datasets (64 million observations) for the validation experiments and bigger datasets (4 billion observations) for the scalability experiments.
Workload. Firehose [5] is a suite of benchmarks simulating a network-event monitoring workload. A Firehose benchmark consists of a generator that feeds keys to the analytic being benchmarked. The analytic must detect and report each key that occurs 24 times.
Firehose includes two generators: the power-law generator selects from a static ground set of 100,000 keys according to a power-law distribution, while the active-set generator allows the ground set to drift over an infinite key space. We use the active-set generator, because an infinite key space more closely matches many real-world streaming workloads. To simulate a stream of keys drawn from a huge key space, we increase the active-set size to one million.
Figure 3 shows the distribution of birthtime (the index of the first occurrence of an item) vs. the lifetime (the number of observations between the first and the Tth occurrence) of items in a stream from the active-set generator. The stream contains 50M observations and the active-set size is 1M.
    Fig. 3.
    Fig. 3. Birthtime vs. the lifetime of each reportable item in the active-set generator dataset consisting of 50M observations.
The longest lifetime is ≈22M. Whenever a new item is added to the active set, it is assigned a count value, drawn from the set of possible counts according to the power-law distribution. Therefore, we see bands of items that have similar lifetimes but are born at different times throughout the stream. The lifetimes of items in these bands tend to increase slightly for items born later in the stream, due to the different selection probabilities of items in the active set. In all of our experiments we use datasets from the active-set generator unless noted otherwise.
Other workloads. Apart from Firehose, we use four other simulated workloads to evaluate the empirical stretch of the time-stretch LERT. These four workloads are generated to show the robustness of the data structure to non-power-law distributions. In the first distribution, M keys (where M is the size of the level in RAM) appear with a count between 24 and 50, and the rest of the keys are chosen uniformly at random from a big universe. In the second, M keys appear 24 times and the rest of the keys appear 23 times. In the third, M keys appear round-robin, each with a count of 24. In the fourth, for each key we pick the count uniformly at random between 1 and 25.
Reporting. During insertion, we record each reported item and the index in the stream at which the data structure reports it. We record this by inserting the reported item into an exact CQF (the anomaly CQF) and encoding the index as the count of the item in the anomaly CQF. We also use the anomaly CQF to check whether an incoming item has already been reported, and we only insert an item that has not yet been reported. This prevents duplicate reports.
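The sketch below shows this bookkeeping (C++, with a std::unordered_map standing in for the exact anomaly CQF; the names are ours):

    #include <cstdint>
    #include <unordered_map>

    struct AnomalyTable {
      // key -> stream index at which the key was reported
      std::unordered_map<uint64_t, uint64_t> reported;

      bool already_reported(uint64_t key) const {
        return reported.count(key) != 0;
      }
      // Record a report; later occurrences of the key are then dropped.
      void record(uint64_t key, uint64_t stream_index) {
        reported.emplace(key, stream_index);
      }
    };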
Timeliness. For the timeliness evaluation, we measure each item’s reporting delay after its Tth occurrence. We have two measures of timeliness: time stretch and count stretch.
The time-stretch LERT upper bounds the reporting delay of an item based on its lifetime (i.e., the time between its first and Tth instance). To validate the timeliness of the time-stretch LERT, we first perform an offline analysis of the stream and calculate the lifetime of each reportable item. Given a reporting threshold T, we record the index of the first occurrence of the item (I_1) and the index of the Tth occurrence of the item (I_T). During ingestion, we record the index (I_R) at which the time-stretch LERT reports the item. We calculate the time stretch (ts) for each reported item as ts = (I_R − I_1)/(I_T − I_1) and verify that ts ≤ 1 + ε.
    Multiple threads process chunks of 1024 observations from the input stream. We consider all reports a thread generates while processing the ith observation to occur at time i. Due to concurrency, two observations of the same key may be inserted into the data structure in a different order than they are pulled off of the input stream. This may introduce some noise in our time-stretch measurements. However, our experimental results with and without multi-threading were nearly identical, indicating that the noise is small.
In the count-stretch LERT, the upper bound is on the count of the item when it is reported. To validate timeliness, we first record the index at which each item is reported by the count-stretch LERT (I_R). We then perform an offline analysis to determine the count of the item at index I_R in the stream (C_R). We then calculate the count stretch (cs) as cs = C_R/T and validate that cs is at most the count-stretch guarantee.
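Both validation computations are simple ratios; in C++ (the index and count names follow the definitions above; hypothetical helpers, assuming the offline analysis supplies the indices):

    #include <cstdint>

    // I1: index of first occurrence; IT: index of the Tth occurrence;
    // IR: index at which the data structure reported the key.
    double time_stretch(uint64_t I1, uint64_t IT, uint64_t IR) {
      return double(IR - I1) / double(IT - I1);  // validated against 1 + eps
    }

    // CR: the key's true count in the stream at reporting index IR.
    double count_stretch(uint64_t CR, uint64_t T) {
      return double(CR) / double(T);             // validated against the bound
    }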
To perform the offline analysis of the stream, we first generate the stream from the active-set generator and dump it to a file. We then read the stream from the file, both for the analysis and for streaming it to the data structure. For the timeliness-validation experiments, we use a stream of 512 million observations from the active-set generator.
I/O performance. In our implementations of the time-stretch, count-stretch, and immediate-report LERT, we allocate space for the data structure by mmap-ing each level (i.e., each CQF) to a file on SSD. To force the data structure to keep all levels except the first on SSD, we limit the RAM available to the insertion process using the “cgroups” utility in Linux. We calculate the total RAM needed by the insertion process to keep only the first level in RAM by adding the size of the first level, the space used by the anomaly CQF to record reported keys, the space used by thread-local buffers, and a small amount of extra space to read the stream sequentially from SSD. We then provision the next power of two of this total as the process’s RAM.
To measure the total I/O performed by the data structure, we use the “iotop” utility in Linux, which reports the total amount of reads and writes (in KB) performed by the insertion process.
To validate, we calculate the total I/O the data structure should perform, based on the number of merges performed by the time-stretch LERT (shuffle-merges, in the case of the count-stretch LERT) and the sizes of the levels involved in those merges.
As with the empirical stretch validation, we first dump the stream to a file and then feed it to the data structure by streaming it from the file. We use a stream of 64 million observations from the active-set generator.
    Average insertion throughput and scalability. To measure the average insertion throughput, we first generate the stream from the active-set generator and dump it in a file. We then feed the stream to the data structure by streaming it from the file and measure the total time.
To evaluate scalability, we measure how data-structure throughput changes with an increasing number of threads. We evaluate power-of-2 thread counts between 1 and 64.
To deamortize the data structures, we divide them into 2,048 cones. We use a stream of 4 billion observations from the active-set generator. We evaluate insertion performance and scalability for three values (16, 32, and 64) of the DatasetSize-to-RAM ratio (i.e., the ratio of the dataset size to the available RAM).
Instantaneous insertion throughput. We also evaluate the instantaneous throughput of the data structure when run using either a single cone and thread or multiple cones and threads. We approximate instantaneous throughput by calculating throughput (using system timestamps) after each fixed-size batch of observations.
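A windowed measurement of this kind can be implemented in a few lines (C++; the window size and all names are arbitrary choices for illustration, not the authors' harness):

    #include <chrono>
    #include <cstdint>
    #include <cstdio>

    // Print throughput once per fixed-size window of observations.
    void note_observation(uint64_t seen, uint64_t window,
                          std::chrono::steady_clock::time_point &window_start) {
      if (seen % window != 0) return;
      auto now = std::chrono::steady_clock::now();
      double secs = std::chrono::duration<double>(now - window_start).count();
      std::printf("instantaneous throughput: %.0f observations/sec\n",
                  window / secs);
      window_start = now;
    }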
Machine specifications. The OS for all experiments was 64-bit Ubuntu 18.04 running Linux kernel 4.15.0-34-generic. The machine for all timeliness and I/O performance benchmarks had an Intel Skylake CPU (Core i7-6700HQ @ 2.60 GHz with 4 cores and 6 MB L3 cache), 32 GB RAM, and a 1-TB Toshiba SSD. The machine for all scalability benchmarks had an Intel Xeon CPU (E5-2683 v4 @ 2.10 GHz with 64 cores and 20 MB L3 cache), 512 GB RAM, and a 1-TB Samsung 860 SSD.
    For all the experiments, we use a reporting threshold of 24, since it is the default in the Firehose benchmarking suite.

    8.2 Timely Reporting

Cascade filter. Figure 4(a) and (b) show the distributions of count stretch and time stretch of reported items in the cascade filter. The cascade filter’s maximum count stretch is 3.0 and its maximum time stretch is 12, much higher than for any single-threaded count-stretch or time-stretch LERT.
    Fig. 4.
Fig. 4. Data structure configuration: RAM level: 8,388,608 slots in the CQF, levels: 4, growth factor: 4, level thresholds: (2, 4, 8), cones: 8, threads: 8, number of observations: 512M. Data structures: CF, count-stretch LERT (CSL), time-stretch LERT (TSL), (CSL and TSL) with cones, (CSL and TSL) with cones and threads. Time-stretch LERT with age bits 1 (TSL1), 2 (TSL2), 3 (TSL3), and 4 (TSL4).
Count-stretch LERT. Figure 4(a) validates the worst-case count stretch of the count-stretch LERT. The total on-disk count for an element is 14, so the maximum possible count when reported is 38 (i.e., 24 + 14), for a maximum count stretch of 1.583. The maximum reported count stretch is 1.583.
Time-stretch LERT. Figure 4(b) shows that the time-stretch LERT meets the time-stretch requirements. The maximum reported time stretch is 1.59, which is smaller than the maximum allowable time stretch of 2. Figure 4(c) shows the distribution of empirical time stretches for varying numbers of age bits. The time stretch of any reported element is always smaller than the maximum allowable time stretch. As the number of age bits increases, ε decreases and so does the time stretch.

    8.3 Robustness with Input Distributions

Figure 5(a) shows the robustness of the empirical time stretch (ETS) on four input distributions other than the Firehose power-law distribution. For all input distributions, the ETS is less than 2, the theoretical limit of the data structure.
    Fig. 5.
    Fig. 5. Data structure configuration: RAM level: 8,388,608 slots in the CQF, levels: 4, growth factor: 4, level thresholds for on-disk level: (2, 4, 8), cones: 8, threads: 8, number of observations: 512M.

    8.4 Effect of Deamortization/Threading

    Figure 4(a) and (b) show the effect of deamortization and multi-threading on timeliness in the count-stretch LERT and time-stretch LERT.
    Using 8 cones instead of one does not change the timeliness of any reported item. This is because the distribution of items in the stream is random (see Section 8.1) and we use a uniform-random hash function to distribute items to each cone. Each cone gets a similar number of items and the cones perform shuffle-merges in sync (refer to Section 6).
    Running the count-stretch and time-stretch LERT with 8 cones and 8 threads does affect timeliness of reported items. Some items are reported later than the theoretical upper bound. The reported maximum time- and count-stretch is 5. This is because each thread inserts items into a local buffer when it cannot immediately acquire the cone lock. We empty local buffers only when they are full. The maximum delay happens when an item’s lifetime is similar to the time it takes for a cone to incur a full flush involving all levels of the data structure. Figure 6 shows the stretch of reported items and their lifetime. The maximum-stretch items have a lifetime ≈16 M observations, which is the number of observations it takes for a cone to incur a full flush.
    Fig. 6.
    Fig. 6. Data structure configuration: RAM level: 8,388,608 slots in the CQF, levels: 4, growth factor: 4, level thresholds for on-disk level: (2, 4, 8), cones: 8, threads: 8, number of observations: 512M.

    8.5 Effect of Buffering

Figure 5(b) shows the empirical count stretch with three different buffering strategies. In the first, we use buffers without any constraint on the count of a key inside a buffer; we dump the buffer into the main data structure when it is full. In the second, we constrain the maximum count a key can have in a buffer to T/P (for T = 24 and P = 8, the max count is 3). In the third, we do not use buffers: threads try to acquire the lock on the cone and wait if the lock is not available.
The empirical stretch is smallest without buffers. However, not using buffers increases contention among threads and reduces insertion throughput; using buffers is faster.

    8.6 I/O Performance and Throughput

Figure 7 shows the total amount of I/O performed by the count-stretch, time-stretch, and immediate-report LERT while ingesting a stream. For all data structures, the calculated total I/O and the total I/O measured using iotop are similar.
    Fig. 7.
Fig. 7. Total I/O performed by the count-stretch, time-stretch, and immediate-report LERT. Data structure configuration: RAM level: 4,194,304 slots in the CQF, levels: 3, growth factor: 4, number of observations: 64M.
The count-stretch LERT does the least I/O, because it performs the fewest shuffle-merges. The I/O for the time-stretch LERT grows by a factor of two as the number of bins increases, as predicted by the theory. The I/O for the immediate-report LERT is similar to that of the time-stretch LERT with stretch 2. This shows that when item counts follow a power-law distribution, we can achieve immediate reporting with the same amount of I/O as with a time stretch of 2.
Insertion throughput. Figure 8(a) shows insertion throughput using the same configuration and stream as the total-I/O experiments. The count-stretch LERT has the highest throughput, because it performs the fewest I/Os. The immediate-report LERT has lower throughput, because it performs extra random point queries. The time-stretch LERT throughput decreases as we add bins and decrease the stretch.
    Fig. 8.
    Fig. 8. Data structure configuration for (a): RAM level: 4,194,304 slots in the CQF, levels: 3, growth factor: 4, number of observations: 64M. DatasetSize-to-RAM-ratio: 12.5. For (b): RAM level: 67,108,864 slots in the CQF, levels: 4, growth factor: 4, level thresholds for on-disk level(): (2, 4, 8), cones: 2,048 with greedy flushing, DatasetSize-to-RAM-ratio: 16, 32, and 64.
The Misra-Gries data structure’s throughput is 2.2 million operations/second in memory. This acts as a baseline for in-memory insertion throughput. The in-memory MG data structure is only twice as fast as the on-disk count-stretch LERT.

    8.7 Instantaneous Throughput

Figure 9 shows the instantaneous throughput of the count-stretch LERT. Deamortization and multi-threading improve both average throughput and throughput variance. With one thread and one cone, the data structure periodically stops processing inputs to perform flushes, causing throughput to crash to 0. With 1,024 cones and four threads, the system has much smoother throughput, never stops processing inputs, and has about 3× greater average throughput.
    Fig. 9.
    Fig. 9. Instantaneous throughput of the count-stretch LERT with 1 cone and 1 thread and 1,024 cones and 4 threads. Same configuration as Figure 4(a).

    8.8 Scaling with Multiple Threads

Figure 8(b) shows count-stretch LERT throughput with an increasing number of threads. Similar scalability holds for the other variants, since they all have the same insertion and SSD access patterns. Insertion throughput increases with thread count. We used three values of the DatasetSize-to-RAM ratio: 16, 32, and 64. All have similar scalability curves.

    9 Motivating National Security Application

In this section, we describe the more complex national-security setting that motivates our modeling constraints. Firehose [4, 5] is a clean benchmark that captures the fundamental elements of this setting. The TED problem in this article in turn distills the most difficult part of the Firehose benchmark. Therefore, our solutions have a direct line of sight to important national-security applications.
    An ideal solution for TED would have (1) no false negatives, (2) no false positives, (3) immediate reporting of a stream element that upon arrival hits the reporting threshold, and (4) speed sufficient to keep up with real sensor data streams. To better allow (1) and (4), in this article we relax (2) and (3). Our algorithms limit false positives to keys that are “close” to reportable and bound reporting delay by either time or count. Our use case explains why we can tolerate these relaxations. It also explains why we cannot relax the no-false-negative requirement. This critical aspect of the model means we cannot consider sampling-based or randomized algorithms for finding reportable items, since these can miss events.
    We are motivated by monitoring systems for national security [4, 5], where experts associate special patterns in a cyberstream to rare, high-consequence real-life events. These patterns are formed by a small number of “puzzle pieces,” as shown in Figure 10. Each piece is associated with a key such as an IP address or a hostname. The pieces arrive over time. When an entire puzzle associated with a particular key is complete, this is an event, which should be reported as soon as the final puzzle piece falls into place. In Figure 10, the first stage is like our TED problem algorithm, except that it must store puzzle pieces with each key rather than a count and the reporting trigger is a complete puzzle, not a count threshold.
    Fig. 10.
Fig. 10. The analysis pipeline that motivates our TED problem solution. Analysts associate a multi-piece pattern, represented by the 4-piece puzzle, with a high-consequence event. The pieces arrive slowly over time, mixed with innocent traffic in a high-throughput “firehose” stream. Our database stores many partial matches to the pattern and reports all complete instances of the pattern. There may still be a fair number of matches, which are pared down by an automated system to a small number (essentially droplets compared to the original stream) of matches worthy of human inspection.
    There can still be a fair number of matches to this special pattern, most of which are still not the critically bad event. This might overwhelm a human analyst, who would then not use the system. However, automated tools, shown in the second stage of Figure 10, can pare these down to the few events worthy of analyst attention.
The first-stage filter, like our TED problem solution, must handle a massively large, fast stream. It is reasonable to allow a few false positives in the first stage to improve its speed; the second stage can screen out almost all of these false positives, as long as the stream is significantly reduced. The second stage is a slower, more careful tool that cannot keep up with the initial stream. This second tool cannot, however, repair false negatives, since anything the first filter misses is gone forever. So the first tool cannot drop any matches to the pattern. Experts have gone to great effort to find a pattern that is a good filter for the high-consequence events. We do not allow false negatives, because the high-consequence events that match this carefully crafted pattern can and must be detected.
Each of these patterns is small with respect to the stream size, so the detection algorithm must be highly sensitive, that is, it must be able to support a small threshold T. The consequences of missing an event (a false negative) are so severe that it is not reasonable to risk facing those consequences just to save a little space. Thus we must save all partial patterns, motivating our use of external memory.
    The ability to tolerate a reporting delay depends upon how much lead time the search pattern gives before possible damage. There will be some additional delay from the second-stage testing. Reports are still “better late than never.” Even if some damage has occurred, the system operators still have significantly more information than they would have if they had received no report.
    The DoD Firehose benchmark captures the essence of this setting [5]. In Firehose, the input stream has (key,value) pairs. When a key is seen for the 24th time, the system must return a function of the associated 24 values. The most difficult part of this is determining when the 24th instance of a key arrives. Thus, like Firehose, the TED problem captures the essence of the motivating application.

    10 Conclusion

    This work bridges external-memory and streaming algorithms. By taking advantage of external memory, we can solve timely event detection problems at a level of precision that is not possible in the streaming model, and with little or no sacrifice in terms of the timeliness of reports.
    Even though streaming algorithms, such as Misra-Gries, were developed for a space-constrained setting, we show that they can be made efficient in the external-memory setting, where storage is plentiful but accessing the data is expensive.

    Acknowledgments

    We thank Tyler Mayer for helpful discussions.
    Image attributions for Figure 10: Fire Hydrant by Claire Jones, skill magic stream by Maxicons, puzzle pieces by Iconika, puzzle pieces by studiographic, water Drop by Aldiki Gustiyan Putra, and man and woman by Alice Design; all icons from the Noun Project (https://nounproject.com).

    Footnotes

    1
It is possible to prevent repeated queries for an item, but we allow them, since they do not hurt the asymptotic performance.
    2
    For each reported item, we set a flag in RAM that indicates it has been reported, to avoid duplicate reporting of events.
    3
In general, the power-law distribution may hold only above some minimum value x_min; for simplicity, we let x_min = 1.
    4
In principle, one could have power-law distributions with exponent α ≤ 1, but these distributions cannot be normalized and are not common [54].
    5
    Each reported item is stored in a separate table in RAM to avoid duplicate reporting of events.

    References

    [1]
    L. A. Adamic. 2008. Zipf, Power Law, Pareto: A ranking tutorial. HP Research. Retrieved from http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html.
    [2]
    Alok Aggarwal and Jeffrey Vitter. 1988. The input/output complexity of sorting and related problems. Commun. ACM 31, 9 (1988), 1116–1127.
    [3]
    Noga Alon, Yossi Matias, and Mario Szegedy. 1996. The space complexity of approximating the frequency moments. In Proceedings of the 28th Annual ACM Symposium on Theory of Computing (STOC’96). 20–29.
    [4]
    Karl Anderson. 2016. FireHose Benchmarking Streaming Architectures. Retrieved July 9, 2021 from https://www.clsac.org/uploads/5/0/6/3/50633811/anderson-clsac-2016.pdf.
    [5]
    Karl Anderson and Steve Plimpton. 2013. FireHose Streaming Benchmarks. Retrieved December 11, 2018 from https://github.com/stream-benchmarking/firehose.
    [6]
    Shivnath Babu and Jennifer Widom. 2001. Continuous queries over data streams. ACM SIGMOD Rec. 30, 3 (2001), 109–120.
    [7]
    Daniel Barbará. 1999. The characterization of continuous queries. Int. J. Cooperat. Inf. Syst. 8, 04 (1999), 295–323.
    [8]
Tim Bartrand, Walter Grayman, and Terra Haxton. 2017. Drinking Water Treatment Source Water Early Warning System State of the Science Review. Technical Report EPA/600/R-17/405.
    [9]
    Ran Ben-Basat, Gil Einziger, Roy Friedman, and Yaron Kassner. 2016. Heavy hitters in streams and sliding windows. In Proceedings of the 35th Annual IEEE International Conference on Computer Communications (INFOCOM’16). IEEE, 1–9.
    [10]
    Michael A. Bender, Jonathan W. Berry, Martin Farach-Colton, Justin Jacobs, Rob Johnson, Thomas M. Kroeger, Tyler Mayer, Samuel McCauley, Prashant Pandey, Cynthia A. Phillips, Alexandra Porter, Shikha Singh, Justin Raizes, Helen Xu, and David Zage. 2018. Advanced Data Structures for Improved Cyber Resilience and Awareness in Untrusted Environments: LDRD Report. Technical Report SAND2018-5404. Sandia National Laboratories.
    [11]
    Michael A. Bender, Alex Conway, Martin Farach-Colton, William Jannen, Yizheng Jiao, Rob Johnson, Eric Knorr, Sara McAllister, Nirjhar Mukherjee, Prashant Pandey, Donald E. Porter, Jun Yuan, and Yang Zhan. 2019. Small refinements to the DAM can have big consequences for data-structure design. In Proceedings of the 31st ACM on Symposium on Parallelism in Algorithms and Architectures (SPAA’19). 265–274.
    [12]
    Michael A. Bender, Martin Farach-Colton, Jeremy T. Fineman, Yonatan R. Fogel, Bradley C. Kuszmaul, and Jelani Nelson. 2007. Cache-oblivious streaming B-trees. In Proceedings of the 19th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA’07). 81–92.
    [13]
Michael A. Bender, Martin Farach-Colton, William Jannen, Rob Johnson, Bradley C. Kuszmaul, Donald E. Porter, Jun Yuan, and Yang Zhan. 2015. An introduction to B-trees and write-optimization. ;login: 40, 5 (Oct. 2015), 22–28.
    [14]
    Michael A. Bender, Martin Farach-Colton, Rob Johnson, Russell Kraner, Bradley C. Kuszmaul, Dzejla Medjedovic, Pablo Montes, Pradeep Shetty, Richard P. Spillane, and Erez Zadok. 2012. Don’t thrash: How to cache your hash on flash. Proc. VLDB 5, 11 (2012), 1627–1637.
    [15]
    Michael A. Bender, Martín Farach-Colton, Rob Johnson, Simon Mauras, Tyler Mayer, Cynthia A. Phillips, and Helen Xu. 2017. Write-optimized skip lists. In Proceedings of the 36th Symposium on Principles of Database Systems (PODS’17). ACM, 69–78.
    [16]
    Radu Berinde, Piotr Indyk, Graham Cormode, and Martin J. Strauss. 2010. Space-optimal heavy hitters with strong error bounds. ACM Trans. Database Syst. 35, 4 (2010), 26.
    [17]
    Kevin Beyer and Raghu Ramakrishnan. 1999. Bottom-up computation of sparse and iceberg cube. In ACM SIGMOD Record, Vol. 28. 359–370.
    [18]
    Arnab Bhattacharyya, Palash Dey, and David P. Woodruff. 2016. An optimal algorithm for l1-heavy hitters in insertion streams and related problems. In Proceedings of the 35th ACM Symposium on Principles of Database Systems (PODS’16). 385–400.
    [19]
    Prosenjit Bose, Evangelos Kranakis, Pat Morin, and Yihui Tang. 2003. Bounds for frequency estimation of packet streams. In Proceedings of the 28th International Colloquium on Structural Information and Communication Complexity (SIROCCO’03). 33–42.
    [20]
    Robert S. Boyer and J. Strother Moore. 1991. MJRTY—A fast majority vote algorithm. In Automated Reasoning. Springer, 105–117.
    [21]
Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, Jelani Nelson, Zhengyu Wang, and David P. Woodruff. 2017. BPTree: An ℓ2 heavy hitters algorithm using constant memory. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS’17). ACM, 361–376.
    [22]
    Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, and David P. Woodruff. 2016. Beating CountSketch for heavy hitters in insertion streams. In Proceedings of the 48th Annual Symposium on Theory of Computing (STOC’16). ACM, 740–753.
    [23]
    Lee Breslau, Pei Cao, Li Fan, Graham Phillips, and Scott Shenker. 1999. Web caching and Zipf-like distributions: Evidence and implications. In Proceedings of the Annual Joint Conference of the IEEE Computer and Communications Societies, Vol. 1. 126–134.
    [24]
    Gerth Stølting Brodal, Erik D. Demaine, Jeremy T. Fineman, John Iacono, Stefan Langerman, and J. Ian Munro. 2010. Cache-Oblivious dynamic dictionaries with update/query tradeoffs. In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’10). 1448–1456.
    [25]
    Gerth Stølting Brodal and Rolf Fagerberg. 2003. Lower bounds for external memory dictionaries. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’03). 546–554.
    [26]
    Adam L. Buchsbaum, Michael Goldwasser, Suresh Venkatasubramanian, and Jeffery R. Westbrook. 2000. On external memory graph traversal. In Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’00). 859–860.
    [27]
    Pedro Celis, Per-Ake Larson, and J. Ian Munro. 1985. Robin hood hashing. In Proceedings of the 26th Annual Symposium on Foundations of Computer Science (sfcs’85). IEEE, 281–288.
    [28]
    Sirish Chandrasekaran and Michael J. Franklin. 2002. Streaming queries over streaming data. In Proceedings of the 28th International conference on Very Large Data Bases. VLDB Endowment, 203–214.
    [29]
    Moses Charikar, Kevin Chen, and Martin Farach-Colton. 2002. Finding frequent items in data streams. In Proceedings of the International Colloquium on Automata, Languages, and Programming (ICALP’02). 693–703.
    [30]
    Aaron Clauset, Cosma Rohilla Shalizi, and Mark EJ Newman. 2009. Power-law distributions in empirical data. SIAM Rev. 51, 4 (2009), 661–703.
    [31]
    Alex Conway, Martin Farach-Colton, and Philip Shilane. 2018. Optimal hashing in external memory. In Proceedings of the 45th International Colloquium on Automata, Languages, and Programming (ICALP’18). 39:1–39:14.
    [32]
    Graham Cormode and Marios Hadjieleftheriou. 2010. Methods for finding frequent items in data streams. VLDB J. 19, 1 (2010), 3–20.
    [33]
    Graham Cormode and S Muthukrishnan. 2004. An improved data stream summary: The count-min sketch and its applications. In Proceedings of the Latin American Symposium on Theoretical Informatics. 29–38.
    [34]
    Graham Cormode and S. Muthukrishnan. 2005. An improved data stream summary: The count-min sketch and its applications. J. Algor. 55, 1 (2005), 58–75.
    [35]
    Graham Cormode and S. Muthukrishnan. 2005. What’s hot and what’s not: Tracking most frequent items dynamically. ACM Trans. Datab. Syst. 30, 1 (2005), 249–278.
    [36]
    Erik D. Demaine, Alejandro López-Ortiz, and J. Ian Munro. 2002. Frequency estimation of internet packet streams with limited space. In Proceedings of the European Symposium on Algorithms (ESA’02). Springer, 348–360.
    [37]
    Xenofontas Dimitropoulos, Paul Hurley, and Andreas Kind. 2008. Probabilistic lossy counting: An efficient algorithm for finding heavy hitters. ACM SIGCOMM Comput. Commun. Rev. 38, 1 (2008), 5.
    [38]
Min Fang, Narayanan Shivakumar, Hector Garcia-Molina, Rajeev Motwani, and Jeffrey D. Ullman. 1998. Computing iceberg queries efficiently. In Proceedings of the 24th International Conference on Very Large Databases (VLDB’98). 299–310.
    [39]
    Jose M. Gonzalez, Vern Paxson, and Nicholas Weaver. 2007. Shunting: A hardware/software architecture for flexible, high-performance network intrusion prevention. In Proceedings of the 14th ACM Conference on Computer and Communications Security (CCS’07). 139–149.
    [40]
    Jiawei Han, Jian Pei, Guozhu Dong, and Ke Wang. 2001. Efficient computation of iceberg cubes with complex measures. In ACM SIGMOD Record, Vol. 30. 1–12.
    [41]
    John Hershberger, Nisheeth Shrivastava, Subhash Suri, and Csaba D. Tóth. 2005. Space complexity of hierarchical heavy hitters in multi-dimensional data streams. In Proceedings of the 24th Symposium on Principles of Database Systems (PODS’05). ACM, 338–347.
    [42]
    John Iacono and Mihai Pătraşcu. 2012. Using hashing to solve the dictionary problem. In Proceedings of the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’12). 570–582.
    [43]
    Richard M. Karp, Scott Shenker, and Christos H. Papadimitriou. 2003. A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst. 28, 1 (2003), 51–55.
    [44]
    M. Kezunovic. 2006. Monitoring of power system topology in real-time. In Proceedings of the 39th Annual Hawaii International Conference on System Sciences, Vol. 10. 244b–244b. DOI:https://doi.org/10.1109/HICSS.2006.355
    [45]
    Kasper Green Larsen, Jelani Nelson, Huy L. Nguyen, and Mikkel Thorup. 2016. Heavy hitters via cluster-preserving clustering. In Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS’16). 61–70.
    [46]
    Quentin Le Sceller, ElMouatez Billah Karbab, Mourad Debbabi, and Farkhund Iqbal. 2017. SONAR: Automatic detection of cyber security events over the twitter stream. In Proceedings of the 12th International Conference on Availability, Reliability and Security.
    [47]
E. Litvinov. 2006. Real-time stability in power systems: Techniques for early detection of the risk of blackout [Book Review]. IEEE Power Energy Mag. 4, 3 (May 2006), 68–70.
    [48]
    Jianning Mai, Chen-Nee Chuah, Ashwin Sridharan, Tao Ye, and Hui Zang. 2006. Is sampled data sufficient for anomaly detection? In Proceedings of the 6th ACM SIGCOMM Conference on Internet Measurement. 165–176.
    [49]
    Gurmeet Singh Manku and Rajeev Motwani. 2002. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases. VLDB Endowment, 346–357.
    [50]
    Chad R. Meiners, Jignesh Patel, Eric Norige, Eric Torng, and Alex X. Liu. 2010. Fast regular expression matching using small TCAMs for network intrusion detection and prevention systems. In Proceedings of the 19th USENIX Conference on Security.
    [51]
    Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. 2005. Efficient computation of frequent and top-k elements in data streams. In Proceedings of the International Conference on Database Theory. Springer, 398–412.
    [52]
    Katsiaryna Mirylenka, Graham Cormode, Themis Palpanas, and Divesh Srivastava. 2015. Conditional heavy hitters: Detecting interesting correlations in data streams. VLDB J. 24, 3 (2015), 395–414.
    [53]
    Jayadev Misra and David Gries. 1982. Finding repeated elements. Sci. Comput. Program. 2, 2 (1982), 143–152.
    [54]
    Mark E. J. Newman. 2005. Power laws, Pareto distributions and Zipf’s law. Contemp. Phys. 46, 5 (2005), 323–351.
    [55]
    Patrick O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O’Neil. 1996. The log-structured merge-tree (LSM-tree). Acta Inf. 33, 4 (1996), 351–385.
    [56]
    Prashant Pandey, Michael A. Bender, Rob Johnson, and Robert Patro. 2017. A general-purpose counting filter: Making every bit count. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD’17). ACM, 775–787. DOI:https://doi.org/10.1145/3035918.3035963
    [57]
    Prashant Pandey, Shikha Singh, Michael A. Bender, Jonathan W. Berry, Martín Farach-Colton, Rob Johnson, Thomas M. Kroeger, and Cynthia A. Phillips. 2020. Timely reporting of heavy hitters using external memory. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1431–1446.
    [58]
    Shahid Raza, Linus Wallgren, and Thiemo Voigt. 2013. SVELTE: Real-time intrusion detection in the Internet of Things. Ad Hoc Netw. 11, 8 (2013), 2661–2674. http://dblp.uni-trier.de/db/journals/adhoc/adhoc11.html#RazaWV13.
    [59]
    Daniel Ting. 2018. Data sketches for disaggregated subset sum and frequent item estimation. In Proceedings of the International Conference on Management of Data. 1129–1140.
    [60]
    Shobha Venkataraman, Dawn Song, Phillip B. Gibbons, and Avrim Blum. 2005. New streaming algorithms for fast detection of superspreaders. In Proceedings of the Network and Distributed Systems Security Symposium (NDSS’05).
    [61]
    H. Yan, R. Oliveira, K. Burnett, D. Matthews, L. Zhang, and D. Massey. 2009. BGPmon: A real-time, scalable, extensible monitoring system. In Proceedings of the Cybersecurity Applications Technology Conference for Homeland Security. 212–223. DOI:https://doi.org/10.1109/CATCH.2009.28
    [62]
    Tong Yang, Haowei Zhang, Jinyang Li, Junzhi Gong, Steve Uhlig, Shigang Chen, and Xiaoming Li. 2019. HeavyKeeper: An accurate algorithm for finding Top-k elephant flows. IEEE/ACM Trans. Netw. 27, 5 (2019), 1845–1858.
    [63]
Yu Zhang, BinXing Fang, and YongZheng Zhang. 2010. Identifying heavy hitters in high-speed network monitoring. Science China Information Sciences 53, 3 (2010), 659–676.
