Timely Reporting of Heavy Hitters Using External Memory

Published: 15 November 2021

Abstract

    Given an input stream S of size N, a ɸ-heavy hitter is an item that occurs at least ɸN times in S. The problem of finding heavy-hitters is extensively studied in the database literature.
    We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = ɸ N-th occurrence (and hence it becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams with a low reporting threshold (high sensitivity).
    Like the classic heavy-hitters problem, solving the TED problem without false-positives requires large space (Ω(N) words). Thus, in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes).
    We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O-bandwidth (not latency) and support a tunable tradeoff between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead.
    We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures can scale to process 11M observations per second before becoming CPU bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device’s random I/O throughput, i.e., ≈100K observations per second.

    1 Introduction

    Real-time monitoring of high-rate data streams, with the goal of detecting and preventing malicious events, is a critical component of defense systems for cybersecurity [46, 58, 61] as well as for physical systems, e.g., for water or power distribution [8, 44, 47]. In such a monitoring system, the stream elements represent the changes to the state of the system. Each detected/reported event could trigger an intervention. Analysts use more specialized tools to gauge the actual threat level. Newer systems are even beginning to take defensive actions, such as blocking a remote host automatically based on detected events [39, 50]. Accuracy (i.e., few false-positives and no false-negatives) and timeliness of event detection are essential to these systems.
    Central to these applications is the problem of timely reporting of heavy hitters. In the heavy-hitters problem, we are given a stream S of N elements and a reporting threshold T = ɸN, and we must report all elements that occur at least T times in S. In the preliminary version of this article [57], we introduced the real-time version of the heavy-hitters problem, called the Timely Event Detection (TED) problem. In the TED problem, each heavy hitter must be reported soon after its Tth occurrence, where the acceptable reporting delay is defined by the application.
    In network-security monitoring applications, N is huge and T can be very small. This is because anomalies in network streams are often small-sized events that develop slowly, appearing normal in the midst of large amounts of legitimate traffic [48, 60]. As an example of the demands placed on event-detection systems, the U.S. Department of Defense (DoD) and Sandia National Laboratories developed the Firehose streaming benchmark suite [4, 5] to measure the performance of TED algorithms. In the Firehose benchmark, the reporting threshold is preset to the representative value of T = 24, i.e., ɸ = 24/N.
    The classic streaming algorithms for reporting heavy-hitters were designed assuming that only an in-RAM data structure can keep up with high-speed streams. The challenge of detecting events entirely within RAM has inspired a deep and beautiful literature on streaming algorithms and database systems [3, 16, 18, 19, 21, 22, 29, 34, 35, 36, 37, 45, 49].
    However, streaming algorithms sacrifice accuracy to get solutions that can fit in RAM. First, most streaming heavy-hitter algorithms only work for high reporting thresholds, e.g., when T is a constant fraction of N. Second, they allow false positives. Third, many streaming algorithms perform some kind of sampling, which leads to false negatives. These inaccuracies are not the fault of the streaming algorithms. They are an inherent limitation when the stream is much larger than RAM. See Section 9 for a motivating application, where any of these three limitations would lead to failure.
    Combining streaming and external memory. This work challenges the assumption that only in-RAM data structures can keep up with real-world streams and shows that by using modern storage devices and building upon recent advances in external-memory dictionaries, we can design on-disk data structures that can process millions of stream events per second.
    In particular, we present algorithms in the external-memory model that support both exact and approximate reporting of heavy hitters. In the external-memory model [2], RAM has fixed size M, and accessing it is free. The disk has unbounded size and accessing it costs an input/output (I/O) transaction. An I/O transfers data between RAM and disk in blocks of size B. The algorithmic advantage of external memory is that there is unbounded storage. The algorithmic challenge is that I/Os are expensive.
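    To make the model's accounting concrete (a standard illustration, not specific to our data structures): scanning n contiguous items costs ⌈n/B⌉ I/Os; binary search over an on-disk sorted array costs O(log₂(n/B)) I/Os; and a B-tree search costs O(log_B n) I/Os. An algorithm that performs one random search per stream item therefore pays roughly one I/O per item once its state exceeds RAM, whereas an algorithm that only appends and merges can amortize each I/O over B items.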
    External-memory enables us to overcome longstanding limitations in accuracy (i.e., no false-positives or negatives) and sensitivity (i.e., small ɸ) while maintaining timeliness in event reporting, but necessitates developing new heavy-hitters algorithms that use I/Os efficiently.
    Our contributions. In this article, we present I/O-efficient external-memory algorithms that support both exact and approximate reporting of heavy hitters. Specifically, our TED algorithms can be generalized to solve the (ɸ, δ)-heavy hitters problem: every item that occurs at least ɸN times must be reported, and no item that occurs fewer than (ɸ − δ)N times should be reported. Items with counts in between (ɸ − δ)N and ɸN may be reported; these are false positives.
    The present article serves as the journal version of [57], and it also contains technical improvements. Our first contribution is theoretical. We include proofs for all lemmas and theorems, unlike [57], which, for space reasons, omitted essentially all proofs. Explaining and proving these results has more than doubled the length of the article and makes the results reproducible. Furthermore, we generalize the results for power-law streams presented in Reference [57] by specifying the precise relationship between the reporting threshold ɸ and the power-law exponent θ; for details, see Section 5.
    Our second contribution is experimental. We include all of the experiments from Reference [57] along with additional ones. In particular, we give an empirical analysis of the birthtime versus lifetime of items in the active-set generator of the Firehose streaming benchmark [4, 5]. Also, in the interest of reproducibility, we have included pseudocode for all data structures and algorithms.
    Finally, we provide detailed explanation of how the constraints of the TED problem are motivated from practice. In particular, in Section 9, we discuss the national-security application that motivates the Firehose benchmark [4, 5], and how the TED problem captures the main computational bottleneck of this application.
    Timeliness, not ingestion, is the challenge in external memory. Stream ingestion is not the bottleneck for on-disk data structures. Optimal external-memory (EM) dictionaries (including write-optimized dictionaries such as Bε-trees [11, 13, 25], COLAs [12], xDicts [24], buffered repository trees [26], write-optimized skip lists [15], log-structured merge trees [55], and optimal external-memory hash tables [31, 42]) can ingest new observations at a significant fraction of disk bandwidth. The fastest can index using o(1) amortized I/Os per stream item, which is far less than one I/O per item. In practice, this means that even a system with just a single disk can ingest hundreds of thousands to millions of items per second.
    For example, prior work at SuperComputing 2017 showed that a single computer can easily maintain an on-disk Bε-tree [25] index of all connections on a 600 gigabit/second network [10]. The system could efficiently answer offline queries. What the system could not do was detect events online.
    Existing external-memory data structures do not solve the TED problem, because queries are too slow. For example, consider a straw-man solution in which we use an external-memory dictionary to implement the standard heavy-hitters algorithm, Misra-Gries [53]. Since Misra-Gries performs a query for each stream observation, this approach is bottlenecked on the dictionary searches. Once the dictionary is larger than RAM, for a random stream, most queries will miss the cache and require an I/O, and hence the approach is bottlenecked on the latency of the storage device.
    In this article, we show how to perform timely event detection for essentially the same cost as simply inserting the data into a Bε-tree or other optimal external-memory dictionary. Even so, we manage to answer the standing heavy-hitter query for each new stream element.

    1.1 Results

    In this article, we present external-memory algorithms for the TED problem. We evaluate these algorithms theoretically and empirically. In both cases, we show that these algorithms perform much less than one I/O per query and are limited only by I/O bandwidth (not latency). Furthermore, we show how to provide a tradeoff between reporting delay and I/O cost. We call these data structures leveled external-memory reporting tables (LERTs).
    We begin by formally defining an event that must be reported in the TED problem. Given a stream S = s_1, s_2, …, s_N, a ɸ-heavy hitter is an element that occurs at least ɸN times in S. The heavy-hitters problem is to report all ɸ-heavy-hitters in S.
    In the TED problem, we say that there is a ɸ-event at timestep t if stream element s_t occurs exactly ɸN times in s_1, …, s_t. Thus, for each ɸ-heavy hitter there is a single ɸ-event, which occurs when the element's count reaches the reporting threshold T = ɸN. In the TED problem, the goal is to report ɸ-events as soon as they occur.
    Our first data structure, the Misra-Gries LERT, adapts the Misra-Gries heavy-hitter algorithm to solve the TED problem in external memory with immediate reporting. In particular, the Misra-Gries LERT reports each ɸ-event as soon as it occurs (no delay) at an amortized I/O cost matching that of ingestion, for sufficiently large ɸ. The guarantees of the Misra-Gries LERT hold for any input distribution; see Corollary 1.
    The Misra-Gries LERT serves as the basis of our main algorithms that support much smaller ɸ, but permit some delay in reporting. We define two types of delay: time stretch and count stretch. We say an event-detection algorithm has time stretch 1 + α if each item s is reported at most αF_s timesteps after s's Tth occurrence, where F_s is the number of timesteps between s's first and Tth occurrences. We say that an event-detection algorithm has count stretch β if each item is reported before the item's count reaches βT.
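    For concreteness, consider a hypothetical item with T = ɸN = 1,000 whose first occurrence is at timestep 10⁶ and whose 1,000th occurrence is at timestep 3 × 10⁶, so F_s = 2 × 10⁶. Under time stretch 1.5 (α = 0.5), the item must be reported by timestep 4 × 10⁶, i.e., within αF_s = 10⁶ timesteps of its 1,000th occurrence. Under count stretch 1.5, it must be reported before its count reaches 1,500.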
    We design a data structure, the time-stretch LERT, that solves the TED problem for any input stream and any ɸ with time stretch 1 + α, at an amortized I/O cost per stream item that, for constant α, asymptotically matches the cost of simply ingesting and indexing the data [12, 25, 26]. The time-stretch LERT guarantees hold for any input distribution; see Corollary 2.
    In our evaluations, the time-stretch LERT with stretch 2 can ingest at K insertions/second using a single thread. We also observed that the average empirical time stretch is 43% smaller than the theoretical upper bound.
    Our count-stretch LERT is tailored to guarantee count stretch on input stream distributions where the count for each item is drawn from a power-law distribution. In particular, given an input stream with item counts distributed according to a power law with an exponent θ in the typical range [1, 16, 23, 30, 54], and parameters T and Ω satisfying the condition of Theorem 10, we show that the count-stretch LERT solves the TED problem with count stretch 1 + Ω at a low amortized I/O cost per stream item with high probability (w.h.p.). Thus, the count-stretch LERT avoids expensive point queries, matching the ingestion rate of write-optimized data structures. In our evaluations, we find that the count-stretch LERT with stretch 1.583 sustains a high single-threaded ingestion rate. With multi-threading and de-amortization, the count-stretch LERT scales to more than 11M insertions/second, and the variance of the instantaneous throughput goes down by several orders of magnitude relative to the amortized, single-threaded version; see Figure 9. Moreover, the average empirical count stretch is 21% smaller than the theoretical upper bound.
    Finally, we show how to modify the count-stretch LERT to support immediate reporting. We call the resulting data structure the Immediate-report LERT and show that it solves the TED problem much faster than the Misra-Gries LERT for input streams with element counts drawn from power-law distributions; see Theorem 9 for the formal I/O cost. In our evaluation, we find that the Immediate-report LERT can ingest at ≈500K insertions/second using a single thread.

    Additional Related Work

    Heavy-hitter algorithms. The heavy-hitter problem has been extensively studied in the database literature; we refer readers to the survey by Cormode and Hadjieleftheriou [32].
    Two main strategies have been used: deterministic counter-based approaches [19, 36, 43, 49, 51, 53] and randomized sketch-based approaches [29, 33]. The first is based on the classic Misra-Gries (MG) algorithm [53], which generalizes the Boyer-Moore majority finding algorithm [20].
    Randomized sketch-based algorithms such as count-min sketch [33] maintain a small sketch of the frequency vectors using compact hash functions.
    More recent work has focused on generalizations of the heavy-hitters problem. Ting [59] considers aggregating subset sums, rather than counts, and Ben-Basat et al. [9] generalize the heavy hitter problem to sliding windows. Multiple researchers [52, 62, 63] have designed heavy-hitter algorithms for detecting top flows in networking applications.
    Database iceberg queries. The TED problem is related to the problem of answering iceberg queries in databases [17, 38, 40, 41]. An iceberg query computes an aggregate function over some database attribute and reports the values that are above some predetermined threshold. The main distinctions between the two problems are as follows: (a) iceberg queries are offline, i.e., performed on a static dataset; and (b) the number of reported results in iceberg queries is usually small, while the number of reported events can be large in the TED problem.
    Database continuous queries. The TED problem is an instance of a continuous or standing query over a database [6, 7, 28]. A continuous query, once issued, runs as the database is updated through inserts and deletes. The system reports new query matches as the database is updated. In TED, the database D consists of the items from the stream seen so far, and the continuous query over D is whether there is an item with count exactly T = ɸN.

    2 Preliminaries

    We formalize our model and review several building blocks of our data structures: the Misra-Gries heavy-hitters algorithm [53], counting quotient filters (CQF) [56], and cascade filters (CF) [14].
    TED problem and model. The TED problem is as follows: given a stream S = s_1, …, s_N, for each i, if there is a ɸ-event at time i, then report s_i before time j, such that the reporting delay j − i is within an acceptable degree of tolerance. In the Misra-Gries LERT in Section 3.2, there is no reporting delay. In the time-stretch LERT in Section 4, the reporting delay depends on the flow time of the item (the time it takes for the item's count to go from zero to ɸN), and in the count-stretch LERT in Section 5.2, the reporting delay is count-dependent.
    We measure time in terms of the number of stream observations. That is, in each timestep, the algorithm reads one stream observation, performs an arbitrary amount of computation and I/O, and generates an arbitrary number of reports. We say all reports generated during the ith timestep occur at time i.
    The Misra-Gries frequency estimator. The MG algorithm estimates the frequency of items in a stream. Given an estimation error ε and a stream S of N items from a universe U, the MG algorithm uses a single pass over S to construct a table C with at most 1/ε entries. Each table entry is an item s with a count, denoted C[s]. For each s not in table C, let C[s] = 0. Let f_s be the number of occurrences of item s in stream S. The MG algorithm guarantees that f_s − εN ≤ C[s] ≤ f_s for all s ∈ U.
    MG initializes C to an empty table and processes items in the stream as described below. For each s_i in S,
    If s_i ∈ C, then increment counter C[s_i].
    If s_i ∉ C and |C| < 1/ε, then insert s_i into C. Set C[s_i] = 1.
    If s_i ∉ C and |C| = 1/ε, then for each s ∈ C, decrement C[s] and delete its entry if C[s] becomes 0.
    To see why this algorithm ensures that f_s − εN ≤ C[s] ≤ f_s for all s, note that C[s] is incremented only for an occurrence of s in S. Thus C[s] ≤ f_s. For the lower bound, whenever we decrement C[s], the counts of 1/ε other items are decremented as well, so each decrement round consumes more than 1/ε stream occurrences. This can happen at most εN times. Thus, f_s − εN ≤ C[s].
    The MG algorithm can be used to solve the (ɸ, δ)-heavy hitters problem as follows. Run the MG algorithm on the stream with error parameter ε = δ. Then iterate over the set C and report any item s with C[s] ≥ (ɸ − δ)N.
    For a frequency estimation error of εN, Misra-Gries uses O(1/ε) words of storage, assuming each stream item and each count occupy O(1) words.
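    For concreteness, here is a minimal sketch of the MG estimator in C++ (our own illustration, not the paper's implementation; the table is an in-memory hash map with k = ⌈1/ε⌉ counters):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Minimal Misra-Gries estimator with k = ceil(1/eps) counters. After N
// observations, every estimate satisfies f_s - eps*N <= C[s] <= f_s.
class MisraGries {
    std::size_t k_;                                    // max number of counters
    std::unordered_map<std::string, long> table_;      // item -> counter C[s]

public:
    explicit MisraGries(std::size_t k) : k_(k) {}

    void observe(const std::string& s) {
        auto it = table_.find(s);
        if (it != table_.end()) {                      // case 1: counter exists
            ++it->second;
        } else if (table_.size() < k_) {               // case 2: spare counter
            table_.emplace(s, 1);
        } else {                                       // case 3: decrement all
            for (auto jt = table_.begin(); jt != table_.end();) {
                if (--jt->second == 0) jt = table_.erase(jt);
                else ++jt;
            }
        }
    }

    long estimate(const std::string& s) const {        // C[s]; 0 if untracked
        auto it = table_.find(s);
        return it == table_.end() ? 0 : it->second;
    }
};
```

    Note how each decrement round (case 3) consumes 1/ε tracked occurrences together with the arriving one, which is what bounds the total undercount by εN.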
    Analogous to the (ɸ, δ)-heavy hitters problem, we define the approximate TED problem as follows: Report all ɸ-events soon after they occur and do not report any item with count less than (ɸ − δ)N. Reported items with count in between (ɸ − δ)N and ɸN are false positives.
    Counting Quotient Filter. The CQF [56] can be viewed as a hash table based on Robin-Hood hashing [27]. The CQF consists of an array Q of 2^q slots and a hash function h mapping stream elements to p-bit integers, where p = q + r. Robin-Hood hashing is a variant of linear probing in which we try to place an element a in the slot determined by h(a), but shift elements down when there are collisions. Furthermore, Robin-Hood hashing maintains the invariant that, if h(a) < h(b), then a will be in an earlier slot than b.
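    The following sketch conveys the Robin-Hood insertion order that the CQF relies on (a simplified illustration with raw 64-bit keys; quotienting, counts, metadata bits, and wrap-around subtleties of the real CQF are omitted, and the table is assumed to never fill):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Simplified Robin-Hood insertion: occupied slots stay sorted by hash, so
// if h(a) < h(b), then a sits in an earlier slot than b.
struct RobinHoodTable {
    std::vector<std::uint64_t> slots;                  // 0 marks an empty slot
    explicit RobinHoodTable(std::size_t n) : slots(n, 0) {}

    static std::uint64_t h(std::uint64_t x) { return x * 0x9E3779B97F4A7C15ULL; }

    void insert(std::uint64_t key) {
        std::size_t i = h(key) % slots.size();
        std::uint64_t cur = key;
        while (slots[i] != 0) {
            // If the incumbent hashes later than the incoming element, the
            // incumbent is displaced and continues probing ("Robin Hood").
            if (h(slots[i]) > h(cur)) std::swap(slots[i], cur);
            i = (i + 1) % slots.size();
        }
        slots[i] = cur;
    }
};
```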
    The CQF supports efficient insertions, queries, updates, and deletions, just like any Robin-Hood hash table. Thus, it is straightforward to implement the Misra-Gries algorithm on top of a CQF, by using the CQF to store the table C.
    Cascade Filter. The CF [14] is a write-optimized data structure based on the CQF [56] and the COLA [12]. The CF consists of multiple levels with exponentially increasing sizes where each level is a CQF. The first level is in RAM and the rest are on SSD. There are L = log_r(N/M) levels, where M is the size of RAM, N is the size of the dataset, and r is the factor by which levels grow in size.
    Since the cascade filter is also a map, we can use it as the basis for an EM Misra-Gries algorithm. The total table size is O(N). The amortized I/O cost to update the table for each stream element is O((r/B) log_r(N/M)). However, if we want to support immediate reporting in a CF, then a query is triggered after each insert, and it costs O(log_r(N/M)) I/Os. Thus the overall algorithm is bottlenecked on the queries performed for each stream element.

    3 Immediate Reporting

    In this section, we first design an efficient external-memory version of the core Misra-Gries frequency estimator and then extend our external-memory Misra-Gries algorithm to solve the TED problem with immediate reporting.
    When 1/ε > M, simply running the standard Misra-Gries algorithm can result in a cache miss for every stream element, incurring an amortized cost of Ω(1) I/Os per element. Our construction reduces this to O((r/B) log_r(N/M)) I/Os, which is o(1) when B = ω(r log_r(N/M)).

    3.1 External-memory Misra-Gries

    Our external-memory Misra-Gries data structure is a sequence of Misra-Gries tables, C_0, C_1, …, C_L, where L = ⌈log_r(1/(δM))⌉ and r is a growth parameter we set later. The size of the table at level i is r^i M, so the size of the last level is at least 1/δ.
    Each level acts as a Misra-Gries data structure. Level 0 receives its input from the stream. Level i + 1 receives its input from level i, the level above. Whenever the standard Misra-Gries algorithm for the table at level i would decrement an item count, the external-memory MG data structure decrements that item's count by one on level i and sends one instance of that item to the level below (level i + 1). The decrements from the last level L are deleted.
    The external-memory MG algorithm processes the input stream by inserting each item in the stream into C_0. To insert an item x into level i, do the following:
    If x ∈ C_i, then increment C_i[x].
    If x ∉ C_i and |C_i| < r^i M, then insert x into C_i. Set C_i[x] = 1.
    If x ∉ C_i and |C_i| = r^i M, then, for each y in C_i, decrement C_i[y]; remove y from C_i if C_i[y] becomes 0. If i < L, then recursively insert y into C_{i+1}.
    We call the process of decrementing the counts of all the items at level i and incrementing all the corresponding item counts at level i + 1 a flush; a sketch of this cascade appears below.
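    The following sketch captures the cascade (our illustration: levels are plain in-memory hash maps rather than cascade-filter levels, and the capacity of level i is r^i·M entries):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of the external-memory Misra-Gries cascade. Level i holds at most
// r^i * M counters; a decrement round at level i sends one instance of each
// decremented item (and of the arriving item) down to level i + 1, and
// decrements falling off the last level vanish.
class CascadeMG {
    std::vector<std::unordered_map<std::string, long>> levels_;
    std::vector<std::size_t> cap_;

public:
    CascadeMG(std::size_t M, std::size_t r, std::size_t L) {
        for (std::size_t i = 0, c = M; i <= L; ++i, c *= r) {
            levels_.emplace_back();
            cap_.push_back(c);
        }
    }

    void observe(const std::string& x) { insert(x, 0); }

private:
    void insert(const std::string& x, std::size_t i) {
        if (i >= levels_.size()) return;               // fell off the last level
        auto& lvl = levels_[i];
        auto it = lvl.find(x);
        if (it != lvl.end()) { ++it->second; return; } // case 1: increment
        if (lvl.size() < cap_[i]) { lvl.emplace(x, 1); return; }  // case 2
        // case 3: level is full -- decrement every counter and flush one
        // instance of each decremented item one level down.
        for (auto jt = lvl.begin(); jt != lvl.end();) {
            insert(jt->first, i + 1);
            if (--jt->second == 0) jt = lvl.erase(jt);
            else ++jt;
        }
        insert(x, i + 1);  // the arriving occurrence is decremented away too
    }
};
```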
    Lemma 1 shows that every prefix of levels in the external-memory MG data structure is an MG frequency estimator, with the accuracy of the estimates increasing with j.
    Lemma 1.
    Let f̂_j(x) = Σ_{i=0}^{j} C_i[x] (where C_i[x] = 0 if x ∉ C_i). Then, the following holds:
    f̂_j(x) ≤ f_x, and,
    f_x − N/(r^j M) ≤ f̂_j(x).
    Proof.
    Decrementing the count for an element x on a level i < j and inserting it on the next level does not change f̂_j(x). This means that f̂_j(x) changes only when we insert an item x from the input stream into C_0 or when we decrement the count of an element on level j. Thus, as in the MG algorithm, f̂_j(x) is only incremented when x occurs in the stream, and is decremented only when the counts for other elements are also decremented. The first inequality follows from this and the MG analysis. The second inequality follows from the first and the fact that the combined size of levels 0, …, j is at least r^j M, so such a decrement can happen at most N/(r^j M) times.□
    Thus, to report (ɸ, δ)-heavy hitters (at the end of the stream), we can iterate over the sets C_0, …, C_L and report any element x with counter f̂_L(x) ≥ (ɸ − δ)N.
    For the I/O analysis, we assume that each level of the external-memory MG structure is implemented as a cascade filter [14].
    Lemma 2.
    Given δ, the amortized I/O cost of insertion in the external-memory MG data structure is O((r/B) log_r(1/(δM))).
    Proof.
    A flush from level i to level i + 1 in a cascade filter is implemented by scanning both levels, which can be done in O(r^{i+1} M/B) I/Os. Each such flush moves at least r^i M stream elements down one level, so the amortized cost to move one stream element down one level is O(r/B) I/Os. Each stream element can be moved down at most L levels. Thus, the overall amortized I/O cost is O((r/B) log_r(1/(δM))), which is minimized at r = e.□
    When no false positives are allowed, that is, δ = 1/N, the I/O complexity is O((r/B) log_r(N/M)).

    3.2 Misra-Gries LERT

    We extend our external-memory MG data structure to support immediate reporting. That is, we show that for a threshold ɸ that is sufficiently large, it can report ɸ-events as soon as they occur.
    A first attempt to add immediate reporting is to compute f̂_L(s_i) for each stream event s_i and report s_i as soon as f̂_L(s_i) = ɸN. However, this requires querying for f̂_L(s_i) for every stream item and can cost up to O(L) I/Os per stream item.
    We avoid these expensive queries by using the properties of the in-memory MG estimates C_0. If C_0[s_i] + ε_0 N < ɸN, where ε_0 is the frequency error of the in-memory level, then we know that f_{s_i} < ɸN and we therefore do not have to report s_i, regardless of the count for s_i in the lower levels of the external-memory data structure.
    We describe the new data structure, the Misra-Gries LERT. Whenever we increment C_0[s_i] from a value that is at most ɸN − ε_0 N to a value that is greater than ɸN − ε_0 N, we compute f̂_L(s_i) and report s_i if f̂_L(s_i) = ɸN. For each entry C_0[x], we store a bit indicating whether we have performed a query for x, along with a second count that stores the number of occurrences of x needed to hit reporting threshold ɸN. We set this second count appropriately whenever we compute f̂_L(x) without reporting x. When an instance of x arrives, C_0[x] is incremented as in external-memory MG, and if the search bit is set, then we also decrement the second count; if a decrement causes it to become zero, then we report x. As in our external-memory MG structure, if the count for an entry becomes 0, then we delete that entry (along with its metadata). This means we might query for the same item more than once; as we see below, this has no effect on the overall I/O cost of the algorithm.1
    To avoid reporting the same item more than once, we can maintain, with each entry, a bit indicating whether that item has already been reported.
    Whenever we report an item x, we set the "reported" bit in C_0[x]. Whenever we flush an item from level i to level i + 1, we set the bit for that item on level i + 1 if it is set on level i. When we delete the entry for an item that has the bit set on level L, we add an entry for that item on a new level L + 1. This new level contains only items that have already been reported. When we are checking whether to report an item during a query, we stop checking further and omit reporting as soon as we reach a level where the bit is set.
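    The RAM-level bookkeeping can be sketched as follows (our illustration, not the paper's code: query_threshold stands for the trigger value discussed above, and the full_count callback stands for the point query that sweeps the on-disk levels; flush interactions are omitted):

```cpp
#include <functional>
#include <string>
#include <unordered_map>

// Sketch of the Misra-Gries LERT reporting logic at the RAM level.
class MgLertRam {
    struct Entry {
        long count = 0;      // Misra-Gries counter in RAM
        long remaining = 0;  // occurrences still needed to hit phi*N (0 = unset)
        bool reported = false;
    };
    std::unordered_map<std::string, Entry> table_;
    long query_threshold_, report_threshold_;             // report_threshold_ = phi*N
    std::function<long(const std::string&)> full_count_;  // sweep of all levels

public:
    MgLertRam(long q, long t, std::function<long(const std::string&)> fc)
        : query_threshold_(q), report_threshold_(t), full_count_(std::move(fc)) {}

    // Returns true if x should be reported at this observation.
    bool observe(const std::string& x) {
        Entry& e = table_[x];
        long before = e.count++;
        if (e.reported) return false;
        if (e.remaining > 0) {                   // search bit set: count down
            if (--e.remaining == 0) return e.reported = true;
            return false;
        }
        if (before < query_threshold_ && e.count >= query_threshold_) {
            long f = full_count_(x);             // expensive sweep, rarely taken
            if (f >= report_threshold_) return e.reported = true;
            e.remaining = report_threshold_ - f; // occurrences left to threshold
        }
        return false;
    }
};
```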
    I/O complexity. For the analysis, we assume that the levels of the data structure are implemented as sorted arrays with fractional cascading, and thus computing f̂_L(x) requires O(L) I/Os.
    Theorem 3.
    Given a stream of size N and parameters δ and ɸ, where δ ≤ ɸ and ɸ is sufficiently large, the approximate TED problem can be solved with immediate reporting at an amortized I/O cost of O((r/B) log_r(1/(δM))) per stream item.
    Proof.
    The amortized cost of performing insertions is O((r/B) log_r(1/(δM))) by Lemma 2.
    To analyze the query costs, let ε_0 be the frequency error of the in-memory level. Since we perform at most one query each time an item's count in C_0 climbs from 0 past the query threshold ɸN − ε_0 N, the total number of queries is at most N/(ɸN − ε_0 N). Since each query costs O(L) I/Os, the overall amortized I/O complexity of the queries is O(L/(ɸN − ε_0 N)) per stream item, which is dominated by the insertion cost for sufficiently large ɸ.□
    Exact reporting. To solve the problem exactly, that is, with no false positives, we set δ = 1/N in Theorem 3, and get the following corollary.
    Corollary 1.
    Given a stream of size N and a sufficiently large ɸ, the TED problem can be solved with immediate reporting at an amortized I/O cost of O((r/B) log_r(N/M)) per stream item.
    Remark 1.
    The following example shows that the analysis of the Misra-Gries LERT is asymptotically tight. In particular, when the RAM threshold is reached for an item, then the item's counts are spread across all L levels of the data structure, requiring a full sweep to consolidate its count and report it; moreover, the total number of such queries can be large.
    Let and , so the Misra-Gries LERT has levels. Let the threshold , which satisfies the condition that , and let . Consider the stream S defined below.
    where the x_i's and y_i's are all distinct and not equal to item a. For this stream, every sufficiently long run of unique elements causes a decrement to the count of all items at the ith level, pushing instances of item a down to level i + 1. When item a reaches the reporting threshold of ɸN during the last phase, its instances occur all the way down to the last level in the Misra-Gries LERT.
    Thus, when the instance of a that triggers a report enters the system, we must collect at least one instance of a from every single level to recognize the need to report. Furthermore, every unique element of the stream triggers a query in RAM (since the RAM threshold is met), and there can be many such queries.
    Summary. The Misra-Gries LERT supports throughput at least as high as that of optimal write-optimized dictionaries [12, 13, 15, 24, 25, 26], while estimating the counts as well as if it had an enormous RAM. It maintains count estimates at different granularities across the levels. Not all estimates are actually needed, but given a small number of levels, we can refine the estimates by looking in only a few additional locations.
    The external-memory MG algorithm helps us solve the TED problem. The smallest MG sketch (which fits in memory) is the most important estimator here, because it serves to sparsify queries to the rest of the structure. When such a query gets triggered, we need the total counts from all remaining levels for the (exact) online event-detection problem, but only a prefix of the levels when approximate thresholds are permitted. In the next two sections, we exploit other advantages of this cascading technique to support much lower ɸ without sacrificing I/O efficiency.

    4 Time Stretch

    The MG LERT described in Section 3.2 reports events immediately, albeit at a high amortized I/O cost to perform queries to recognize the need for reporting. In this section, we show that if we allow a bounded reporting delay proportional to the time it takes an item to become a ɸ-event, then we can significantly improve the I/O performance—in particular, we can perform timely event detection asymptotically as cheaply as if we reported all events only at the end of the stream.
    Our data structure guarantees a time stretch of 1 + α. That is, it reports an item x no later than time t_T + αF_x, where t_1 is the time of the first occurrence of x, t_T is the time of the ɸNth occurrence of x, and F_x = t_T − t_1 is the flow time of x.

    4.1 Time-stretch LERT

    We design a data structure to guarantee time stretch, the time-stretch LERT. Similarly to the Misra-Gries LERT, the time-stretch LERT consists of levels C_0, …, C_L. The ith level has size r^i M. Items are flushed from lower to higher levels.
    Unlike the Misra-Gries LERT, all events are detected during the flush operations. Thus, we never need to perform point queries. This means (1) we can use simple sorted arrays to represent each level and (2) we do not need to maintain the invariant that level 0 is a MG data structure on its own.
    Data structure layout. We split the table at each level i into c equal-sized bins B_1^i, …, B_c^i, each of size r^i M/c, where the number of bins c depends on the stretch parameter α. The capacity of a bin is defined by the sum of the counts of the items in that bin, i.e., a bin at level i can become full because it contains r^i M/c items, each with count 1, or 1 item with count r^i M/c, or any other such combination. See Figure 1.
    Flushing schedule. We maintain a strict flushing schedule to obtain the time-stretch guarantee. The flushes are performed at the granularity of bins (rather than entire levels). The scheduling algorithm is described below.
    Fig. 1. A depiction of bins at each level of the time-stretch LERT; α is the time-stretch parameter. EM stands for external memory. All bins are equal sized.
    Let B_1^i, …, B_c^i be the bins (in order) on level i, where level 0 is RAM.
    Each stream item is inserted into B_1^0, the first bin in RAM.
    Whenever a bin becomes full, we shift all the bins on level i over by one; that is, we move the contents of bin B_j^i to the adjacent bin B_{j+1}^i. The elements of the last bin at level i, B_c^i, are moved to B_1^{i+1}, the first bin on the next level.
    Since the bins in level i + 1 are r times larger than the bins in level i, bin B_1^{i+1} becomes full after exactly r flushes from level i. When this happens, we perform a shift and flush the last bin on level i + 1, and so on.
    Count consolidation. Finally, during a flush involving levels 0 through j, we scan these levels and, for each item k, we sum its counts. If the total count is at least (ɸ − δ)N (and we have not reported it before2), then we report k. A sketch of this schedule appears below.
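    The flushing schedule can be sketched as follows (our illustration: each bin is a map from item to count, reporting and last-level handling are simplified away, and bin capacities follow the r^i·M/c sizing described above):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Bin = std::unordered_map<std::string, long>;     // item -> count

// Sketch of the time-stretch LERT flushing schedule: each level holds c
// equal-sized bins; when the first RAM bin fills, bins shift over by one
// and the last bin of a level spills into the first bin of the next level.
class TimeStretchLert {
    std::vector<std::vector<Bin>> levels_;             // levels_[i]: c bins
    std::vector<std::size_t> bin_cap_;                 // per-level bin capacity

public:
    TimeStretchLert(std::size_t c, std::size_t L, std::size_t ram_bin_cap,
                    std::size_t r) {
        for (std::size_t i = 0, cap = ram_bin_cap; i <= L; ++i, cap *= r) {
            levels_.emplace_back(c);
            bin_cap_.push_back(cap);
        }
    }

    void observe(const std::string& x) {
        ++levels_[0][0][x];                            // items enter bin 1 in RAM
        if (load(levels_[0][0]) >= bin_cap_[0]) shift(0);
    }

private:
    static std::size_t load(const Bin& b) {            // capacity = sum of counts
        std::size_t s = 0;
        for (const auto& kv : b) s += static_cast<std::size_t>(kv.second);
        return s;
    }

    void shift(std::size_t i) {
        Bin last = std::move(levels_[i].back());       // bin leaving level i
        for (std::size_t j = levels_[i].size() - 1; j > 0; --j)
            levels_[i][j] = std::move(levels_[i][j - 1]);
        levels_[i][0] = Bin{};
        if (i + 1 == levels_.size()) return;           // simplification: no last-level handling
        for (const auto& kv : last)                    // spill into next level
            levels_[i + 1][0][kv.first] += kv.second;
        // count consolidation and reporting would happen here (see text)
        if (load(levels_[i + 1][0]) >= bin_cap_[i + 1]) shift(i + 1);
    }
};
```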

    4.2 Analysis of Time-stretch LERT

    Correctness. We show that our data structure guarantees time stretch.
    Lemma 4.
    The time-stretch LERT with stretch parameter α reports each ɸ-event s_t occurring at time t by time t + αF_t, where F_t is the flow time of s_t.
    Proof.
    Consider an item s_t with flow time F_t. Let ℓ be the largest level containing an instance of s_t at time t, when it hits the threshold count of ɸN. The flushing schedule guarantees that, for each level i < ℓ, the item s_t must have waited through the bins on that level before being inserted into level i + 1. This is dominated by the waiting time on level ℓ − 1. That is,
    F_t is at least the number of observations needed to fill a bin on level ℓ − 1. (1)
    Level ℓ participates in a flush again after the number of inserts that fill up a bin on level ℓ. Using Equation (1), we get that this number of inserts is at most αF_t. Thus, s_t is reported at most αF_t timesteps after t.□
    I/O complexity. For the analysis, we treat each level as a sorted array.
    Theorem 5.
    Given a stream of size N and parameters α, δ, and ɸ, where δ ≤ ɸ, the approximate TED problem can be solved with time stretch 1 + α at an amortized I/O cost of O((1 + 1/α)(r/B) log_r(1/(δM))) per stream item.
    Proof.
    A flush from level i to i + 1 costs O(r^{i+1} M/B) I/Os, and moves at least r^i M/c stream items down one level, where c = O(1 + 1/α) is the number of bins per level. Thus, the amortized cost to move one stream item down one level is O((1 + 1/α) r/B) I/Os.
    Each stream item can be moved down at most L levels; thus the overall amortized I/O cost of an insert is O((1 + 1/α)(r/B) log_r(1/(δM))), which is minimized at r = e.□
    For exact reporting (no false positives), we set δ = 1/N.
    Corollary 2.
    Given a stream of size N, α, and ɸ, the TED problem can be solved with time stretch 1 + α at an amortized I/O cost of O((1 + 1/α)(r/B) log_r(N/M)) per stream item.

    4.3 Implementation of Time-stretch LERT

    We implement each level in the time-stretch LERT as an exact counting quotient filter [56]. In addition to the count, we store a few additional bits with each item to keep track of its age.
    In the time-stretch LERT, each level is split into equal-sized bins. In our implementation, instead of actually splitting levels into physical bins, we assign a value (i.e., the age of the item) of ⌈log₂ c⌉ bits to each item that determines its bin. The age of the item on a level determines whether the item is ready to be flushed down from that level during a flush.
    We also assign an age to each level, initialized to 0. Before a flush, the age of each level involved in the flush is incremented. The age of a level wraps back to 0 after c increments. The age of the level during the flush determines which items are eligible to be flushed down: if an item's age is the same as the level's age, then the item has survived c flushes on that level, and is therefore eligible to flush. When an item is inserted in a level, its age is set to the level's age. However, if the level already has an instance of the item, then we just increment the count of the existing instance, whatever its age.
    We follow a fixed schedule for flushes. We trigger a flush after every group of M/c stream observations. Every rth flush, level i flushes to level i + 1. That is, after every r flushes to level i, level i is involved in the next (rth) flush. To determine the number of levels involved in a flush, we maintain a counter per level for the number of times the level has been involved in a flush from above since its last flush down. Algorithm 1 shows the pseudocode for the flush schedule of the levels in a time-stretch LERT.
    Note that only “eligible items” are flushed down a level during these flush operations, in particular, items that have aged enough to be in the last bin at a level—or equivalently, items whose age is equal to the level’s age.
    Consolidating item counts during a flush is implemented as a k-way merge sort. We first aggregate the count of an item across all k levels involved in the flush. We then decide, based on the age of the instance of the item in the last level, whether to move it to the next level. If the instance of the item in the last level is aged, then we insert the item with the aggregate count in the next level. Otherwise, we update the count of the instance in the last level to the aggregate count. Algorithm 2 shows the pseudocode for flushing items in a time-stretch LERT. We use T to denote the reporting threshold in the implementation.
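    The age bookkeeping itself reduces to a few lines (a sketch of the logic described above, not the paper's code):

```cpp
#include <cstdint>

// Ages replace physical bins: a level's age advances (mod c) before each of
// its flushes, and an item is eligible to move down once its age comes back
// around to the level's age, i.e., it has survived c flushes on that level.
struct AgeLogic {
    std::uint8_t c;                          // number of virtual bins per level

    std::uint8_t advance(std::uint8_t level_age) const {
        return static_cast<std::uint8_t>((level_age + 1) % c);
    }
    std::uint8_t stamp_new_item(std::uint8_t level_age) const {
        return level_age;                    // new items adopt the level's age
    }
    bool eligible(std::uint8_t item_age, std::uint8_t level_age) const {
        return item_age == level_age;        // survived c flushes at this level
    }
};
```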
    Summary. By allowing a little delay, we can solve the timely event-detection problem at the same asymptotic cost as simply indexing our data [12, 13, 15, 24, 25, 26].
    Recall that in the online solution, the increments and decrements of the MG algorithm determined the flushes from one level to the other. In contrast, the flushing decisions in the time-stretch solution are based entirely on the ages of the items. The MG-style count estimates came essentially for free from the size and cascading nature of the levels. Thus, we get different reporting guarantees depending on whether we flush based on age or count.
    Our experimental results for the TED problem with immediate reporting and with time stretch show that there is a spectrum between completely online and completely offline, and it is tunable with little I/O cost.

    5 Power-law Distributions

    Our results in Sections 3 and 4 hold for worst-case input streams. In this section, we design TED algorithms tailored to perform well on practical input streams, in particular where the item-counts follow a power-law distribution. Note that the order of arrivals can still be adversarial.
    The item counts in the stream follow a power-law distribution with exponent θ if the probability that an item has count c is equal to Z c^{−θ}, where Z is the normalization constant.
    Berinde et al. [16] consider streams where the item counts follow a Zipfian distribution. A stream follows a Zipfian distribution with exponent α if and only if it follows a power-law distribution with exponent θ = 1 + 1/α [1]. They show that for Zipfian distributions with α ≥ 1 (power-law distributions with θ ≤ 2), the MG algorithm can solve the approximate heavy-hitter problem with error εN using only O(ε^{−1/α}) words. Alternatively, on such Zipfian distributions, the MG algorithm achieves an improved error bound using O(k) words. The error bound is in fact stronger, as it applies to the tail frequency of the stream, rather than the whole stream. In particular, if c_i is the true count of item i, and ĉ_i is the estimate, then on Zipfian distributions with space O(k), |c_i − ĉ_i| ≤ F^{res(k)}/k, where F^{res(k)} is the sum of counts of all keys except the top-k most frequent keys [16]. Our Misra-Gries LERT data structure based on the MG algorithm automatically inherits these improved bounds.
    We give improved results for power-law streams with exponents in a range that is representative of power-law distributions observed in practice [54]. In Section 5.1, we study the exact TED problem and design algorithms tailored for such a distribution. In Section 5.2, we present a data structure that has improved I/O performance and guarantees a count-dependent bounded delay.
    Preliminaries. We use the continuous power-law definition [54]: the count of an item with a power-law distribution has a probability p(x)dx of taking a value in the interval from x to x + dx, where p(x) = Z x^{−θ} for x ≥ 1, and Z is the normalization constant.3
    Thus, Z = θ − 1.4
    We use the cumulative distribution of a power law:
    Pr[count > x] = x^{−(θ−1)}. (2)
    For our analysis, we assume that the input stream S is constructed offline as follows. Let U denote the number of distinct keys in the stream S. The count for each key is drawn independently from a power-law distribution. Then the instances of the keys in S are ordered arbitrarily. That is, we do not make any assumptions on the arrival order of keys. Next, we analyze some properties of the input stream.
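    For intuition, counts with this distribution can be generated by inverse-transform sampling; a minimal sketch, assuming a minimum count of 1 and θ > 1 (our own illustration, not part of the Firehose benchmark):

```cpp
#include <cmath>
#include <random>

// Inverse-transform sampling of the continuous power law
// p(x) = (theta - 1) x^{-theta}, x >= 1: since Pr[X > x] = x^{1 - theta},
// setting u = Pr[X > x] and solving gives x = u^{-1/(theta - 1)}
// for u uniform in (0, 1].
double sample_power_law_count(double theta, std::mt19937_64& rng) {
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    double u = 1.0 - unif(rng);              // u in (0, 1], avoids u == 0
    return std::pow(u, -1.0 / (theta - 1.0));
}
```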
    Lemma 6.
    In the input stream with U distinct keys, where the count of each key is drawn independently from a power-law distribution with θ > 2, the following holds with high probability with respect to U:
    (1)
    the number of keys with count greater than c is O(U c^{1−θ});
    (2)
    the size of the stream is N = Θ(U).
    Proof.
    Let X_i denote the indicator random variable that is 1 if key i has count greater than c and 0 otherwise. Let X = Σ_{i=1}^{U} X_i. Then E[X] = U · Pr[count > c] = U c^{1−θ} by Equation (2). This also holds with high probability with respect to U using Chernoff bounds. This proves (1) in the above lemma.
    Next, let Y_i be the random variable denoting the count of key i. Let N = Σ_{i=1}^{U} Y_i. Then E[Y_i] = ∫_1^∞ x (θ − 1) x^{−θ} dx = (θ − 1)/(θ − 2), so E[N] = U (θ − 1)/(θ − 2) = Θ(U).
    The result holds with high probability with respect to U using a Chernoff bound argument.□

    5.1 Immediate-report LERT

    First, we present the layout of our data structure, the Immediate-report LERT, and then we present its main algorithms, shuffle merge and immediate-reporting query. Finally we analyze its correctness and I/O performance.
    Data structure layout. Similarly to the data structures in the previous sections, the Immediate-report LERT consists of a cascade of tables, where M is the size of the table in RAM. There are L = log_r(N/M) levels on disk, where N is the size of the stream. The size of level i is r^i M.
    Each level on disk has an explicit upper bound on the number of instances of an item that can be stored on that level. This is different from the MG algorithm, where this upper bound is implicit and is based on the level's size. In particular, each level i in the Immediate-report LERT has a level threshold τ_i, for 1 ≤ i ≤ L (τ_1 ≤ τ_2 ≤ ⋯ ≤ τ_L), where τ_i indicates the maximum count of a key that can be stored on level i.
    Threshold invariant. We maintain the invariant that at most τ_i instances of an item can be stored on level i. Later, we show how to set the τ_i's based on the input stream's power-law exponent θ.
    Shuffle merge. The Misra-Gries LERT and time-stretch LERT use two different flushing strategies. Here we present a third strategy called the shuffle merge.
    The level in RAM receives inputs from the stream one at a time.
    When attempting to insert into a level that is at capacity, we find the smallest level j that has enough empty space to hold all items from levels 0, …, j − 1.
    We aggregate the count of each item k on levels 0, …, j, resulting in a consolidated count c_k.
    If c_k ≥ ɸN (and we have not reported it before5), then we report k. Otherwise, we distribute the c_k instances of k in a bottom-up fashion on levels j to 0, while maintaining the threshold invariant. In particular, we place min(c_k, τ_j) instances of k on level j, and min(τ_y, c_k − Σ_{z=y+1}^{j} τ_z) instances of k on level y for each y < j, with any remainder staying in RAM.
    In the above algorithm, notice that items can end up in higher levels (compared to the level they were on before), which is why we call this operation a shuffle-merge instead of a merge. Also, observe that the threshold invariant prevents us from flushing too many counts of an item down. Thus, items can get packed at a level and cannot be flushed down. Specifically, we say an item is packed at level i if its count exceeds Σ_{z=i}^{L} τ_z, the combined thresholds of levels i through L. A sketch of the distribution step appears below.
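    The bottom-up distribution step for a single item can be sketched as follows (our illustration; tau[i] is the level threshold τ_i, and place(level, n) is a hypothetical stand-in for writing n instances of the item into that level):

```cpp
#include <algorithm>
#include <vector>

// Bottom-up "smearing" of a consolidated count c_k over levels j..0 during a
// shuffle merge: each level i on disk takes at most tau[i] instances; whatever
// cannot be pushed down stays packed higher up (level 0 = RAM has no
// threshold here and absorbs the remainder).
void distribute(long long c_k, int j, const std::vector<long long>& tau,
                void (*place)(int level, long long n)) {
    for (int i = j; i >= 1; --i) {           // deepest level first
        long long n = std::min(c_k, tau[i]); // threshold invariant at level i
        if (n > 0) place(i, n);
        c_k -= n;
    }
    if (c_k > 0) place(0, c_k);              // remainder stays in RAM
}
```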
    To maintain efficient shuffle merges, the number of packed items at a level should not occupy more than a constant fraction of the size of the level. In Lemma 7, we show that given a power-law stream with exponent θ, we can set the thresholds based on θ so as to satisfy this requirement.
    Lemma 7.
    Let the counts of U distinct items in the stream of size N follow a power-law distribution with exponent θ > 2. Let the level thresholds τ_i, for 1 ≤ i ≤ L, be set as a function of θ. The number of keys packed at level i is at most a constant fraction of the size of level i.
    Proof.
    We prove the claim by induction on the levels, starting at level L. An item is packed at level L if its count is greater than τ_L. By Lemma 6 (1), there are O(U τ_L^{1−θ}) such items. By Lemma 6 (2), the size of the stream is N = Θ(U). The size of level L is r^L M. Thus, the number of items packed at level L is at most a constant fraction of the size of level L.
    Suppose the lemma holds for levels greater than i. We show that it holds for level i. Using Lemma 6 (1) and the induction hypothesis, the expected number of items whose counts exceed the combined thresholds of the levels below level i is bounded as
    (3)
    Finally, an item is packed at level i if its count is greater than Σ_{z=i}^{L} τ_z. Using Lemma 6 (1) and Inequality (3), the expected number of items packed at level i is at most a constant fraction of the size of level i.□
    Immediate reporting. As soon as the count of an item k in RAM reaches a search threshold of ɸN − Σ_{i=1}^{L} τ_i, the data structure triggers an immediate-reporting query, which sweeps all L levels, consolidates the counts of k at all levels into RAM, and reports k if the consolidated count reaches threshold ɸN. Reported items are remembered, so that each event is reported exactly once.
    Analysis. Next, we prove correctness of the Immediate-report LERT and analyze its I/O complexity. We set the growth factor r to the value that minimizes the insertion cost (in Theorem 9).
    First, we prove that the Immediate-report LERT reports all ɸ-events as soon as they occur.
    Lemma 8.
    Let S be a stream of size N where the item counts follow a power-law distribution with exponent θ > 2. The Immediate-report LERT solves the TED problem with immediate reporting on S with high probability.
    Proof.
    Let ĉ_i denote the count estimate of key i in RAM in the Immediate-report LERT. Because of the threshold invariant, at most Σ_{j=1}^{L} τ_j instances of a key can be stored on disk at any time.
    Suppose ĉ_i, the count in RAM for key i, is incremented to the search threshold ɸN − Σ_{j=1}^{L} τ_j at time t. This triggers an immediate-report query. The counts from all levels of the disk are added to ĉ_i to give an accurate count c_i. Because of the threshold invariants, we have c_i ≤ ɸN. If c_i = ɸN, then we report item i at time t, exactly as its count reaches the reporting threshold. Otherwise, the system sets a bit to indicate that the count includes all occurrences of the key in the data structure; when this (accurate) count c_i reaches ɸN, it is reported immediately.□
    Next, we analyze the I/O complexity of the Immediate-report LERT. Similarly to Section 3.2, we assume the levels of the Immediate-report LERT are implemented as a cascade filter [14].
    Theorem 9.
    Let S be a stream of size N where the item counts follow a power-law distribution with exponent θ > 2. Then the Immediate-report LERT can solve the TED problem on S w.h.p. for sufficiently large thresholds ɸN, where the precise lower bound on ɸN depends on θ, M, and N. The amortized I/O complexity of the data structure is O((1/B) log(N/M)) per stream item.
    Proof.
    During a shuffle-merge at level i, the items that are not packed are flushed down to level i + 1, incurring an I/O cost of O(r^{i+1} M/B). We can charge this cost to the unpacked items that get flushed down, so the amortized cost is O(r/B) I/Os per item moved down one level. An item can move down at most L times, and thus the amortized insert cost is O((r/B) log_r(N/M)). This cost is minimized at r = e.
    We perform at most one query each time an item's count in RAM reaches the search threshold. We upper bound the total number of items in the stream that have count at least c, for c at least the search threshold.
    The total number of items in the stream with count at least c is O(U c^{1−θ}) by Lemma 6 (1). Using the lower bound on c, we can bound the total number of such queries.
    A query in a cascade filter costs O(L) I/Os, as it requires one I/O per level. Thus, the overall amortized I/O complexity of the queries over N elements is dominated by the insertion cost for thresholds ɸN in the range stated in the theorem.
    Putting it all together, substituting r = e, and ignoring multiplicative factors of θ, we conclude that the amortized I/O complexity of the data structure is O((1/B) log(N/M)).□
    Remark 2.
    The relationship between the reporting threshold ɸN and the power-law exponent θ identified in Theorem 9 for the Immediate-report LERT is a generalization of the relationship presented in Reference [57], which provides a weaker bound.
    Supporting smaller reporting thresholds for power-law streams. The Immediate-report LERT, tailored for power-law streams, lets us support smaller reporting thresholds ɸN than the Misra-Gries LERT supports for immediate reporting (in particular, the reporting threshold in Corollary 1). To see why this is true, notice that the lower bound on ɸN in Theorem 9 consists of two terms multiplied together: the first term depends only on θ, and for the range of interest it is a small constant, while the second term decreases exponentially as θ increases. Conversely, the reporting threshold ɸN that the Misra-Gries LERT can support must be substantially larger.
    Thus, under reasonable conditions on N, M, and θ, the Immediate-report LERT can support smaller reporting thresholds for immediate reporting than previous data structures.

    5.2 Count-stretch LERT

    In this section, we show that if we eliminate expensive immediate-reporting queries from the Immediate-report LERT, then the data structure still supports bounded-delay reporting with a count-dependent delay. We say that a TED algorithm has count stretch β if it reports each key by the time its count hits βɸN. In particular, the notion of count stretch relaxes the reporting threshold, which leads to reduced random disk accesses.
    The count-stretch LERT is the following modification of the Immediate-report LERT: We eliminate immediate-reporting queries and report an item when its count in RAM hits ɸN. The data structure layout, thresholds and shuffle-merges (including reporting during shuffle-merges) are the same as in the Immediate-report LERT.
    A count-stretch guarantee does not imply any time-stretch guarantee. This is because the item's arrival distribution may be irregular: a sudden burst may give a key a count of ɸN quickly, with unfortunate shuffle-merge timing moving the maximum number of occurrences to disk before the RAM count hits ɸN. It could then take much longer to get from the ɸNth occurrence to the occurrence that finally triggers the report.
    Theorem 10.
    Let S be a stream of size N where the item counts follow a power-law distribution with exponent θ > 2, and let parameters ɸ, Ω be such that Σ_{i=1}^{L} τ_i ≤ ΩɸN. Then the count-stretch LERT solves the TED problem on S w.h.p. with count stretch 1 + Ω at an amortized I/O cost of O((1/B) log(N/M)) per stream item.
    Proof.
    The amortized I/O complexity of the count-stretch LERT follows from the insertion cost in Theorem 9, without the expensive immediate-reporting queries. Recall that the insertion cost is minimized by setting r = e.
    For a count stretch of 1 + Ω, it is sufficient to show that when an item hits a count of ɸN in RAM, there are at most ΩɸN occurrences of that item stored in the lower levels of the data structure on disk.
    By the threshold invariant of the count-stretch LERT, we can bound the total occurrences of an item in levels 1, …, L on disk by Σ_{i=1}^{L} τ_i. Below, we show that this quantity is at most ΩɸN.
    For the thresholds as set above, we can upper bound this sum directly. Since Σ_{i=1}^{L} τ_i ≤ ΩɸN by assumption, it follows that there can be at most ΩɸN occurrences of an item stored on disk at any time. Thus, when the count estimate of an item in RAM reaches ɸN, its true count is at most (1 + Ω)ɸN.□
    Remark 3.
    For power-law exponents in the range typically observed in practice [1, 16, 23, 30, 54], the relevant term in Theorem 10 is a small constant, and thus the count-stretch LERT can support parameters ɸ and Ω with small Ω.
    Remark on dynamically setting thresholds. If the power-law exponent θ is not known ahead of time, but a feasible setting of level thresholds exists, then we can dynamically update the thresholds to ensure that no level of the data structure has too many packed items. In particular, to satisfy Lemma 7, it is sufficient to ensure that the number of items packed at any level i does not exceed its size.
    We incrementally update the level thresholds to satisfy this condition as follows. Initially, τ_i = 0 for each level i. During a shuffle merge involving the first j levels on disk, we set τ_j to the minimum value such that the number of keys packed at level j is no more than half its size. Thus, we increment the τ_i's monotonically from 0 to their feasible settings, without relying on the exponent θ.
    Summary. With a power-law distribution, we can support a much lower threshold ɸ for the TED problem. In the Misra-Gries LERT (Section 3.1), the upper bounds on the counts at each level are implicit. We show that for power-law distributions, we can achieve better performance by explicitly setting these bounds in the form of thresholds.

    5.3 Implementation of Count-stretch and Immediate-report LERT

    We describe the implementation details of the count-stretch and Immediate-report LERT, including further optimizations. Similarly to the time-stretch LERT, each level is an exact counting quotient filter [56]. In the count-stretch LERT, in addition to the count of each key, we store a few additional bits to mark whether an item has its absolute count at a level (i.e., its aggregate count across all the levels).
    Similarly to the flush schedule in the time-stretch LERT, we follow a fixed shuffle-merge schedule. A shuffle-merge is invoked from RAM after every M observations. The level thresholds determine how many instances of an item can be stored at that level. To satisfy threshold constraints, during a shuffle merge, we first aggregate the count of each item and then smear it across all levels involved in the shuffle-merge in a bottom-up fashion without violating the thresholds. Algorithm 3 shows the pseudocode for the shuffle-merge in a count-stretch LERT.
    Optimization. We also implement an optimization in the count-stretch LERT that further reduces I/O costs by following a “greedy” flushing schedule instead of a fixed schedule. This is based on the observation that unlike time stretch, the count stretch does not depend on the number of observations in the stream. Therefore, we do not need to perform shuffle merges at regular intervals. We only invoke a shuffle-merge if it is needed, i.e., when the RAM is at capacity. The greedy flushing optimization is implemented as an additional input flag that can be turned on or off.
    The CQF uses a variable-length encoding for storing counts and uses much less space compared to a unary-encoding. Therefore, the actual number of slots needed for storing M observations can be much smaller than M slots, if there are duplicates in the stream. This is the case for streams such as the one from Firehose, where counts have a power-law distribution. The greedy shuffle-merge schedule avoids unnecessary I/Os that a fixed schedule would incur during shuffle-merges.
    As explained in Section 5.1, in the Immediate-report LERT we perform an immediate-reporting query when the count in RAM reaches the search threshold. To compute the aggregate count, we perform point queries to each level on disk and aggregate the counts. If the aggregate count in RAM and on disk is T, then we report the item. Otherwise, we insert the aggregate count in RAM and set a bit, the absolute bit, that indicates that all the counts for the item have been found. This avoids unnecessary point queries to disk later on. We use a lazy policy to delete the instances of items from disk: they are garbage collected during the next shuffle merge.

    6 Deamortization to Support Consistent Ingestion Rates

    The LERTs consider observation t to occur exactly one timestep before observation t + 1. In practice, however, observation t might trigger a significant rebuild of the data structure, delaying observation t + 1. In a high-speed streaming context, that observation, and potentially millions after it, would be dropped while a rebuild is going on.
    To mitigate this problem, we now describe how to deamortize LERTs. Our deamortization strategy works in serial, and also provides the foundation of the multithreading strategy we introduce in Section 7.
    To deamortize, we decompose the data structure into C independent parts called cones that partition the space of hashed items. Each stream item is mapped to exactly one of these cones using a uniform-random hash function. A cone is an independent instance of the LERT with the same expansion factor r and the same number of levels, each of which is 1/C-th the size of the corresponding complete level.
    Each cone is independent, following its own merge schedule. Incoming items are routed to the appropriate cone for independent insertion and potential reporting. Thus, given uniform-random hashing, each cone accounts for roughly a 1/C fraction of the aggregate I/O. A sketch of the routing appears below.
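    Routing costs one hash evaluation per observation; a sketch (kNumCones is an assumed power of two here, so that the low bits of the hash select the cone):

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Map each item to one of C cones using a uniform hash; all occurrences of
// an item land in the same cone, so each cone is an independent LERT over
// roughly a 1/C fraction of the stream.
constexpr std::uint64_t kNumCones = 64;      // must be a power of two here

std::uint64_t cone_of(const std::string& item) {
    std::uint64_t h = std::hash<std::string>{}(item);
    return h & (kNumCones - 1);              // low bits select the cone
}
```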
    Deamortization timeliness guarantees. We consider the timeliness guarantees for the deamortized serial versions of the count-stretch and time-stretch LERTs. When streams are split into substreams based on hash values, we must revisit these guarantees. We note that count stretch is unaffected:
    Lemma 11.
    A deamortized count-stretch LERT provides the same count stretch guarantee as the original count-stretch LERT when run on the same input stream.
    Proof.
    The count stretch of an item in a count-stretch LERT depends only upon the item’s final count when it is reported. This final count is independent of the rest of stream. In the deamortized count-stretch LERT, all observations of an item go to a single cone, and each cone independently provides the same count stretch as the amortized count-stretch LERT for items mapped to that cone.□
    Lemma 12.
    There exists an input stream for which the deamortized time-stretch LERT provides no global time stretch guarantee.
    Proof.
    We construct an arrival distribution that causes an arbitrarily long time stretch for an item in a deamortized time-stretch LERT. It begins with some observations of an item I, followed by enough distinct items, all mapping to item I's cone C, to cause a flush in cone C. The sequence then has one more observation of item I, followed by an arbitrarily long sequence of observations, none of which go to cone C. Thus, cone C has an arbitrary delay before its next merge, and item I has an unbounded reporting delay.□
    Theorem 13.
    Consider a random stream where each arriving item maps to a cone via a fixed probability distribution. If each cone runs a time-stretch LERT guaranteeing a time stretch of 1 + α, then the deamortized time-stretch LERT will have a time stretch of 1 + α in expectation with respect to the full stream.
    Proof.
    Suppose each item maps to cone i with probability p_i. Consider a key k that maps to cone i, with its first appearance at index I_1 and its Tth occurrence at index I_T. Let D = I_T − I_1. The time-stretch LERT without cones will report k by time (index) T_D = I_T + αD. In the deamortized version, cone i receives p_i D items between indices I_1 and I_T in expectation. So it will report k when another αp_i D items arrive at cone i. But cone i should receive that many items in the αD items after I_T, in expectation. Thus we expect cone i to report k by time T_D. A similar argument holds when the stream is a random permutation of a finite stream with elements from cone i.□

    7 Multi-threading

    We now describe thread-safe versions of the deamortized count-stretch and time-stretch LERT. A thread-safe implementation enables ingesting observations using multiple threads. This is crucial for two reasons: (1) We can scale the ingestion throughput to support high-speed streams, and (2) multiple threads performing I/Os simultaneously can utilize the full SSD bandwidth, which would be wasted otherwise.
We use two types of locks in our design: a cone-level lock and a CQF-level lock. The cone-level lock is a distributed readers-writer lock implemented using a partitioned counter (i.e., a per-CPU counter). This ensures that readers do not thrash on the cache line containing the count of readers holding the lock. The CQF-level lock is a spin lock, as described by Pandey et al. [56].
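A minimal sketch of such a distributed readers-writer lock appears below (C++). The partition count, the try-only acquisition style, and all names are our assumptions for illustration; the sketch simplifies away details of the actual lock:

    #include <atomic>

    // Each reader count sits on its own cache line, so concurrent readers on
    // different CPUs do not invalidate each other's caches.
    struct alignas(64) PaddedCounter { std::atomic<int> n{0}; };

    struct PartitionedRWLock {
      static constexpr int kPartitions = 64;  // e.g., one per CPU (assumption)
      PaddedCounter readers[kPartitions];
      std::atomic<bool> writer{false};

      bool try_read_lock(int cpu) {
        PaddedCounter &slot = readers[cpu % kPartitions];
        slot.n.fetch_add(1, std::memory_order_acquire);
        if (writer.load(std::memory_order_acquire)) {
          slot.n.fetch_sub(1, std::memory_order_release);  // writer active
          return false;
        }
        return true;
      }
      void read_unlock(int cpu) {
        readers[cpu % kPartitions].n.fetch_sub(1, std::memory_order_release);
      }
      bool try_write_lock() {
        bool expected = false;
        if (!writer.compare_exchange_strong(expected, true)) return false;
        for (auto &slot : readers)          // wait out in-flight readers
          while (slot.n.load(std::memory_order_acquire) > 0) { /* spin */ }
        return true;
      }
      void write_unlock() { writer.store(false, std::memory_order_release); }
    };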
    We assign a small local insertion buffer to each thread. See Figure 2. Each insertion thread performs the same set of operations. It starts by first receiving a packet of observations over a network port or reading a small chunk (usually 1,024) of observations from an input file. It then processes each observation in the packet one-by-one.
    Fig. 2.
    Fig. 2. A depiction of multi-threading with cones in a LERT.
Each thread must acquire two locks to perform an insertion: a read lock on the item’s cone and a lock on the region of the CQF (i.e., the RAM level of the cone) to which the item hashes. It tries once to acquire each lock, neither spinning nor sleeping on failure. If it does not get both locks on the first attempt, then it releases any lock it did acquire, inserts the observation into its local insertion buffer, and continues to the next observation. When the local buffer is full, the thread dumps the buffered items into their respective cones; only when dumping a buffer does a thread wait for the locks.
If a thread acquires both locks on the first attempt, then it performs the insertion and releases the lock on the relevant region of the CQF. It then checks whether the cone needs to perform a flush or shuffle-merge. If so, then it first releases the read lock and tries to acquire a write lock on the cone. If it gets the write lock on the first attempt, then it performs the flush/shuffle-merge; if it fails, then some other thread is already performing a flush/shuffle-merge, and this thread can continue.
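The sketch below summarizes this try-once control flow (C++17, with standard mutexes standing in for the custom locks and stub types for the cone and buffer; all names are hypothetical, not the authors' code):

    #include <cstdint>
    #include <mutex>
    #include <shared_mutex>
    #include <vector>

    struct Cone {
      std::shared_mutex lock;  // stand-in for the distributed readers-writer lock
      std::mutex region;       // stand-in for the CQF-level spin lock
      bool insert(uint64_t) { return false; }  // true if a flush is now needed
      void flush_or_shuffle_merge() {}
    };

    void process(uint64_t key, Cone *cones, size_t num_cones,
                 std::vector<uint64_t> &local_buffer) {
      Cone &c = cones[key % num_cones];
      if (!c.lock.try_lock_shared()) {        // one attempt, no spinning
        local_buffer.push_back(key);
        return;
      }
      if (!c.region.try_lock()) {
        c.lock.unlock_shared();
        local_buffer.push_back(key);
        return;
      }
      bool needs_flush = c.insert(key);
      c.region.unlock();
      c.lock.unlock_shared();
      if (needs_flush && c.lock.try_lock()) { // upgrade to write lock, once;
        c.flush_or_shuffle_merge();           // on failure, another thread is
        c.lock.unlock();                      // already flushing this cone
      }
    }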
The local buffers let us avoid heavy contention among threads, even when every thread tries to lock the same cone, because threads never wait to acquire a cone lock for an individual insertion and thus continue to make progress. Also, item counts are consolidated in local buffers, so during a buffer dump only one insertion per distinct item is required, instead of one insertion per occurrence of the same item. Our method scales well with an increasing number of insertion threads, even for streams with skewed distributions. We show this empirically in Section 8.8.
Using readers-writer locks at the cone level allows multiple threads to simultaneously insert into different regions of the RAM CQF of a cone by acquiring a read lock. A thread upgrades to a write lock when it needs to do a flush/shuffle-merge. Readers-writer locks also allow us to use more threads than cones: even if all cones flush simultaneously, there would still be threads processing incoming observations.

    7.1 Timeliness with Multi-threading

    We now discuss the effect of multithreading on the timeliness guarantees of the count-stretch and time-stretch LERT.
    Measuring time. One issue that immediately arises when trying to analyze time- and count-stretch in the multi-threaded case is: How do we measure time? In the single-threaded case, we measure time in terms of the number of stream observations that the process has ingested, i.e., in each timestep, the algorithm gets to read one stream observation, perform an arbitrary amount of computation and I/O, and generate an arbitrary number of reports. We say all reports generated during the ith timestep occur at time i.
We generalize this in the multi-threaded model: when a thread reports items, it uses the index of the last observation pulled by any thread as the reporting time. This can cause the reporting index of an item to be much higher than in the single-threaded case, because multiple threads each pull a chunk (usually 1,024 observations) simultaneously. Therefore, multi-threading adds an extra delay to the timeliness guarantees of the time-stretch LERT and extra counts to the guarantees of the count-stretch LERT. We analyze this empirically in Section 8.4.
Count stretch. The multi-threaded count-stretch LERT has only one new source of delay: the time that an item might spend sitting in a thread’s local buffer. In the worst case, an item could accumulate up to B occurrences in each thread’s local buffer (where B is the buffer capacity), in addition to ΩT occurrences in the main data structure, so that it does not get reported until it reaches a count of ΩT + PB.
To limit this pathological case, we implement a policy that upper bounds the total count an item can have in a thread’s local buffer: no thread can hold more than T/P instances of an item in its local buffer. Whenever the count of an item in the local buffer reaches T/P, the thread must move that item from its local buffer to the main data structure. This way we can bound the maximum count of an item when it is reported.
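A sketch of this cap (C++; the map-based buffer and all names are illustrative assumptions): whenever a key’s buffered count reaches T/P, the accumulated count is handed off to the key’s cone.

    #include <cstdint>
    #include <unordered_map>

    struct LocalBuffer {
      std::unordered_map<uint64_t, uint32_t> counts;
      uint32_t cap;  // cap = T / P; T = reporting threshold, P = thread count

      // Returns 0 to keep buffering, or the accumulated count to push into
      // the key's cone once the per-thread cap is reached.
      uint32_t add(uint64_t key) {
        uint32_t c = ++counts[key];
        if (c < cap) return 0;
        counts.erase(key);
        return c;
      }
    };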
    Lemma 14.
Given Ω and T, and a cap of T/P on the number of instances of an item in each local buffer, where P is the number of threads, a multi-threaded count-stretch LERT guarantees a count stretch of Ω + 1.
    Proof.
Because the maximum count of an item in a thread’s local buffer is T/P, for P threads the maximum buffered count for any item is T. An individual cone with count-stretch guarantee Ω will report an item when it holds at most ΩT instances of that item. Thus the maximum number of instances in the system at the time of the report is ΩT + T = (Ω + 1)T.□
    Time stretch. It is harder to provide a time-stretch guarantee with multiple threads compared to the count-stretch guarantee. This is because time stretch depends on the arrival distribution of other items in the stream, while count stretch is independent of that.
    When multiple threads are simultaneously performing ingestion, each thread can pick a chunk of observations from the stream. These observations can be inserted in the data structure out-of-order based on the contention among threads. To guarantee a time stretch with multiple threads we need a global ordering on the observations.
Model. In each timestep, a thread gets to read one observation from the stream and perform all the work for that observation. The work includes taking a lock and inserting the observation into the cone, inserting the observation into the local buffer, dumping the contents of the local buffer into cones, and performing a flush/shuffle-merge on the cone. As above, we constrain how long a thread can go before dumping its local buffer: every thread must dump its local buffer every t timesteps.
    Based on the above model and constraints, we can now guarantee that the time stretch in the multi-threading case will not be much worse than the single-threaded case.
    Observation 1.
In a multi-threaded time-stretch LERT in which each thread dumps its local buffer every t timesteps, we guarantee that an item s is reported in at most ε F_s + Pt additional timesteps (after the item count reaches T), where F_s is the flow time of s and P is the number of threads.

    8 Evaluation

    In this section, we evaluate our implementations of the time-stretch LERT (TSL), count-stretch LERT (CSL), and Immediate-report LERT (IRL) for timeliness, robustness to input distributions, I/O performance, insertion throughput, and scalability with multiple threads. Our implementation is publicly available at https://github.com/splatlab/lerts.
    We compare our implementations against Bender et al.’s cascade filter [14] as a baseline for timeliness. This baseline is an external-memory data structure with no timeliness guarantee. We show that reporting delays can be quite large when data structures take no special steps to ensure timeliness.
    We also evaluate an implementation of the Misra-Gries data structure as a baseline for in-memory insertion throughput. We implement the Misra-Gries data structure with an exact counting data structure (counting quotient filter) to forbid false positives. This gives an upper bound on the insertion throughput one can achieve in-memory while performing immediate event-detection. The objective of this baseline is to evaluate the effect of disk accesses during flushes/shuffle-merges in our implementations of the TSL, CSL, and IRL.
We address the following performance questions for the time-stretch, count-stretch, and immediate-report LERTs:
(1) How does the empirical timeliness of reported items compare to the theoretical bounds?
(2) How robust is the time-stretch LERT to different input distributions?
(3) How do deamortization and multi-threading affect the empirical timeliness of reported items?
(4) How does the buffering strategy affect count stretch and throughput?
(5) How does LERT total I/O compare to the theoretical bounds?
(6) What is the insertion throughput of the time-stretch, count-stretch, and immediate-report LERTs?
(7) How do deamortization and multiple threads affect instantaneous throughput?
(8) How does insertion throughput scale with the number of threads?

    8.1 Experimental Setup

In this section, we describe how we designed experiments to answer the questions above and describe our workloads.
Our experiments fall into two categories: validation experiments and scalability experiments. The validation experiments require an offline analysis of the dataset to compute the lifetime and measure the stretch of every key. We use smaller datasets (64 million observations) for the validation experiments and bigger datasets (4 billion observations) for the scalability experiments.
Workload. Firehose [5] is a suite of benchmarks simulating a network-event monitoring workload. A Firehose benchmark consists of a generator that feeds keys to the analytic being benchmarked. The analytic must detect and report each key that occurs 24 times.
Firehose includes two generators: the power-law generator selects from a static ground set of 100,000 keys according to a power-law distribution, while the active-set generator allows the ground set to drift over an infinite key space. We use the active-set generator, because an infinite key space more closely matches many real-world streaming workloads. To simulate a stream of keys drawn from a huge key space, we increase the active-set size to one million.
Figure 3 shows the distribution of birthtime (the index of the first occurrence of an item) vs. the lifetime (the number of observations between the first and the Tth occurrence) of items in a stream from the active-set generator. The stream contains 50M observations and the active-set size is 1M.
    Fig. 3.
    Fig. 3. Birthtime vs. the lifetime of each reportable item in the active-set generator dataset consisting of 50M observations.
The longest lifetime is ≈22M. Whenever a new item is added to the active set, it is assigned a count value, drawn from the set of possible counts according to the power-law distribution. Therefore, we see bands of items that have similar lifetimes but are born at different times throughout the stream. The lifetimes of items in these bands tend to increase slightly for items born later in the stream, due to the different selection probabilities of items in the active set. In all of our experiments we use datasets from the active-set generator unless noted otherwise.
Other workloads. Apart from Firehose, we use four other simulated workloads to evaluate the empirical stretch of the time-stretch LERT. These four workloads are generated to show the robustness of the data structure to non-power-law distributions. In the first distribution, M keys (where M is the size of the level in RAM) appear with a count between 24 and 50, and the rest of the keys are chosen uniformly at random from a big universe. In the second, M keys appear 24 times and the rest of the keys appear 23 times. In the third, M keys appear round-robin, each with a count of 24. In the fourth, for each key we pick the count uniformly at random between 1 and 25.
Reporting. During insertion, we record each reported item and the index in the stream at which the data structure reports it. We record this by inserting the reported item into an exact CQF (the anomaly CQF) and encoding the index as the count of the item in the anomaly CQF. We also use the anomaly CQF to check whether an incoming item has already been reported, and we only insert an item that has not yet been reported. This prevents duplicate reports.
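The sketch below shows this bookkeeping (C++, with a std::unordered_map standing in for the exact anomaly CQF; the names are ours):

    #include <cstdint>
    #include <unordered_map>

    struct AnomalyTable {
      // key -> stream index at which the key was reported
      std::unordered_map<uint64_t, uint64_t> reported;

      bool already_reported(uint64_t key) const {
        return reported.count(key) != 0;
      }
      // Record a report; later occurrences of the key are then dropped.
      void record(uint64_t key, uint64_t stream_index) {
        reported.emplace(key, stream_index);
      }
    };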
Timeliness. For the timeliness evaluation, we measure each item’s reporting delay after its Tth occurrence. We have two measures of timeliness: time stretch and count stretch.
The time-stretch LERT upper bounds the reporting delay of an item based on its lifetime (i.e., the time between its first and Tth instance). To validate the timeliness of the time-stretch LERT, we first perform an offline analysis of the stream and calculate the lifetime of each reportable item. Given a reporting threshold T, we record the index of the first occurrence of the item (I_1) and the index of the Tth occurrence of the item (I_T). During ingestion, we record the index (I_R) at which the time-stretch LERT reports the item. We calculate the time stretch (ts) for each reported item as ts = (I_R − I_1)/(I_T − I_1) and verify that ts ≤ 1 + ε.
    Multiple threads process chunks of 1024 observations from the input stream. We consider all reports a thread generates while processing the ith observation to occur at time i. Due to concurrency, two observations of the same key may be inserted into the data structure in a different order than they are pulled off of the input stream. This may introduce some noise in our time-stretch measurements. However, our experimental results with and without multi-threading were nearly identical, indicating that the noise is small.
In the count-stretch LERT, the upper bound is on the count of the item when it is reported. To validate timeliness, we first record the index at which each item is reported by the count-stretch LERT (I_R). We then perform an offline analysis to determine the count of the item at index I_R in the stream (C_R). We then calculate the count stretch (cs) as cs = C_R/T and validate that cs is at most the count-stretch guarantee.
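Both validation computations are simple ratios; in C++ (the index and count names follow the definitions above; hypothetical helpers, assuming the offline analysis supplies the indices):

    #include <cstdint>

    // I1: index of first occurrence; IT: index of the Tth occurrence;
    // IR: index at which the data structure reported the key.
    double time_stretch(uint64_t I1, uint64_t IT, uint64_t IR) {
      return double(IR - I1) / double(IT - I1);  // validated against 1 + eps
    }

    // CR: the key's true count in the stream at reporting index IR.
    double count_stretch(uint64_t CR, uint64_t T) {
      return double(CR) / double(T);             // validated against the bound
    }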
To perform the offline analysis of the stream, we first generate the stream from the active-set generator and dump it to a file. We then read the stream from the file, both for the analysis and for streaming it to the data structure. For the timeliness-validation experiments, we use a stream of 512 million observations from the active-set generator.
I/O performance. In our implementations of the time-stretch, count-stretch, and immediate-report LERT, we allocate space for the data structure by mmap-ing each level (i.e., each CQF) to a file on SSD. To force the data structure to keep all levels except the first on SSD, we limit the RAM available to the insertion process using the “cgroups” utility in Linux. We calculate the total RAM needed by the insertion process to keep only the first level in RAM by adding the size of the first level, the space used by the anomaly CQF to record reported keys, the space used by thread-local buffers, and a small amount of extra space to read the stream sequentially from SSD. We then provision the next power of two of this total as the process’s RAM.
To measure the total I/O performed by the data structure, we use the “iotop” utility in Linux, which reports the total amount of reads and writes (in KB) performed by the insertion process.
To validate, we calculate the total I/O the data structure should perform, based on the number of merges performed by the time-stretch LERT (shuffle-merges, in the case of the count-stretch LERT) and the sizes of the levels involved in those merges.
As with the empirical stretch validation, we first dump the stream to a file and then feed it to the data structure by streaming it from the file. We use a stream of 64 million observations from the active-set generator.
    Average insertion throughput and scalability. To measure the average insertion throughput, we first generate the stream from the active-set generator and dump it in a file. We then feed the stream to the data structure by streaming it from the file and measure the total time.
To evaluate scalability, we measure how data-structure throughput changes with an increasing number of threads. We evaluate power-of-2 thread counts between 1 and 64.
To deamortize the data structures, we divide them into 2,048 cones. We use a stream of 4 billion observations from the active-set generator. We evaluate insertion performance and scalability for three values (16, 32, and 64) of the DatasetSize-to-RAM ratio (i.e., the ratio of the dataset size to the available RAM).
Instantaneous insertion throughput. We also evaluate the instantaneous throughput of the data structure when run using either a single cone and thread or multiple cones and threads. We approximate instantaneous throughput by calculating throughput (using system timestamps) after each fixed-size batch of observations.
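A windowed measurement of this kind can be implemented in a few lines (C++; the window size and all names are arbitrary choices for illustration, not the authors' harness):

    #include <chrono>
    #include <cstdint>
    #include <cstdio>

    // Print throughput once per fixed-size window of observations.
    void note_observation(uint64_t seen, uint64_t window,
                          std::chrono::steady_clock::time_point &window_start) {
      if (seen % window != 0) return;
      auto now = std::chrono::steady_clock::now();
      double secs = std::chrono::duration<double>(now - window_start).count();
      std::printf("instantaneous throughput: %.0f observations/sec\n",
                  window / secs);
      window_start = now;
    }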
Machine specifications. The OS for all experiments was 64-bit Ubuntu 18.04 running Linux kernel 4.15.0-34-generic. The machine for all timeliness and I/O performance benchmarks had an Intel Skylake CPU (Core i7-6700HQ @ 2.60 GHz with 4 cores and 6 MB L3 cache), 32 GB RAM, and a 1-TB Toshiba SSD. The machine for all scalability benchmarks had an Intel Xeon CPU (E5-2683 v4 @ 2.10 GHz with 64 cores and 20 MB L3 cache), 512 GB RAM, and a 1-TB Samsung 860 SSD.
    For all the experiments, we use a reporting threshold of 24, since it is the default in the Firehose benchmarking suite.

    8.2 Timely Reporting

Cascade filter. Figure 4(a) and (b) show the distributions of count stretch and time stretch of reported items in the cascade filter. The cascade filter’s maximum count stretch is 3.0 and its maximum time stretch is 12, much higher than for any single-threaded count-stretch or time-stretch LERT.
    Fig. 4.
Fig. 4. Data structure configuration: RAM level: 8,388,608 slots in the CQF, levels: 4, growth factor: 4, level thresholds: (2, 4, 8), cones: 8, threads: 8, number of observations: 512M. Data structures: CF, count-stretch LERT (CSL), time-stretch LERT (TSL), (CSL and TSL) with cones, (CSL and TSL) with cones and threads. Time-stretch LERT with age bits 1 (TSL1), 2 (TSL2), 3 (TSL3), and 4 (TSL4).
Count-stretch LERT. Figure 4(a) validates the worst-case count stretch of the count-stretch LERT. The total on-disk count for an element is 14, so the maximum possible count when reported is 38 (i.e., 24 + 14), for a maximum count stretch of 1.583. The maximum reported count stretch is 1.583.
Time-stretch LERT. Figure 4(b) shows that the time-stretch LERT meets the time-stretch requirements. The maximum reported time stretch is 1.59, which is smaller than the maximum allowable time stretch of 2. Figure 4(c) shows the distribution of empirical time stretches for varying numbers of age bits. The time stretch of any reported element is always smaller than the maximum allowable time stretch. As the number of age bits increases, ε decreases and so does the time stretch.

    8.3 Robustness with Input Distributions

Figure 5(a) shows the robustness of the empirical time stretch (ETS) on four input distributions other than the Firehose power-law distribution. For all input distributions, the ETS is less than 2, the theoretical limit of the data structure.
    Fig. 5.
    Fig. 5. Data structure configuration: RAM level: 8,388,608 slots in the CQF, levels: 4, growth factor: 4, level thresholds for on-disk level: (2, 4, 8), cones: 8, threads: 8, number of observations: 512M.

    8.4 Effect of Deamortization/Threading

    Figure 4(a) and (b) show the effect of deamortization and multi-threading on timeliness in the count-stretch LERT and time-stretch LERT.
    Using 8 cones instead of one does not change the timeliness of any reported item. This is because the distribution of items in the stream is random (see Section 8.1) and we use a uniform-random hash function to distribute items to each cone. Each cone gets a similar number of items and the cones perform shuffle-merges in sync (refer to Section 6).
    Running the count-stretch and time-stretch LERT with 8 cones and 8 threads does affect timeliness of reported items. Some items are reported later than the theoretical upper bound. The reported maximum time- and count-stretch is 5. This is because each thread inserts items into a local buffer when it cannot immediately acquire the cone lock. We empty local buffers only when they are full. The maximum delay happens when an item’s lifetime is similar to the time it takes for a cone to incur a full flush involving all levels of the data structure. Figure 6 shows the stretch of reported items and their lifetime. The maximum-stretch items have a lifetime ≈16 M observations, which is the number of observations it takes for a cone to incur a full flush.
    Fig. 6.
    Fig. 6. Data structure configuration: RAM level: 8,388,608 slots in the CQF, levels: 4, growth factor: 4, level thresholds for on-disk level: (2, 4, 8), cones: 8, threads: 8, number of observations: 512M.

    8.5 Effect of Buffering

Figure 5(b) shows the empirical count stretch with three different buffering strategies. In the first, we use buffers without any constraint on the count of a key inside a buffer; we dump the buffer into the main data structure when it is full. In the second, we constrain the maximum count a key can have in a buffer to T/P (for T = 24 and P = 8, the max count is 3). In the third, we do not use buffers: threads try to acquire the lock on the cone and wait if the lock is not available.
The empirical stretch is smallest without buffers. However, not using buffers increases contention among threads and reduces insertion throughput; using buffers is faster.

    8.6 I/O Performance and Throughput

Figure 7 shows the total amount of I/O performed by the count-stretch, time-stretch, and immediate-report LERT while ingesting a stream. For all data structures, the calculated total I/O and the total I/O measured using iotop are similar.
    Fig. 7.
Fig. 7. Total I/O performed by the count-stretch, time-stretch, and immediate-report LERT. Data structure configuration: RAM level: 4,194,304 slots in the CQF, levels: 3, growth factor: 4, number of observations: 64M.
The count-stretch LERT does the least I/O, because it performs the fewest shuffle-merges. The I/O for the time-stretch LERT grows by a factor of two as the number of bins increases, as predicted by the theory. The I/O for the immediate-report LERT is similar to that of the time-stretch LERT with stretch 2. This shows that when item counts follow a power-law distribution, we can achieve immediate reporting with the same amount of I/O as with a time stretch of 2.
Insertion throughput. Figure 8(a) shows insertion throughput using the same configuration and stream as the total-I/O experiments. The count-stretch LERT has the highest throughput, because it performs the fewest I/Os. The immediate-report LERT has lower throughput, because it performs extra random point queries. The time-stretch LERT throughput decreases as we add bins and decrease the stretch.
    Fig. 8.
    Fig. 8. Data structure configuration for (a): RAM level: 4,194,304 slots in the CQF, levels: 3, growth factor: 4, number of observations: 64M. DatasetSize-to-RAM-ratio: 12.5. For (b): RAM level: 67,108,864 slots in the CQF, levels: 4, growth factor: 4, level thresholds for on-disk level(): (2, 4, 8), cones: 2,048 with greedy flushing, DatasetSize-to-RAM-ratio: 16, 32, and 64.
The Misra-Gries data structure’s throughput is 2.2 million operations/second in memory. This acts as a baseline for in-memory insertion throughput. The in-memory MG data structure is only twice as fast as the on-disk count-stretch LERT.

    8.7 Instantaneous Throughput

Figure 9 shows the instantaneous throughput of the count-stretch LERT. Deamortization and multi-threading improve both average throughput and throughput variance. With one thread and one cone, the data structure periodically stops processing inputs to perform flushes, causing throughput to crash to 0. With 1,024 cones and four threads, the system has much smoother throughput, never stops processing inputs, and has about 3× greater average throughput.
    Fig. 9.
    Fig. 9. Instantaneous throughput of the count-stretch LERT with 1 cone and 1 thread and 1,024 cones and 4 threads. Same configuration as Figure 4(a).

    8.8 Scaling with Multiple Threads

Figure 8(b) shows count-stretch LERT throughput with an increasing number of threads. Similar scalability holds for the other variants, since they all have the same insertion and SSD access patterns. Insertion throughput increases with thread count. We used three values of the DatasetSize-to-RAM ratio: 16, 32, and 64. All have similar scalability curves.

    9 Motivating National Security Application

In this section, we describe the more complex national-security setting that motivates our modeling constraints. Firehose [4, 5] is a clean benchmark that captures the fundamental elements of this setting. The TED problem in this article in turn distills the most difficult part of the Firehose benchmark. Therefore, our solutions have a direct line of sight to important national-security applications.
    An ideal solution for TED would have (1) no false negatives, (2) no false positives, (3) immediate reporting of a stream element that upon arrival hits the reporting threshold, and (4) speed sufficient to keep up with real sensor data streams. To better allow (1) and (4), in this article we relax (2) and (3). Our algorithms limit false positives to keys that are “close” to reportable and bound reporting delay by either time or count. Our use case explains why we can tolerate these relaxations. It also explains why we cannot relax the no-false-negative requirement. This critical aspect of the model means we cannot consider sampling-based or randomized algorithms for finding reportable items, since these can miss events.
    We are motivated by monitoring systems for national security [4, 5], where experts associate special patterns in a cyberstream to rare, high-consequence real-life events. These patterns are formed by a small number of “puzzle pieces,” as shown in Figure 10. Each piece is associated with a key such as an IP address or a hostname. The pieces arrive over time. When an entire puzzle associated with a particular key is complete, this is an event, which should be reported as soon as the final puzzle piece falls into place. In Figure 10, the first stage is like our TED problem algorithm, except that it must store puzzle pieces with each key rather than a count and the reporting trigger is a complete puzzle, not a count threshold.
    Fig. 10.
Fig. 10. The analysis pipeline that motivates our TED problem solution. Analysts associate a multi-piece pattern, represented by the 4-piece puzzle, with a high-consequence event. The pieces arrive slowly over time, mixed with innocent traffic in a high-throughput “firehose” stream. Our database stores many partial matches to the pattern and reports all complete instances of the pattern. There may still be a fair number of matches, which are pared down by an automated system to a small number (essentially droplets compared to the original stream) of matches worthy of human inspection.
    There can still be a fair number of matches to this special pattern, most of which are still not the critically bad event. This might overwhelm a human analyst, who would then not use the system. However, automated tools, shown in the second stage of Figure 10, can pare these down to the few events worthy of analyst attention.
The first-stage filter, like our TED problem solution, must handle a massively large, fast stream. It is reasonable to allow a few false positives in the first stage to improve its speed; the second stage can screen out almost all of these false positives, as long as the stream is significantly reduced. The second stage is a slower, more careful tool that cannot keep up with the initial stream. This second tool cannot, however, repair false negatives, since anything the first filter misses is gone forever. So the first tool cannot drop any matches to the pattern. Experts have gone to great effort to find a pattern that is a good filter for the high-consequence events. We do not allow false negatives, because the high-consequence events that match this carefully crafted pattern can and must be detected.
Each of these patterns is small with respect to the stream size, so the detection algorithm must be highly sensitive, that is, it must be able to support a small threshold T. The consequences of missing an event (a false negative) are so severe that it is not reasonable to risk facing those consequences just to save a little space. Thus we must save all partial patterns, motivating our use of external memory.
    The ability to tolerate a reporting delay depends upon how much lead time the search pattern gives before possible damage. There will be some additional delay from the second-stage testing. Reports are still “better late than never.” Even if some damage has occurred, the system operators still have significantly more information than they would have if they had received no report.
    The DoD Firehose benchmark captures the essence of this setting [5]. In Firehose, the input stream has (key,value) pairs. When a key is seen for the 24th time, the system must return a function of the associated 24 values. The most difficult part of this is determining when the 24th instance of a key arrives. Thus, like Firehose, the TED problem captures the essence of the motivating application.

    10 Conclusion

    This work bridges external-memory and streaming algorithms. By taking advantage of external memory, we can solve timely event detection problems at a level of precision that is not possible in the streaming model, and with little or no sacrifice in terms of the timeliness of reports.
    Even though streaming algorithms, such as Misra-Gries, were developed for a space-constrained setting, we show that they can be made efficient in the external-memory setting, where storage is plentiful but accessing the data is expensive.

    Acknowledgments

    We thank Tyler Mayer for helpful discussions.
    Image attributions for Figure 10: Fire Hydrant by Claire Jones, skill magic stream by Maxicons, puzzle pieces by Iconika, puzzle pieces by studiographic, water Drop by Aldiki Gustiyan Putra, and man and woman by Alice Design; all icons from the Noun Project (https://nounproject.com).

    Footnotes

    1
It is possible to prevent repeated queries for an item, but we allow them, since they do not hurt the asymptotic performance.
    2
    For each reported item, we set a flag in RAM that indicates it has been reported, to avoid duplicate reporting of events.
    3
In general, the power-law distribution may hold only above some minimum value x_min; for simplicity, we let x_min = 1.
    4
In principle, one could have power-law distributions with exponent α ≤ 1, but these distributions cannot be normalized and are not common [54].
    5
    Each reported item is stored in a separate table in RAM to avoid duplicate reporting of events.

    References

    [1]
    L. A. Adamic. 2008. Zipf, Power Law, Pareto: A ranking tutorial. HP Research. Retrieved from http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html.
    [2]
    Alok Aggarwal and Jeffrey Vitter. 1988. The input/output complexity of sorting and related problems. Commun. ACM 31, 9 (1988), 1116–1127.
    [3]
    Noga Alon, Yossi Matias, and Mario Szegedy. 1996. The space complexity of approximating the frequency moments. In Proceedings of the 28th Annual ACM Symposium on Theory of Computing (STOC’96). 20–29.
    [4]
    Karl Anderson. 2016. FireHose Benchmarking Streaming Architectures. Retrieved July 9, 2021 from https://www.clsac.org/uploads/5/0/6/3/50633811/anderson-clsac-2016.pdf.
    [5]
    Karl Anderson and Steve Plimpton. 2013. FireHose Streaming Benchmarks. Retrieved December 11, 2018 from https://github.com/stream-benchmarking/firehose.
    [6]
    Shivnath Babu and Jennifer Widom. 2001. Continuous queries over data streams. ACM SIGMOD Rec. 30, 3 (2001), 109–120.
    [7]
    Daniel Barbará. 1999. The characterization of continuous queries. Int. J. Cooperat. Inf. Syst. 8, 04 (1999), 295–323.
    [8]
Tim Bartrand, Walter Grayman, and Terra Haxton. 2017. Drinking Water Treatment Source Water Early Warning System State of the Science Review. Technical Report EPA/600/R-17/405.
    [9]
    Ran Ben-Basat, Gil Einziger, Roy Friedman, and Yaron Kassner. 2016. Heavy hitters in streams and sliding windows. In Proceedings of the 35th Annual IEEE International Conference on Computer Communications (INFOCOM’16). IEEE, 1–9.
    [10]
    Michael A. Bender, Jonathan W. Berry, Martin Farach-Colton, Justin Jacobs, Rob Johnson, Thomas M. Kroeger, Tyler Mayer, Samuel McCauley, Prashant Pandey, Cynthia A. Phillips, Alexandra Porter, Shikha Singh, Justin Raizes, Helen Xu, and David Zage. 2018. Advanced Data Structures for Improved Cyber Resilience and Awareness in Untrusted Environments: LDRD Report. Technical Report SAND2018-5404. Sandia National Laboratories.
    [11]
    Michael A. Bender, Alex Conway, Martin Farach-Colton, William Jannen, Yizheng Jiao, Rob Johnson, Eric Knorr, Sara McAllister, Nirjhar Mukherjee, Prashant Pandey, Donald E. Porter, Jun Yuan, and Yang Zhan. 2019. Small refinements to the DAM can have big consequences for data-structure design. In Proceedings of the 31st ACM on Symposium on Parallelism in Algorithms and Architectures (SPAA’19). 265–274.
    [12]
    Michael A. Bender, Martin Farach-Colton, Jeremy T. Fineman, Yonatan R. Fogel, Bradley C. Kuszmaul, and Jelani Nelson. 2007. Cache-oblivious streaming B-trees. In Proceedings of the 19th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA’07). 81–92.
    [13]
Michael A. Bender, Martin Farach-Colton, William Jannen, Rob Johnson, Bradley C. Kuszmaul, Donald E. Porter, Jun Yuan, and Yang Zhan. 2015. An introduction to B-trees and write-optimization. ;login: 40, 5 (Oct. 2015), 22–28.
    [14]
    Michael A. Bender, Martin Farach-Colton, Rob Johnson, Russell Kraner, Bradley C. Kuszmaul, Dzejla Medjedovic, Pablo Montes, Pradeep Shetty, Richard P. Spillane, and Erez Zadok. 2012. Don’t thrash: How to cache your hash on flash. Proc. VLDB 5, 11 (2012), 1627–1637.
    [15]
    Michael A. Bender, Martín Farach-Colton, Rob Johnson, Simon Mauras, Tyler Mayer, Cynthia A. Phillips, and Helen Xu. 2017. Write-optimized skip lists. In Proceedings of the 36th Symposium on Principles of Database Systems (PODS’17). ACM, 69–78.
    [16]
    Radu Berinde, Piotr Indyk, Graham Cormode, and Martin J. Strauss. 2010. Space-optimal heavy hitters with strong error bounds. ACM Trans. Database Syst. 35, 4 (2010), 26.
    [17]
    Kevin Beyer and Raghu Ramakrishnan. 1999. Bottom-up computation of sparse and iceberg cube. In ACM SIGMOD Record, Vol. 28. 359–370.
    [18]
    Arnab Bhattacharyya, Palash Dey, and David P. Woodruff. 2016. An optimal algorithm for l1-heavy hitters in insertion streams and related problems. In Proceedings of the 35th ACM Symposium on Principles of Database Systems (PODS’16). 385–400.
    [19]
    Prosenjit Bose, Evangelos Kranakis, Pat Morin, and Yihui Tang. 2003. Bounds for frequency estimation of packet streams. In Proceedings of the 28th International Colloquium on Structural Information and Communication Complexity (SIROCCO’03). 33–42.
    [20]
    Robert S. Boyer and J. Strother Moore. 1991. MJRTY—A fast majority vote algorithm. In Automated Reasoning. Springer, 105–117.
    [21]
Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, Jelani Nelson, Zhengyu Wang, and David P. Woodruff. 2017. BPTree: An ℓ2 heavy hitters algorithm using constant memory. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS’17). ACM, 361–376.
    [22]
    Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, and David P. Woodruff. 2016. Beating CountSketch for heavy hitters in insertion streams. In Proceedings of the 48th Annual Symposium on Theory of Computing (STOC’16). ACM, 740–753.
    [23]
    Lee Breslau, Pei Cao, Li Fan, Graham Phillips, and Scott Shenker. 1999. Web caching and Zipf-like distributions: Evidence and implications. In Proceedings of the Annual Joint Conference of the IEEE Computer and Communications Societies, Vol. 1. 126–134.
    [24]
    Gerth Stølting Brodal, Erik D. Demaine, Jeremy T. Fineman, John Iacono, Stefan Langerman, and J. Ian Munro. 2010. Cache-Oblivious dynamic dictionaries with update/query tradeoffs. In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’10). 1448–1456.
    [25]
    Gerth Stølting Brodal and Rolf Fagerberg. 2003. Lower bounds for external memory dictionaries. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’03). 546–554.
    [26]
    Adam L. Buchsbaum, Michael Goldwasser, Suresh Venkatasubramanian, and Jeffery R. Westbrook. 2000. On external memory graph traversal. In Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’00). 859–860.
    [27]
    Pedro Celis, Per-Ake Larson, and J. Ian Munro. 1985. Robin hood hashing. In Proceedings of the 26th Annual Symposium on Foundations of Computer Science (sfcs’85). IEEE, 281–288.
    [28]
    Sirish Chandrasekaran and Michael J. Franklin. 2002. Streaming queries over streaming data. In Proceedings of the 28th International conference on Very Large Data Bases. VLDB Endowment, 203–214.
    [29]
    Moses Charikar, Kevin Chen, and Martin Farach-Colton. 2002. Finding frequent items in data streams. In Proceedings of the International Colloquium on Automata, Languages, and Programming (ICALP’02). 693–703.
    [30]
    Aaron Clauset, Cosma Rohilla Shalizi, and Mark EJ Newman. 2009. Power-law distributions in empirical data. SIAM Rev. 51, 4 (2009), 661–703.
    [31]
    Alex Conway, Martin Farach-Colton, and Philip Shilane. 2018. Optimal hashing in external memory. In Proceedings of the 45th International Colloquium on Automata, Languages, and Programming (ICALP’18). 39:1–39:14.
    [32]
    Graham Cormode and Marios Hadjieleftheriou. 2010. Methods for finding frequent items in data streams. VLDB J. 19, 1 (2010), 3–20.
    [33]
    Graham Cormode and S Muthukrishnan. 2004. An improved data stream summary: The count-min sketch and its applications. In Proceedings of the Latin American Symposium on Theoretical Informatics. 29–38.
    [34]
    Graham Cormode and S. Muthukrishnan. 2005. An improved data stream summary: The count-min sketch and its applications. J. Algor. 55, 1 (2005), 58–75.
    [35]
    Graham Cormode and S. Muthukrishnan. 2005. What’s hot and what’s not: Tracking most frequent items dynamically. ACM Trans. Datab. Syst. 30, 1 (2005), 249–278.
    [36]
    Erik D. Demaine, Alejandro López-Ortiz, and J. Ian Munro. 2002. Frequency estimation of internet packet streams with limited space. In Proceedings of the European Symposium on Algorithms (ESA’02). Springer, 348–360.
    [37]
    Xenofontas Dimitropoulos, Paul Hurley, and Andreas Kind. 2008. Probabilistic lossy counting: An efficient algorithm for finding heavy hitters. ACM SIGCOMM Comput. Commun. Rev. 38, 1 (2008), 5.
    [38]
Min Fang, Narayanan Shivakumar, Hector Garcia-Molina, Rajeev Motwani, and Jeffrey D. Ullman. 1998. Computing iceberg queries efficiently. In Proceedings of the 24th International Conference on Very Large Databases (VLDB’98). 299–310.
    [39]
    Jose M. Gonzalez, Vern Paxson, and Nicholas Weaver. 2007. Shunting: A hardware/software architecture for flexible, high-performance network intrusion prevention. In Proceedings of the 14th ACM Conference on Computer and Communications Security (CCS’07). 139–149.
    [40]
    Jiawei Han, Jian Pei, Guozhu Dong, and Ke Wang. 2001. Efficient computation of iceberg cubes with complex measures. In ACM SIGMOD Record, Vol. 30. 1–12.
    [41]
    John Hershberger, Nisheeth Shrivastava, Subhash Suri, and Csaba D. Tóth. 2005. Space complexity of hierarchical heavy hitters in multi-dimensional data streams. In Proceedings of the 24th Symposium on Principles of Database Systems (PODS’05). ACM, 338–347.
    [42]
    John Iacono and Mihai Pătraşcu. 2012. Using hashing to solve the dictionary problem. In Proceedings of the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’12). 570–582.
    [43]
    Richard M. Karp, Scott Shenker, and Christos H. Papadimitriou. 2003. A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst. 28, 1 (2003), 51–55.
    [44]
    M. Kezunovic. 2006. Monitoring of power system topology in real-time. In Proceedings of the 39th Annual Hawaii International Conference on System Sciences, Vol. 10. 244b–244b. DOI:https://doi.org/10.1109/HICSS.2006.355
    [45]
    Kasper Green Larsen, Jelani Nelson, Huy L. Nguyen, and Mikkel Thorup. 2016. Heavy hitters via cluster-preserving clustering. In Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS’16). 61–70.
    [46]
    Quentin Le Sceller, ElMouatez Billah Karbab, Mourad Debbabi, and Farkhund Iqbal. 2017. SONAR: Automatic detection of cyber security events over the twitter stream. In Proceedings of the 12th International Conference on Availability, Reliability and Security.
    [47]
E. Litvinov. 2006. Real-time stability in power systems: Techniques for early detection of the risk of blackout [Book Review]. IEEE Power Energy Mag. 4, 3 (May 2006), 68–70.
    [48]
    Jianning Mai, Chen-Nee Chuah, Ashwin Sridharan, Tao Ye, and Hui Zang. 2006. Is sampled data sufficient for anomaly detection? In Proceedings of the 6th ACM SIGCOMM Conference on Internet Measurement. 165–176.
    [49]
    Gurmeet Singh Manku and Rajeev Motwani. 2002. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases. VLDB Endowment, 346–357.
    [50]
    Chad R. Meiners, Jignesh Patel, Eric Norige, Eric Torng, and Alex X. Liu. 2010. Fast regular expression matching using small TCAMs for network intrusion detection and prevention systems. In Proceedings of the 19th USENIX Conference on Security.
    [51]
    Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. 2005. Efficient computation of frequent and top-k elements in data streams. In Proceedings of the International Conference on Database Theory. Springer, 398–412.
    [52]
    Katsiaryna Mirylenka, Graham Cormode, Themis Palpanas, and Divesh Srivastava. 2015. Conditional heavy hitters: Detecting interesting correlations in data streams. VLDB J. 24, 3 (2015), 395–414.
    [53]
    Jayadev Misra and David Gries. 1982. Finding repeated elements. Sci. Comput. Program. 2, 2 (1982), 143–152.
    [54]
    Mark E. J. Newman. 2005. Power laws, Pareto distributions and Zipf’s law. Contemp. Phys. 46, 5 (2005), 323–351.
    [55]
    Patrick O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O’Neil. 1996. The log-structured merge-tree (LSM-tree). Acta Inf. 33, 4 (1996), 351–385.
    [56]
    Prashant Pandey, Michael A. Bender, Rob Johnson, and Robert Patro. 2017. A general-purpose counting filter: Making every bit count. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD’17). ACM, 775–787. DOI:https://doi.org/10.1145/3035918.3035963
    [57]
    Prashant Pandey, Shikha Singh, Michael A. Bender, Jonathan W. Berry, Martín Farach-Colton, Rob Johnson, Thomas M. Kroeger, and Cynthia A. Phillips. 2020. Timely reporting of heavy hitters using external memory. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1431–1446.
    [58]
    Shahid Raza, Linus Wallgren, and Thiemo Voigt. 2013. SVELTE: Real-time intrusion detection in the Internet of Things. Ad Hoc Netw. 11, 8 (2013), 2661–2674. http://dblp.uni-trier.de/db/journals/adhoc/adhoc11.html#RazaWV13.
    [59]
    Daniel Ting. 2018. Data sketches for disaggregated subset sum and frequent item estimation. In Proceedings of the International Conference on Management of Data. 1129–1140.
    [60]
    Shobha Venkataraman, Dawn Song, Phillip B. Gibbons, and Avrim Blum. 2005. New streaming algorithms for fast detection of superspreaders. In Proceedings of the Network and Distributed Systems Security Symposium (NDSS’05).
    [61]
    H. Yan, R. Oliveira, K. Burnett, D. Matthews, L. Zhang, and D. Massey. 2009. BGPmon: A real-time, scalable, extensible monitoring system. In Proceedings of the Cybersecurity Applications Technology Conference for Homeland Security. 212–223. DOI:https://doi.org/10.1109/CATCH.2009.28
    [62]
    Tong Yang, Haowei Zhang, Jinyang Li, Junzhi Gong, Steve Uhlig, Shigang Chen, and Xiaoming Li. 2019. HeavyKeeper: An accurate algorithm for finding Top-k elephant flows. IEEE/ACM Trans. Netw. 27, 5 (2019), 1845–1858.
    [63]
Yu Zhang, BinXing Fang, and YongZheng Zhang. 2010. Identifying heavy hitters in high-speed network monitoring. Science China Information Sciences 53, 3 (2010), 659–676.
