Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

p657 Dalt

Download as pdf or txt
Download as pdf or txt
You are on page 1of 13

Bayesian Sketches for Volume Estimation in Data Streams

Francesco Da Dalt Simon Scherrer Adrian Perrig


ETH Zürich ETH Zürich ETH Zürich
Zürich, Switzerland Zürich, Switzerland Zürich, Switzerland
fdadalt@student.ethz.ch simon.scherrer@inf.ethz.ch adrian.perrig@inf.ethz.ch

ABSTRACT Table 1: Comparison of sketches regarding accuracy, compu-


tation time and memory needed for queries (for a synthetic
Given large data streams of items, each attributable to a certain key
trace with 10𝑘 keys and Poisson-distributed key volumes,
and possessing a certain volume, the aggregate volume associated
6400 counters). CCB-Sketch is our algorithm. Seq-Sketch
with a key is difficult to estimate in a way that is both efficient
[20] is omitted as it performs worse than the PR-Sketch.
and accurate. On the one hand, exact counting with dedicated
counters incurs unacceptable overhead during stream processing.
On the other hand, sketch algorithms, i.e., approximate-counting Algorithm Rel. Error Time [ms] Memory [MB]
techniques that share counters among keys, have suffered from CM-Sketch [10] 99.092 10.1 27.5
a trade-off between accuracy and query efficiency: Classic sketch C-Sketch [12] 1.264 20.1 27.5
algorithms allow to compute rough estimates in an efficient way, CCB-Sketch 0.031 9.9 27.5
whereas more recent proposals yield highly accurate estimates at PR-Sketch [30] 0.024 ∼ 1.2 · 106 7260.1
the cost of greatly increased computation time.
In this work, we propose three sketch algorithms that overcome
this trade-off, computing highly accurate estimates with lightweight keys (also: per-key aggregation) is an essential task, which, however,
procedures. To reconcile these desiderata, we employ novel esti- is often time-critical and therefore challenging.
mation methods that rely on Bayesian probability theory, counter- Because of such high-speed requirements, a stream-processing
cardinality information, and basic machine-learning techniques. algorithm must operate exclusively in low-latency memory (e.g.,
The combination of these techniques enables highly accurate es- SRAM) to generate its analysis result. The limited availability of
timates, which we demonstrate by both a theoretical worst-case such low-latency memory prohibits the naive approach of keeping
analysis and an experimental evaluation. Concretely, our sketches a counter per key and adjusting the counter whenever encountering
allow to efficiently produce volume estimates with an average rela- the corresponding key, as data streams may contain billions of dis-
tive error of < 4%, which previous methods could only achieve with tinct keys. This requirement of processing efficiency has led to the
computations that are several orders of magnitude more expensive. development and usage of sketching techniques, which share coun-
ters among multiple keys and reconstruct key volumes from this
PVLDB Reference Format: compressed data structure [10, 12, 14, 20, 25, 26, 29, 30, 34]. Alas, this
Francesco Da Dalt, Simon Scherrer, and Adrian Perrig. Bayesian Sketches
compression naturally causes inaccuracy in the volume estimate. In
for Volume Estimation in Data Streams. PVLDB, 16(4): 657 - 669, 2022.
fact, classic sketch algorithms like the Count-Min Sketch [12] have
doi:10.14778/3574245.3574252
been shown to deliver high accuracy (e.g., relative errors below
PVLDB Artifact Availability: 10%) only with impractical memory consumption [20, 29]. Hence,
The source code, data, and/or other artifacts have been made available at classic sketch algorithms are subject to an undesirable trade-off
https://github.com/FrancescoDaDalt/bayes-sketch. between estimation accuracy and memory efficiency.
Recent research has made significant progress in mitigating this
1 INTRODUCTION trade-off by designing advanced query methods of sketch algorithms,
The analysis of data streams is a key component in numerous ap- i.e., the methods for computing a volume estimate from a synop-
plications, enabling functionality as diverse as topic mining from sis. In particular, Seq-Sketch [20] employs a compressed-sensing
text streams [13, 23], traffic monitoring and policing in the In- approach, and PR-Sketch [30] relies on solving a system of linear
ternet [3, 14, 19, 29, 33], data aggregation in sensor networks [24], equations, enabling both algorithms to achieve nearly zero relative
buffer dimensioning in VoIP networks [32, 35], and forecasting from error given a compact data-stream synopsis.
time-series data in financial markets [1, 4]. Such stream processing However, the accuracy improvements of these recent proposals
includes a single iteration over a data stream, which corresponds to come at the cost of query efficiency, i.e., the time and space com-
a sequence of items, each assignable to a certain key and possessing plexity of the query methods: Both Seq-Sketch and PR-Sketch
a certain volume. Estimating the total volume associated to certain solve regularized optimization problems with potentially millions
of variables, which introduces considerable complexity even with
This work is licensed under the Creative Commons BY-NC-ND 4.0 International state-of-the-art numerical solvers (cf. Table 1). However, for exam-
License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of
this license. For any use beyond those covered by this license, obtain permission by
ple, high-frequency trading must be highly responsive to changes
emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights in transaction streams [2], DDoS defense systems must identify
licensed to the VLDB Endowment. and block large flows as fast as possible to avert damage from other
Proceedings of the VLDB Endowment, Vol. 16, No. 4 ISSN 2150-8097.
doi:10.14778/3574245.3574252 flows [29], and data mining from sensor-network streams relies
on tight feedback loops for efficient operation of machinery or

657
vehicles [11]. For these usecases, the complex query operations of Table 2: Notation
contemporary high-accuracy sketches are a significant impediment.
In this paper, we tackle this problem of high-accuracy sketches by Symbol Description
proposing three sketching techniques that significantly reduce the
I Set of distinct keys in the data stream
computational cost of query execution and exhibit high accuracy
both in theory and practice. Each of these sketches embodies a 𝑒𝑖 ∈ I × R 𝑖-th item in the data stream
distinct query technique, each replacing the complex optimization- 𝑓𝑖 ∈ I Key associated with item 𝑒𝑖
based queries of previous high-accuracy sketches with a handful of 𝑞𝑖 ∈ R Volume associated with item 𝑒𝑖
closed-form evaluations. These closed-form evaluations result from 𝑎 ∈ Rℓ0 Vector of total volumes of all keys in I
an in-depth theoretical analysis of the volume-estimation problem. 𝑎𝑓 ∈ R Total volume, or size, associated with key 𝑓 ∈ I
To be specific, our query techniques rely on Bayesian probability 𝑎ˆ𝑋𝑓 ∈ R Estimate of 𝑎 𝑓 by algorithm 𝑋
theory and on counter-cardinality information, respectively. ℓ0 ∈ N Number of distinct keys in the data stream
In our first approach, we leverage Bayesian probabilistic reason- ℓ1 ∈ N Total stream volume (sum over 𝑎)
ing to derive the Count-Bayesian Sketch (CB-Sketch). More pre- ℓ2 ∈ N Squared 𝑙 2 norm of key-volume vector 𝑎 (|𝑎| 22 )
cisely, the CB-Sketch relies on the computation of an approximate 𝑑∈N Number of counter arrays
maximum a-posteriori (MAP) estimate for the volume of desired 𝑤 ∈N Size of each counter array
keys given the collected stream data. This estimate is captured by a 𝑉 ∈ R𝑑 ×𝑤 Counter arrays storing volume information
closed-form solution and can be computed very efficiently while 𝐶 ∈ R𝑑 ×𝑤 Counter arrays storing cardinality information
being highly accurate in general. Moreover, the Bayesian concept (𝐻 1, . . . , 𝐻𝑑 ) Indep. hash functions, 𝐻𝑔 : I → {1, . . . , 𝑤 }
of priors enables a sketch operator to encode previous knowledge
about the stream, and helps to achieve even higher accuracy.
In our second approach, we design the Cardinality-Count-Average
Sketch (CCA-Sketch), where queries take into account counter- We demonstrate the properties of our algorithms with a theo-
cardinality information, i.e., the number of distinct keys that were retical analysis (§5) and an experimental evaluation (§6).
mapped to any single counter. While counter-cardinality informa-
tion has been heuristically employed in previous sketch proposals, 2 BACKGROUND
we present the first rigorous theoretical quantification of the ac- 2.1 Problem Definition
curacy improvements achievable by such information. These in-
In per-key volume aggregation, a data stream corresponds to a
sights allow to optimally combine our two approaches into the
sequence of items (𝑒 1, 𝑒 2, . . . , 𝑒𝑀 ), where each item 𝑒𝑖 is associated
Cardinality-Count-Bayesian Sketch (CCB-Sketch), enabling queries
to a key 𝑓𝑖 and has a certain volume 𝑞𝑖 . 𝑓𝑖 is a key in I, |I| = ℓ0 ,
that leverage both Bayesian techniques and cardinality information.
which is the set of all distinct keys present in the data stream,
The theoretical guarantees presented in this paper are competi-
whereas 𝑞𝑖 may be any real number, positive or negative (i.e., the
tive with other query-efficient sketches. In addition to this theoret-
turnstile model). The goal of per-key volume aggregation is to
ical analysis, we also undertake an empirical evaluation, showing
estimate the aggregate volume 𝑎 𝑓 of a key 𝑓 , which is defined as
that our proposed sketches outperform their competitor algorithms
in relevant settings. In particular, our experiments indicate that im-
∑︁
𝑎𝑓 = {𝑞𝑖 | ∀𝑖 ∈ [1, . . . , 𝑀]. 𝑓𝑖 = 𝑓 } .
provements in estimation accuracy can be considerably enhanced
by the use of informed priors in the Bayesian sketches (CB- and For convenience, we define 𝑎 ∈ Rℓ0 as the vector containing the
CCB-Sketch). Moreover, both a theoretical complexity analysis and values 𝑎 𝑓 for all 𝑓 ∈ I as separate dimensions. Moreover, 𝑎ˆ𝑋𝑓
empirical evaluation confirm that our sketches enable queries that denotes the estimate of 𝑎 𝑓 calculated by algorithm 𝑋 .
are efficient regarding computation time and memory consumption. To enable volume estimation, some sketch algorithms such as
Our paper presents the following contributions: SeqSketch [20] and PR-Sketch [30] involve mechanisms to collect
• Bayesian sketching: We discuss how to leverage Bayesian the set I of distinct keys in the stream (key tracking). We do not
probability theory for sketching, enabling the derivation of concern ourselves with this problem, and assume that I is known.
both the CB-Sketch and the CCB-Sketch (§4). Both these However, we empirically show in Sections 6.6 and 6.7 that our
sketches arise from closed-form approximations to MAP esti- algorithms remain highly accurate under incomplete key tracking.
mates. To the best of our knowledge, the application of MAP
estimates to streaming algorithms is novel. 2.2 Trade-Offs in Previous Approaches
• Cardinality-based sketching: We provide an in-depth the-
oretical analysis of the interaction between cardinality and Traditionally, per-key volume estimation in the context of sketching
volume information in counters. The resulting insights allow to techniques, i.e., the computation of 𝑎ˆ𝑋𝑓 , has been performed by
employ counter-cardinality information in the CCA-Sketch. extracting information related to key 𝑓 from a data-stream synopsis,
• Reconciling accuracy and query efficiency: We present followed by some simple aggregation operation yielding 𝑎ˆ𝑋𝑓 . Since
three novel streaming algorithms (CB-Sketch, CCA-Sketch the estimate of each per-key volume is thus computed individually,
and CCB-Sketch), which differ from earlier sketches by com- we henceforth refer to these classic techniques as local sketches.
bining high estimation accuracy with high query efficiency. The Count Sketch (henceforth: C-Sketch) and the Count-Min Sketch
(henceforth: CM-Sketch) both fall into this category.

658
𝑤 𝑤 incoming stream items 𝑒𝑖 , and the query which returns a volume
+𝑞𝑖 ⊕1 estimate 𝑎ˆ 𝑓 of 𝑎 𝑓 for a key 𝑓 ∈ I based on the data structure.
Key 𝑓𝑖 The update algorithm in our sketches closely follows the up-
+𝑞𝑖 𝑑 ⊕1 date procedures in previous sketches [10, 12, 29, 30] and is briefly
Volume 𝑞𝑖
+𝑞𝑖 ⊕1 presented in Section 3.2.1.
In contrast, the query procedures of our sketches employ novel
Item 𝑒𝑖 Volume table 𝑉 Cardinality table 𝐶
techniques. In particular, the CB-Sketch (Section 3.2.2) is inspired
Figure 1: Illustration of data structure and update algorithm. by Bayesian probabilistic reasoning: Its query technique arises from
Note that the cardinality table 𝐶 may also be populated from a stochastic optimization problem which approximates the arg max
the key set I at query time (retroactively). of the posterior distribution of the key-associated volume, given
the data structure. In other words, the CB-Sketch performs an
approximate maximum a-posteriori (MAP) estimate of any queried
key, given the information collected about the observed stream.
Recently, however, sketching techniques have begun to compute
The CCA-Sketch, which is presented in Section 3.2.3, relies on
estimates for all keys (i.e., the total key-volume vector 𝑎) simulta-
cardinality information instead of Bayesian techniques. The insights
neously rather than one estimate at a time. We henceforth refer to
from the CCA-Sketch allow to optimally enrich the CB-Sketch
these approaches as global sketches. Among these global sketches,
with cardinality information, which produces the CCB-Sketch
the PR-Sketch [30] and Seq-Sketch [20] represent the most re-
that relies on both Bayesian reasoning and cardinality information
cent examples. Such global sketches have been shown to trade off
(presented in Section 3.2.4).
unprecedented accuracy for greatly increased time and memory
cost, as is illustrated by Table 1, which contains results from an 3.2.1 Data Structure Update. With streaming algorithms targeting
experimental comparison regarding accuracy (i.e., average relative applications such as traffic supervision on network links, the update
error) as well as computation time and peak memory consumption procedure must be extremely lightweight and time-efficient in order
at query time. In this paper, we propose three algorithms that fill the to reduce forwarding overhead and keep up with the item arrival
gap between these two types of sketches, i.e., providing accuracy rate. To that end, we employ an update procedure which is an
close to global sketches, while keeping the efficient query execution extension of a commonly used counter-update scheme [12, 30]: For
time of local sketches. We are able to reconcile these desiderata each incoming item 𝑒𝑖 , 𝑑 counter arrays are updated by adding item
by deriving compact closed-form solutions for estimates based on volume 𝑞𝑖 to each counter to which 𝑓𝑖 is hashed.
Bayesian probability theory, which also allow for the injection of Regarding the total number ℓ0 of distinct keys and the cardinality
an accuracy-boosting prior into the estimation process. table 𝐶, this paper does not propose novel methods for estimating
these quantities. This omission is conscious, as the best method of
3 DATA STRUCTURE AND ALGORITHMS deriving ℓ0 and 𝐶 depends on the context of the stream processing.
Several methods [15, 20, 29–31] have been proposed to obtain exact
In this section, we provide a first description of the CB-Sketch, the
or approximate values of ℓ0 and 𝐶.
CCA-Sketch and the CCB-Sketch. This description involves both
the data structure used in the sketches (§3.1) and the algorithmic 3.2.2 CB-Sketch. In the CB-Sketch, processing a query amounts
procedures that operate on the data structure (§3.2). to the computation of an approximate MAP estimate of the key
volume given the information data structure. As we will justify in
3.1 Data Structure Section 4.1, this estimate is given as follows:
In sketching algorithms, the data structure stores all information 𝜇𝐶𝐵 𝐶𝐵 −1 + 𝑤 · Í𝑑 𝑉 [𝑖, 𝐻 (𝑓 )] − 𝑑 · ℓ
𝑝 · ( 𝜒𝑝 ) 𝑖 1
𝑎ˆ𝐶𝐵 (1)
𝑖=1
necessary to answer key-volume queries. All three of our sketches =
𝑓
( 𝜒𝑝𝐶𝐵 ) −1 + 𝑑 · (𝑤 − 1)
are based on the same data structure, which consists of multiple
counter arrays and is similar to the data structures used by C- where 𝜇𝐶𝐵
𝑝 indicates the prior mean on 𝑎 𝑓 and 𝜒𝑝 is linearly pro-
Sketch [12] and CM-Sketch [10] (depicted in Figure 1). To be portional to the prior variance on 𝑎 𝑓 . Note that ℓ0 and 𝐶 are not
specific, the data structure contains a volume table 𝑉 ∈ R𝑑 ×𝑤 , used by the CB-Sketch and can thus be omitted.
where the rows correspond to 𝑑 counter arrays, each containing 𝑤 A rigorous derivation of the approximate MAP estimate together
counters. The 𝑑 counter arrays are associated with 𝑑 independent with an intuitive interpretation is provided in Section 4.1, whereas
hash functions (𝐻 1, . . . , 𝐻𝑑 ). In addition, the data structure also con- Section 5.1 analyzes the error bounds offered by the CB-Sketch.
tains a cardinality table 𝐶, which has the same size and is associated
3.2.3 CCA-Sketch. The CCA-Sketch is inspired by considerations
with the same hash functions, but stores cardinality information, i.e.,
to combine volume and cardinality counter information in a simple
𝐶 [𝑖, 𝑗] contains the number of distinct keys mapped to counter 𝑗 in
algorithm to improve aggregate volume sketch accuracy. Intuitively,
counter array 𝑖: 𝐶 [𝑖, 𝑗] = |{𝑓 ∈ I | 𝐻𝑖 (𝑓 ) = 𝑗 }|. Furthermore, the
the volume of a key can be more accurately derived from the vol-
data structure also keeps track of the total stream volume ℓ1 .
ume in an associated counter if the number of keys mapped to that
counter is considered, given that such cardinality may vary substan-
3.2 Algorithms tially across counters. To incorporate cardinality information, the
We separate the basic functionality of a sketch algorithm into two volume estimate can be based on the ratio between the volume and
parts: The update, which modifies the data structure based on the the cardinality of counters associated with a key. Such a ratio-based

659
approach has been employed by LOFT [29], although in the context operators plausibly have domain-specific knowledge about the
of top-𝑘 detection. To be viable for the more general problem of analyzed data streams, e.g., mean flow sizes in network settings,
key-volume estimation, the CCA-Sketch departs from the LOFT or typical measurements in sensor data streams. Second, even if
algorithm in several respects. Most importantly, the CCA-Sketch such domain knowledge is initially unavailable, it can be practically
applies an affine transformation to the volume-cardinality ratios. gathered over time because streaming algorithms typically process
This affine transformation has been chosen based on a theoretical large streams in segments, e.g., sketches in network settings handle
analysis of the statistical behavior of the volume-cardinality ratios, 1-2 seconds of traffic before they are reset [29]. Hence, information
which is presented in Section 5.2. Concretely, the CCA-Sketch collected from a previous stream segment can be used as surrogate
computes a volume estimate for key 𝑓 as follows: prior information for the current stream segment. The viability
Í 𝑉 [𝑖,𝐻 (𝑓 ) ] of this prior information is then determined by the volatility of
(ℓ0 + 𝑤 − 1) · 𝑑𝑖=1 𝐶 [𝑖,𝐻𝑖 (𝑓 ) ] − 𝑑 · ℓ1 the data-stream statistics. Third, previous research has proposed
𝑎ˆ 𝑓 (2)
𝐶𝐶𝐴 𝑖
=
𝑑 · (𝑤 − 1) specialized sketches that can estimate important moments [21] of
Despite the CCA-Sketch not stemming from a stochastic opti- a data stream and can hence yield information that can be used as
mization problem, it is related to the CB-Sketch for subtle reasons, a priors.
which will be discussed in Section 5.2. The MAP estimation problem in Equation 4 can be tackled in
numerous ways, which vary in design decisions regarding the data
3.2.4 CCB-Sketch. So far, we have demonstrated how volume structure, the prior distribution and the optimization method. In
sketching can be extended with Bayesian reasoning (in the CB- this paper, we present how the MAP estimate can be approximately
Sketch) and with cardinality information (in the CCA-Sketch). solved for the case where the data structure is the classic data struc-
Interestingly, these techniques can be combined, yielding the CCB- ture from Section 3, the prior distribution is a normal distribution,
Sketch with the following volume estimate for 𝑎 𝑓 (cf. Section 4.2): and the optimization method is a closed-form solution.
𝜇𝐶𝐶𝐵 𝑉 [𝑖,𝐻𝑖 (𝑓 ) ]
+ (ℓ0 − 1) · 𝑑𝑖=1 𝐶 [𝑖,𝐻 (𝑓
𝑝 Í
𝜒𝑝𝐶𝐶𝐵 𝑖 ) ]−1 − 𝑑 · ℓ1 4.1 Constructing the CB-Sketch
𝑎ˆ𝐶𝐶𝐵 = (3) The model for the CB-Sketch assumes the prior
𝑓 ℓ −𝐶 [𝑖,𝐻 (𝑓 ) ]
(𝜒𝑝𝐶𝐶𝐵 ) −1 + 𝑖=1 𝐶0 [𝑖,𝐻 (𝑓 𝑖) ]−1
Í𝑑
𝐶𝐵 2
𝑖
 
P[𝑎 𝑓 = 𝛼] = N 𝛼; 𝜇𝐶𝐵
𝑝 , (𝜎𝑝 ) (5)
where 𝜇𝐶𝐶𝐵
𝑝 and 𝜒𝑝𝐶𝐶𝐵 are the prior parameters relating to mean
and variance, respectively. Same as in CB-Sketch, 𝜇𝐶𝐶𝐵 and 𝜒𝑝𝐶𝐶𝐵 Furthermore, P[𝐷 | 𝑎 𝑓 = 𝛼] is concretized as the probability that
𝑝
can be used to add an independently selected bias to the final the volume counters in 𝐷 associated to 𝑓 attain their respectively
estimate. Said briefly, the CCB-Sketch is a direct upgrade compared stored value. Formally, this concretization results in:
to the CB-Sketch, in which the actual counter cardinality 𝐶 is 𝑑
∑︁
approximated with an expected value. 𝑎ˆ𝐶𝐵
𝑓 = argmax𝛼 log(P[𝑉𝑖 = 𝑣𝑖 | 𝑎 𝑓 = 𝛼]) + log(P[𝑎 𝑓 = 𝛼]) (6)
𝑖=1
4 BAYESIAN SKETCHING where 𝑣𝑖 := 𝑉 [𝑖, 𝐻𝑖 (𝑓 )] and 𝑉𝑖 is the RV associated to the 𝑖-th
As the main contribution, this paper presents a new perspective on volume counter of 𝑓 in the data structure.
designing streaming algorithms, which builds on Bayesian statistics Next, the RV 𝑉𝑖 can be expressed as the sum over all item volumes
and machine learning. More concretely, we model the data structure, that hash to a counter (I being the indicator function):
denoted as 𝐷, and the volume 𝑎 𝑓 of key 𝑓 as random variables. Both
∑︁
𝑉𝑖 = 𝑎 𝑓 + I {𝐻𝑖 (𝑓 )=𝐻𝑖 (𝑔) } 𝑎𝑔
RVs are clearly not independent and are linked by a non-trivial (7)
𝑔∈ I. 𝑔≠𝑓
probability distribution. A query for an estimate of 𝑎 𝑓 can hence
be refactored into the following optimization problem: Assuming that ℓ0𝑤 −1 is sufficiently large and that the total key
volumes are i.i.d. samples of some distribution, the CLT yields the
P[𝐷 | 𝑎 𝑓 = 𝛼]P[𝑎 𝑓 = 𝛼]
𝑎ˆ 𝑓 = argmax𝛼 P[𝑎 𝑓 = 𝛼 | 𝐷] = argmax𝛼 approximation:
P[𝐷] 
ℓ1 − 𝑎 𝑓 𝑤 − 1  
2
= argmax𝛼 log(P[𝐷 | 𝑎 𝑓 = 𝛼]) + log(P[𝑎 𝑓 = 𝛼]) (4) 𝑉𝑖 ∼¤ N 𝑎 𝑓 + , ℓ2 − 𝑎 (8)
𝑤 𝑤2 𝑓
In other words, we model a point query of key 𝑓 on a sketch data By inserting Equation 8 into Equation 6, we obtain:
structure as a maximum-a-posteriori (MAP) estimate of 𝑎 𝑓 .
The advantages of this model are twofold. First, optimization
𝑑
log ℓ2 − 𝛼 2
∑︁ 
𝑎ˆ𝐶𝐵 ≈ argmax − +
and inference is a well understood problem for which many algo- 𝑓 𝛼
2
𝑖=1
rithms exist. Secondly, MAP estimates allow to insert some prior  2 (9)
−𝛼 2
  
information into the estimate, i.e., P[𝑎 𝑓 = 𝛼]. In general, priors ∑︁𝑑 𝑣𝑖 − 𝛼 − ℓ1𝑤 · 𝑤2 𝛼 − 𝜇𝐶𝐵
𝑝
serve to nudge estimates towards more likely values and to make − 2
 − 𝐶𝐵 ) 2
2(𝑤 − 1) ℓ 2 − 𝛼 2(𝜎
predictions of extreme values unlikely. 𝑖=1 𝑝
In most scenarios, sketch operators have knowledge of some In this form, the problem does not have a closed-form solution.
moments of the key sizes in the data stream, while their exact Hence, we require a simplified surrogate problem that still yields a
distribution may be unknown. A limited amount of such prior reasonable solution. To that end, we note that the logarithmic term
information is usually available for three reasons. First, sketch in Equation 9 is dominated almost everywhere by the other terms

660
of Equation 9, i.e., it can influence the solution to the optimization parallel to the C-Sketch [10]. The C-Sketch makes use of a key-
problem by at most a factor proportional to O (𝑤 −1 ). It is hence specific signed multiplication step which causes counter values,
reasonable to drop the logarithmic term from Equation 9 to get a multiplied by the sign of some key 𝑓 , in expectation, to be linearly
tractable surrogate optimization problem, and we write: correlated to 𝑎 𝑓 and equal to 0 under the null hypothesis. Hence
both C- and CB-Sketch base their estimates of 𝑎 𝑓 on the difference
ℓ1 − 𝛼 2
 −1 ∑︁𝑑  
2 between an expected null hypothesis (0 and ℓ1 /𝑤 respectively)

𝑎ˆ𝐶𝐵 ≈ argmin 𝛼 2 ℓ − 𝛼 𝑣 𝑖 − 𝛼 − +
𝑓
𝑖=1
𝑤 and observed data. These observations help to explain why the
 2 (10) performance guarantees of the C-Sketch and the CB-Sketch are
(𝑤 − 1) ℓ2 − 𝛼 2 𝛼 − 𝜇𝐶𝐵 𝑝 ª similar (cf. §5.1) as both sketches do conceptually the same thing
and only differ in how the “null hypothesis” is found (hash-based
®
𝑤 2 · (𝜎𝑝𝐶𝐵 ) 2 ®
¬ signed multiplication and measurement of ℓ1 respectively).
In typical cases, we can assume (ℓ2 − 𝛼 2 ) to be approximately
relatively constant for reasonable instances of 𝛼 due to the assump- 4.2 Constructing the CCB-Sketch
tion of ℓ0 being very large. With that knowledge, we discard the Same as with the CB-Sketch, we model a prior distribution over
leading factor (ℓ2 − 𝛼 2 ) −1 in the RHS of Equation 10 and perform a the key volumes as:
translation that simplifies the optimization problem:
, (𝜎𝑝𝐶𝐶𝐵 ) 2
 
P[𝑎 𝑓 = 𝛼] = N 𝛼; 𝜇𝐶𝐶𝐵 (13)
(𝜎𝑝𝐶𝐵 ) 2 = 𝜒𝑝𝐶𝐵 ℓ2 − 𝛼 2 =⇒ lim 𝑎ˆ𝐶𝐵
 
𝑝
ℓ0 →∞ 𝑓 ≈
2 However, in contrast to the preceding subsection, P[𝐷 | 𝑎 𝑓 = 𝛼]
(11)

(𝑤 − 1) 𝛼 − 𝜇𝐶𝐵
𝑝 𝑑 
∑︁ ℓ1 − 𝛼 2

indicates the joint probability of volume counters and cardinality
argmin𝛼 + 𝑣𝑖 − 𝛼 − counters attaining their respective value, assuming key 𝑓 has vol-
𝑤 2 · 𝜒𝑝𝐶𝐵 𝑤
𝑖=1 ume 𝛼. Since counter cardinality and key volumes are independent
Solving the optimization problem amounts to taking the derivative from each other, we can move the probability of the cardinality
with respect to 𝛼 and setting it to zero, which produces the CB- counters attaining their respective values into the conditional:
Sketch query:
  𝑑
∑︁
𝐶𝐵 −1 + 𝑤 Í𝑑 𝑣 − 𝑑 · ℓ
𝜇𝐶𝐵
𝑝 (𝜒𝑝 ) 𝑖 1 𝑎ˆ𝐶𝐶𝐵 = argmax𝛼 log(P[𝑉𝑖 = 𝑣𝑖 | 𝑎 𝑓 = 𝛼, 𝐶𝑖 = 𝑐𝑖 ]) +
lim 𝑎ˆ𝐶𝐵 ≈
𝑖=1
(12)
𝑓
𝑖=1
(14)
ℓ0 →∞ 𝑓 (𝜒𝑝 ) + 𝑑 (𝑤 − 1)
𝐶𝐵 −1
log(P[𝑎 𝑓 = 𝛼])
Regarding this derivation, two issues must be noted. Firstly, Equa-
tion 12 does not compute the exact MAP estimate, but instead an where we abbreviate 𝑐𝑖 := 𝐶 [𝑖, 𝐻𝑖 (𝑓 )] and define 𝐶𝑖 as the RV
asymptotic approximation. Such an approximation is sub-optimal associated with 𝑐𝑖 .
but necessary due to intractability of the exact problem. Secondly, The derivation of the approximate solution of Equation 14 is very
𝜒𝑝𝐶𝐵 builds on (ℓ2 − 𝛼 2 ), which is unknown. In practice, however, similar to the derivation of Equation 12 and is omitted from this sec-
it is still possible to estimate (ℓ2 − 𝛼 2 ) and to subsequently choose tion to avoid repetition. The derivation of the closed-form estimate
a matching 𝜒𝑝𝐶𝐵 . Alternatively, an approach inspired from machine in the CCB-Sketch differs from the CB-Sketch in two high-level
respects. First, the distribution of 𝑉𝑖 conditioned on 𝑎 𝑓 = 𝛼 and
learning could consist of learning 𝜒𝑝𝐶𝐵 by grid search or more so-
𝐶𝑖 = 𝑐𝑖 is modeled as an RV, which is defined as a weighted sum
phisticated search schemes on 𝜒𝑝𝐶𝐵 . While the CB-Sketch is thus of entries from a multidimensional hypergeometric distribution,
based on a number of approximations, our empirical evaluation in which is then approximated by a normal distribution. This hyperge-
Section 6 shows that the CB-Sketch yields highly accurate esti- ometric distribution differs from the sum of weighted Bernoulli RVs
mates, confirming the validity of the approximations.. that were used for the CB-Sketch, and stems from the conditioning
Since the CB-Sketch arises from a solution to an optimization on 𝐶𝑖 = 𝑐𝑖 . Correctly modeling the influence of this conditional
problem, interpreting the algorithm is non-trivial. Assuming an 𝐶𝑖 = 𝑐𝑖 is key to obtaining a superior final estimate. The second
uninformed prior 𝜒𝑝𝐶𝐵 = ∞, Equation 12 is best understood the difference between derivations of the CB-Sketch and CCB-Sketch
following way: It (i) uses ℓ1 and 𝑤 to formulate a “null hypothesis” estimates concerns the prior variance, i.e., 𝜒𝑝𝐶𝐵 and 𝜒𝑝𝐶𝐶𝐵 . For the
regarding how big 𝑣𝑖 would be in expectation if 𝑎 𝑓 = 0 (namely ℓ1 /𝑤 CCB-Sketch, this prior variance is:
because the expected value of the counter is (ℓ1 − 𝑎 𝑓 )/𝑤 + 𝑎 𝑓 ), (ii)
computes the difference of the hypothesis to the actual counter (ℓ0 − 1) ℓ2 − 𝛼 2 − (ℓ1 − 𝛼) 2

(𝜎𝑝𝐶𝐶𝐵 ) 2 = 𝜒𝑝𝐶𝐶𝐵 (15)
value 𝑣𝑖 , and (iii) averages these differences across the 𝑑 counter ℓ0 − 2
arrays in order to estimate 𝑎 𝑓 . The denominator accounts for cor-
relation between the volume counters and ℓ1 , and normalizes the which is a less trivial expression compared to 𝜒𝑝CB in Equation 11.
numerator such that it yields an asymptotically bias-free estimate Same as with the CB-Sketch, we require 𝜒𝑝𝐶𝐶𝐵 in order to remove
of 𝑎 𝑓 , as is shown in Section 5.1. The prior parameters 𝜇𝐶𝐵 𝑝 and the unknown term ℓ2 −𝛼 2 from the optimization. The prior variance
𝜒𝑝𝐶𝐵 can add an arbitrarily strong bias to the prediction. is different for the two sketches because their probabilistic models
By relying on the difference between the hypothetical and the yield different terms to be optimized, requiring different mathemat-
actual stream volume, the CB-Sketch possesses an interesting ical simplifications in order to result in a useful final estimate. For

661
the CCB-Sketch, the approximate solution to Equation 14 is: Proof. Let 𝑓 be any key in the set I and the prior on 𝑎 𝑓 be
𝜇𝐶𝐶𝐵 ( 𝜒𝑝𝐶𝐶𝐵 ) −1 + (ℓ0 − 1) 𝑑𝑖=1 𝑐𝑖𝑣−1
Í 𝑖
− 𝑑 · ℓ1 uninformed, i.e., 𝜒𝑝𝐶𝐵 = ∞. We abbreviate 𝑣𝑖 = 𝑉 [𝑖, 𝐻𝑖 (𝑓 )]. Since 𝑣𝑖
𝑝
lim 𝑎ˆ 𝑓
𝐶𝐶𝐵
≈ ℓ0 −𝑐𝑖
(16) is the only source of uncertainty in 𝑎ˆ𝐶𝐵 , it is the only part required
ℓ0 →∞ −1
𝐶𝐶𝐵 Í 𝑑
(𝜒𝑝 ) + 𝑖=1 𝑐𝑖 −1 𝑓
to be modeled stochastically. By definition of 𝑣𝑖 , we know that
As with the CB-Sketch, the prior parameters may be learned or ∑︁
estimated. Moreover, the CCB-Sketch only approximately solves 𝑣𝑖 = 𝑎 𝑓 + I {𝐻𝑖 (𝑓 )=𝐻𝑖 (𝑔) } 𝑎𝑔
𝑔∈ I. 𝑔≠𝑓
Equation 14, but the empirical performance of the sketch confirms
the validity of these approximations. where I is an indicator variable and the only source of uncertainty
Intuitively, the term (ℓ0 − 1) 𝑑𝑖=1 𝑐𝑖𝑣−1 has an elegant interpreta- in the model. By assumption of universal hashing, I {𝐻𝑖 (𝑓 )=𝐻𝑖 (𝑔) }
Í 𝑖

tion analogous to a term in Equation 12. In fact, this term estimates are i.i.d. RVs that follow a Bernoulli distribution Ber (𝑤 −1 ). The
what 𝑑 · ℓ1 should have been if key 𝑓 did not exist: The sub-term sum over 𝑔 ∈ I, 𝑔 ≠ 𝑓 is a sum over ℓ0 − 1 random variables. Thus,
𝑣𝑖 (𝑐𝑖 − 1) −1 computes the average key volume in counter 𝑖, assum- by assumption, in the limit ℓ0𝑤 −1 → ∞ the CLT applies and:
ing the volume of that counter is attributed to only 𝑐𝑖 − 1 keys, one 
ℓ1 − 𝑎 𝑓 𝑤 − 1  
2
fewer than the true number of keys 𝑐𝑖 , which can be interpreted as lim 𝑣𝑖 ∼ N 𝑎 𝑓 + , ℓ2 − 𝑎 (18)
ℓ0 𝑤 −1 →∞ 𝑤 𝑤2 𝑓
non-existence of 𝑓 . Those estimates are then multiplied by (ℓ0 − 1),
i.e., again one fewer than the number of distinct keys, in order to Therefore, in the limit, the estimate 𝑎ˆ𝐶𝐵
𝑓
is distributed as:
predict what ℓ1 should have been. Hence, the numerator of Equa-
ℓ2 − 𝑎 2𝑓
tion 16 represents 𝑑 times the difference of what ℓ1 is and what
𝑎ˆ𝐶𝐵 ∼ N (19)
© ª
𝑎 ,
it should have been if key 𝑓 did not exist, assuming the prior is 𝑓
𝑑 (𝑤 − 1)
𝑓 ­ ®
uninformed. The denominator can be interpreted as extracting the « ¬
most likely estimate of 𝑎 𝑓 from the numerator. For Equation 19, we used the fact that all hash functions are pairwise
Both the CB-Sketch and the CCB-Sketch use the crucial infor- independent and hence ∀𝑖. 𝑣𝑖 are i.i.d.. This shows in primis that for
mation that key 𝑓 can be assigned to counters with certainty; hence, a uniform prior, in the limit, the CB-Sketch implements an unbiased
the value of these counters should be deflected by 𝑎 𝑓 compared to estimator. To bound the estimation error |𝑎ˆ𝐶𝐵 𝑓
− 𝑎 𝑓 |, we use the
their expected value if 𝑓 did not exist. In fact, the similarities can sub-gaussian concentration bound on the normal distribution:
also be shown quantitatively, as we observe that the counter-array
width 𝑤 in the numerator of the CB-Sketch estimate is close to
h i © 𝜖 2 · ℓ12 · 𝑑 (𝑤 − 1) ª
lim P |𝑎ˆ𝐶𝐵 − 𝑎 𝑓 | ≥ 𝜖 · ℓ1 ≤ 2 exp ­­−  ®®
the term (ℓ0 − 1)𝑐𝑖−1 in the numerator of the CCB-Sketch estimate.

2 · ℓ2 − 𝑎 2𝑓
ℓ0 𝑤 −1 →∞ 𝑓
The same kind of relation can be observed in the denominator be-
(20)
« ¬
tween 𝑑 (𝑤 − 1) and 𝑑𝑖=1 ℓ𝑐0𝑖−𝑐 −1 . Informally, the CCB-Sketch can
Í 𝑖
which concludes the proof. □
thus be seen as a direct upgrade of the CB-Sketch, as it uses the
same general recipe for estimating 𝑎 𝑓 , while substituting some 5.1.1 Comparison to CM-Sketch. The CM-Sketch [12] has the
information-deficient terms involving 𝑤 with more precise terms following characteristics:
based on measurements of counter cardinality 𝑐𝑖 .  
h i 𝜖 ·𝑑 ·𝑤
P |𝑎ˆ𝐶𝑀
𝑓 − 𝑎 𝑓 | ≥ 𝜖 · ℓ1 ≤ exp − (21)
𝑒
5 THEORETICAL ANALYSIS
At first sight, the CM-Sketch might seem to dominate the CB-
In this section, we present and prove theorems about the theoretical
Sketch because the linear 𝜖 reduces the exponential function in
performance of the CB-Sketch and CCA-Sketch, both in terms of
Equation 21 more quickly than the squared 𝜖 in Equation 17. How-
worst-case error bounds and execution complexity.
ever, the ratio of ℓ12 to (ℓ2 − 𝑎 2𝑓 ) in Equation 17 compensates this
5.1 CB-Sketch Analysis effect, as for practical scenarios, this term behaves linearly in the
number of keys ℓ0 . Since ℓ0 ≫ 𝜖 −1 , the additional 𝜖 in the expo-
This section presents the result of a theoretical analysis of the
nent of the CB-Sketch bound is counteracted. In practice, the CB-
accuracy of the CB-Sketch, followed by comparisons to the C-
Sketch effectively dominates the CM-Sketch as 𝜖 −1 is sub-linear
Sketch and CM-Sketch. We also present a runtime analysis.
in ℓ0 .
Theorem 1. Let 𝑎 𝑓 the aggregate volume of key 𝑓 ∈ I, and 𝑎ˆ𝐶𝐵 be
𝑓 5.1.2 Comparison to C-Sketch. The performance guarantees of
the C-Sketch [10] are given by:
𝐶𝐵
the estimate according to Equation 12 with a uniform prior 𝜒𝑝 = ∞.
In the limit ℓ0𝑤 −1 → ∞, assuming that max𝑔≠𝑓 |𝑎𝑔 |𝜅 𝑓 → 0,
© 𝜖2 · ℓ2 · 𝑑 · 𝑤 ª
P |𝑎ˆ𝐶𝑓 − 𝑎 𝑓 | ≥ 𝜖 · ℓ1 ≤ exp ­­−  1
where h i
√︂
 −1  ®® (22)
3 · ℓ2 − 𝑎 2𝑓

𝜅𝑓 = ℓ2 − 𝑎 2𝑓 (𝑤 − 1)
« ¬
the accuracy of this estimate is bounded by: From this, we observe that the C-Sketch has strikingly similar
performance guarantees compared to the CB-Sketch, which may
h i © 𝜖 2 · ℓ12 · 𝑑 (𝑤 − 1) ª also be superior depending on sketch dimensioning and stream
P |𝑎ˆ𝐶𝐵
𝑓 − 𝑎 𝑓 | ≥ 𝜖 · ℓ1 ≤ 2 exp ­− (17)
characteristics. This partially superior accuracy may stem from
­   ®®
2 · ℓ2 − 𝑎 2𝑓
« ¬ (1) approximations made in the computation of the CB-Sketch

662
estimate, and (2) a more sophisticated update procedure of the which results from collapsing the matrix-vector multiplications of
C-Sketch. Nevertheless, in our experimental evaluation (§6), we mean and variance in terms of ℓ2 and ℓ1 . These moments enable
demonstrate that the CB-Sketch outperforms the C-Sketch in expressing the distribution of 𝑣𝑖 · 𝑐𝑖−1 as a mixture distribution of
terms of accuracy in most scenarios. 𝑣𝑖 · 𝑐 −1 , where 𝑐 itself is distributed according to 𝑐𝑖 which is an
affine binomial distribution:
5.1.3 Time Complexity. Equation 12 indicates that the time re-  
quired to compute the estimate 𝑎ˆ𝐶𝐵 𝑓
is in O (𝑑). Hence, the CB- 𝑐𝑖 ∼ 1 + Bin ℓ0 − 1, 𝑤 −1 (26)
Sketch has the same asymptotic query complexity as the CM-
Sketch. Importantly, however, the C-Sketch query requires the Thanks to the law of total variance and the law of total expectation,
computation of a median of a vector of length 𝑑. While this median the moments of 𝑣𝑖 · 𝑐𝑖−1 are given by
can be computed in asymptotically linear time by the Quickselect  
 (𝑐 − 1) ℓ1 − 𝑎 𝑓
 
algorithm [18], in practice this algorithm is more expensive than a
 
𝑣𝑖 𝑎 𝑓 
= E𝑐∼𝑐𝑖  +

E
single list iteration, and has poor worst-case performance. More- 𝑐 (ℓ0 − 1)

𝑐𝑖  𝑐 
over, we emphasize that the update procedure of the C-Sketch
 
  
applies double the number of hash functions than the CB-Sketch, (𝑐 − 1) (ℓ0 − 𝑐)𝑚 (27)
 
𝑣𝑖
= E𝑐∼𝑐𝑖 2 +
which constitutes a significant performance impediment [25].
V 2
𝑐𝑖 𝑐 (ℓ0 − 1) (ℓ0 − 2)
(𝑐 − 1) (ℓ1 − 𝑎 𝑓 ) 𝑎 𝑓
 
5.2 CCA-Sketch Analysis V𝑐∼𝑐𝑖 +
𝑐 (ℓ0 − 1) 𝑐
The error analysis of the CCA-Sketch is considerably more in-
volved compared to the CB-Sketch, mainly due to the introduction The expressions in Equations 27 have no exact closed-form solu-
of cardinality information. However, although the CCA-Sketch tions. Hence, we use an approximation which is inspired by a work
is quite different from the CB-Sketch, the CCA-Sketch analysis on confidence intervals for ratios of binomial distributions [22].
uncovers an equivalence regarding accuracy between the sketches. This work was expanded upon to be applicable to powers of RVs
that approximately follow a normal distribution. More precisely,
Theorem 2. Let 𝑎 𝑓 be the aggregate volume of key 𝑓 ∈ I, and our approximation scheme involves (1) a log-transformation on
𝑎ˆ𝐶𝐶𝐴
𝑓
be the estimate according to Equation 2. Then, in the limit the ratio of RVs, which yield a difference of logarithms of RVs, and
ℓ0𝑤 −1 → ∞, assuming that max𝑔≠𝑓 |𝑎𝑔 |𝜅 𝑓 → 0, where (2) a number of linear approximations to remove the logarithm op-
eration. The result of applying this approximation scheme, whose
error converges to 0 for ℓ0𝑤 −1 → ∞, to Equations 27 leads to the
√︂
  −1
𝜅𝑓 = ℓ2 − 𝑎 2𝑓 (𝑤 − 1), following result in the limit ℓ0𝑤 −1 → ∞:
(ℓ1 − 𝑎 𝑓 ) + 𝑤 · 𝑎 𝑓 ℓ1 + (𝑤 − 1) · 𝑎 𝑓
 
the accuracy of this estimate is bounded by: 𝑣𝑖
E = 𝜇𝑈 = = (28)
𝑐𝑖 ℓ0 + 𝑤 − 1 ℓ0 + 𝑤 − 1
h i © 𝜖 2 · ℓ12 · 𝑑 (𝑤 − 1) ª
2 3 2
  
lim P |𝑎ˆ𝐶𝐶𝐴 − 𝑎 𝑓 | ≥ 𝜖ℓ1 ≤ 2 exp ­−   ®® © (𝑤 − 1) − 2(ℓ0 − 1) 𝑤 + (ℓ0 − 1) − 𝑤 𝑚 ª
ℓ0 𝑤 −1 →∞ 𝑓 ­
2 ℓ2 − 𝑎 𝑓 2 ­  ®
2 2

2)𝑤 1)𝑎
­ ®
(23) − (ℓ0 − (ℓ1 − (ℓ0 − )
« ¬  
𝑓
= 𝜎𝑈2 = − «
𝑣𝑖
which equals the bound on 𝑎ˆ𝐶𝐵 V ¬
𝑓
from Theorem 1. 𝑐𝑖 (ℓ0 − 2) (ℓ0 − 1) (ℓ0 + 𝑤 − 1) 4
(29)
Proof. In the following, 𝑓 is any key in the set I, and we ab-
breviate 𝑣𝑖 = 𝑉 [𝑖, 𝐻𝑖 (𝑓 )] and 𝑐𝑖 = 𝐶 [𝑖, 𝐻𝑖 (𝑓 )]. To analyze the The bias-free CCA-Sketch estimate, as presented in Equation 2,
CCA-Sketch, we require the distribution of 𝑣𝑖 /𝑐𝑖 . To that end, we follows from the solution of Equation 28.
first observe that In similar fashion, it is possible to show that 𝑣𝑖 · 𝑐𝑖−1 is approx-
  imately normally distributed: After applying a log-transform on
𝑣𝑖 | (𝑐𝑖 = 𝑐) ∼ [𝑎 1, . . . , 𝑎 ℓ0 ] · MDHGD 𝑐 − 1, 1ℓ0 − 𝑒®𝑓 + 𝑎 𝑓 (24) 𝑣𝑖 ·𝑐𝑖−1 and a linear approximation to the random logarithmic terms,
we observe that both 𝑣𝑖 and 𝑐𝑖 can be described as a sum of nu-
where MDHGD(𝑥, 𝑦) is a multidimensional hypergeometric dis-
merous independent RVs, where 𝑣𝑖 and 𝑐𝑖 share some co-variance.
tribution with 𝑥 draws from the “bins” defined by the vector 𝑦,
Hence, log(𝑣𝑖 · 𝑐𝑖−1 ) is approximately normally distributed. Finally,
𝑒®𝑓 is the 𝑓 ’th unit vector and 1ℓ0 is a vector of 1s of length ℓ0 . By
since the the log-normal distribution of 𝑣𝑖 · 𝑐𝑖−1 can be shown to
making use of the known moments of the MDHG distribution in
Equation 24, we can derive converge to a normal distribution in the limit ℓ0𝑤 −1 → ∞, in the
  sense that the difference of the CDFs converges to zero, we conclude
(𝑐 − 1) 1 − the analysis by stating that in the limit ℓ0𝑤 −1 → ∞:
1 ℓ 𝑎
 
𝑣𝑖 𝑓 𝑎𝑓
| (𝑐𝑖 = 𝑐) = E[𝑣𝑖 | (𝑐𝑖 = 𝑐)] = +
𝑑 · 𝜎𝑈2 (ℓ0 + 𝑤 − 1) 2
E
𝑐 (ℓ0 − 1)
!
𝑐𝑖 𝑐 𝑐 𝑣𝑖 
2
 1
∼ N 𝜇𝑈 , 𝜎𝑈 =⇒ 𝑎ˆ 𝑓 𝐶𝐶𝐴
∼ N 𝑑 · 𝑎𝑓 ,
1 (𝑐 − 1) (ℓ0 − 𝑐)𝑚 (𝑤 − 1) 2
 
𝑣𝑖 𝑐𝑖 𝑑
V | (𝑐𝑖 = 𝑐) = 2 V[𝑣𝑖 | (𝑐𝑖 = 𝑐)] = 2 (25)
𝑐 (ℓ0 − 1) 2 (ℓ0 − 2)
𝜎𝑈2 (ℓ0 + 𝑤 − 1) 2
𝑐𝑖 𝑐 !
2
= N 𝑎𝑓 , (30)
with 𝑚 = (ℓ0 − 1) ℓ2 − 𝑎 2𝑓 − ℓ1 − 𝑎 𝑓
  
𝑑 (𝑤 − 1) 2

663
precisely, if such cardinality information is not yet collected in
Empirical
Accuracy
CCA CCB
CB the update procedure (with cardinality estimators such as Hyper-
CM C LogLog [15]), the query procedure must reconstruct the cardinality
Stringency of theoretical worst-case bounds counters by mapping all keys to their respective counters in all
counter arrays, effectively requiring ℓ0 · 𝑑 hash-function evalua-
Figure 2: Accuracy characterization of sketch algorithms
tions. However, this reconstruction work is independent of the
discussed in this paper. Note that the characterizations of
number of performed volume estimates and is thus amortised if
the CB- and the CCB-Sketch relate to the untrained versions;
a large number of estimates is conducted. If the volume of every
the trained versions achieve higher average accuracy.
key is estimated, the query procedure requires exactly only one
additional iteration over the data structure per estimate. Note that
With the most important characteristics of 𝑣𝑖 𝑐𝑖−1 , we again apply the CCB-Sketch shares the same complexity as the CCA-Sketch.
a concentration bound on 𝑎ˆ𝐶𝐶𝐴
𝑓
:

𝜖 2 · ℓ12 · 𝑑 (𝑤 − 1) 2
h i
! 6 EXPERIMENTAL ANALYSIS
lim P |𝑎ˆ𝐶𝐶𝐴 − 𝑎 | ≥ 𝜖ℓ1 ≤ 2 exp − In this section, we complement the theoretical analysis from Sec-
2𝜎𝑈2 (ℓ0 + 𝑤 − 1) 2
𝑓 𝑓
ℓ0 𝑤 −1 →∞
tion 5 with an experimental evaluation, which is valuable for sev-
(31)
eral reasons. First, the preceding theoretical analysis focuses on
By inserting Equation 29 into Equation 31 and taking the limit
guarantees at certain probability levels, which are not necessarily
ℓ0𝑤 −1 → ∞ of the RHS, the bound behaves asymptotically like:
informative about the average accuracy of sketches in realistic set-
h i © 𝜖 2 · ℓ12 · 𝑑 (𝑤 − 1) ª tings. Second, the theoretical analysis provides little insight into
lim P |𝑎ˆ𝐶𝐶𝐴 − 𝑎 𝑓 | ≥ 𝜖ℓ1 ≤ 2 exp ­­−   ®® the performance of the CCB-Sketch, although Section 5.2 alluded
2 ℓ2 − 𝑎 2𝑓
ℓ0 𝑤 −1 →∞ 𝑓
that the CCB-Sketch is a direct upgrade to the CB-Sketch. Third,
(32) this section demonstrates how to improve the performance of the
« ¬

which concludes the proof. □ CB-Sketch and CCB-Sketch by injecting informative priors into
the estimation process. Fourth, none of the sketches presented in
Theorem 2 has two goals. First, the theorem shows that the this paper have yet been compared to global sketches such as the
CCA-Sketch shares worst-case accuracy guarantees with the CB- PR-Sketch [30] and Seq-Sketch [20]. Both of these sketches are
Sketch, and is hence superior to the CM-Sketch (cf. Section 5.1.1). expected to provide higher accuracy compared to the other sketches
Secondly, the results of the CCA-Sketch analysis may also apply mentioned in this paper, because these sketches reconstruct not
to the CCB-Sketch, which itself has proven difficult to treat ana- single key sizes 𝑎 𝑓 at a time, but the entire vector 𝑎 at once. Hence,
lytically. This similarity is plausible from both an algorithmic and these global sketches can capture the inter-dependency of key sizes
an empirical perspective. From an algorithmic perspective, the esti- in the reconstruction. However, global-sketching techniques can
mate 𝑎ˆ𝐶𝐶𝐵 revolves around 𝑑𝑖=1 𝑣𝑖 (𝑐𝑖 − 1) −1 , while the estimate
Í
𝑓 also be expected to come at the cost of considerably increased time
of the CCA-Sketch is based on 𝑑𝑖=1 𝑣𝑖 𝑐𝑖−1 , which are very similar
Í and space requirements for key-query operations.
quantities, especially if ℓ0 is large. From an empirical perspective,
the CCA-Sketch and the CCB-Sketch share similar performance 6.1 Evaluation Set-Up
characteristics, as is evidenced by empirical results in Section 6. All experiments presented in this section have been performed in a
Hence, we conjecture that the two sketches also share similar theo- common C++ testbed which fed the data stream to the data struc-
retical worst-case performance bounds. tures and computed error and performance statistics based on the
Conjecture 3. CCA-Sketch and CCB-Sketch share the same various estimates 𝑎ˆ𝑋 returned by the different sketches. All algo-
worst-case performance bounds. rithms were implemented in C++, where the local sketches were all
implemented manually, while Seq-Sketch made use of Armadillo
Importantly, while our proposed sketches may thus provide the
[27, 28] and the compressive-sensing library Kl1p [16]. PR-Sketch
same theoretical asymptotic worst-case guarantees, their effective
was implemented with the help of Armadillo by computing the
accuracy in experiments differs considerably, with the cardinality-
pseudo-inverse of the system matrix by means of sparse matrix SVD.
based sketches generally outperforming the CB-Sketch. Hence,
Manual implementations were parallelized with std::thread and
cardinality information can improve accuracy in most scenarios,
the other algorithms made use of Armadillo’s LAPACK powered
although these improvements are not visible in the bounds (cf.
parallelized linear algebra operations.
Figure 2).
All algorithms compared in this section were given the same
Since the CCA-Sketch is equivalent to the CB-Sketch regarding
amount of information and memory. In the case of Seq-Sketch,
its lower accuracy bounds, the comparison to the C-Sketch and
this equalization implies that Seq-Sketch does not need to detect
CM-Sketch follows from Sections 5.1.1 and 5.1.2.
which keys are present in the data stream, which is the goal of
5.2.1 Time Complexity. Like the CB-Sketch, the query procedure a dedicated component in Seq-Sketch. Instead, Seq-Sketch can
of the CCA-Sketch requires time O (𝑑), which is also the asymp- focus on reconstructing the known keys as well as possible.
totic complexity of the C-Sketch and the CM-Sketch. However, In terms of evaluation metrics, we present performance and error
the CCA-Sketch introduces a constant-factor runtime cost to ob- statistics evaluated on both synthetic and real data-stream traces.
tain the cardinality information which its query relies on. More We mainly use four statistics to compare the fitness of a sketch:

664
1600 400

Nr. of Counters
Nr. of Counters

25600 1600

102400 6400

10−1100 101 102 103 101 102 103 104 105 101 103 105 107 109 100 101 102 10−1 100 101 102 101 102 103 104 101 103 105 107 100 101 102
Avg. Rel. Err. Avg. Abs. Err. Neg. r2 Inv. Pear. Corr. Avg. Rel. Err. Avg. Abs. Err. Neg. r2 Inv. Pear. Corr.
C Sketch Trained CB Sketch CCB Sketch CB Sketch CCB Sketch Seq Sketch
CM Sketch CCA Sketch Trained CCB Sketch Trained CB Sketch Trained CCB Sketch PR Sketch
CB Sketch CCA Sketch

(a) Comparison to local sketches (Trace of 100𝑘 keys). (b) Comparison to global sketches (Trace of 10𝑘 keys).
Figure 3: Comparison based on Poisson trace.

1600 400
Nr. of Counters
Nr. of Counters

25600 1600

102400 6400

100 101 102 103 102 103 104 105 101 103 105 107 100 101 100 101 102 102 103 104 101 103 105 100 101
Avg. Rel. Err. Avg. Abs. Err. Neg. r2 Inv. Pear. Corr. Avg. Rel. Err. Avg. Abs. Err. Neg. r2 Inv. Pear. Corr.
C Sketch Trained CB Sketch CCB Sketch CB Sketch CCB Sketch Seq Sketch
CM Sketch CCA Sketch Trained CCB Sketch Trained CB Sketch Trained CCB Sketch PR Sketch
CB Sketch CCA Sketch

(a) Comparison to local sketches (100𝑘 keys). (b) Comparison to global sketches (10𝑘 keys).
Figure 4: Comparison based on geometric trace.

• Average relative error: ℓ0−1 |𝑎ˆ𝑋𝑓 − 𝑎 𝑓 |𝑎 −1 the trained versions of CB-Sketch and the CCB-Sketch, we rely
Í
𝑓 ∈I 𝑓
on a training trace which is synthesized identically to the test trace,
• Average absolute error: ℓ0 𝑓 ∈ I |𝑎ˆ 𝑓 − 𝑎 𝑓 |
𝑋 −1 Í
2 2
both regarding length and statistics. Given this training trace, we
• Negative 𝒓 : 1−𝑟 , where 𝑟 is the coefficient of determination
2
use a simple log-scale grid search in order to find the best prior
• Inverted Pearson correlation: Inverted sample Pearson cor- parameters 𝜒𝑝𝐶𝐶𝐵 and 𝜒𝑝𝐶𝐵 , where the optimization goal was the
relation between 𝑎ˆ𝑋 and 𝑎
average absolute error. We set the prior means 𝜇𝐶𝐶𝐵 and 𝜇𝐶𝐵
𝑝 to
Furthermore, we show data for experiments under different memory 𝑝
be the empirical mean on the training trace. Details on the cost of
constraints. To that end, the Nr. of Counters denotes the total number
training are elaborated in Section 6.5.
of counters used for the data structure, i.e., 𝑤 · 𝑑.
Relevant for experiments relating to runtime and memory con- 6.2.1 Observations. Figure 3a shows that the CB-Sketch is con-
sumption of the various algorithms in Section 6.8, the machine on siderably superior to the CM-Sketch, which can be expected from
which the experiments were run is a 2 + 8 core Apple Silicon M1 the theoretical analysis in Section 5.1.1. More surprisingly, the
Pro with 16 GB of memory. CB-Sketch also narrowly outperforms C-Sketch, although the
theoretical guarantees of the CB-Sketch are slightly less strong
6.2 Poisson Trace than those of the C-Sketch (cf. Section 5.1.2). Furthermore, the
In the first set of experiments, we compare the accuracy of our trained and untrained sketches clearly differ in the estimation errors
sketches to previous proposals on the basis of synthetic Poisson for both the CB-Sketch and the CCB-Sketch, where the trained
traces, i.e., traces in which the key sizes in the data stream were versions clearly outperform the untrained versions. This error-
sampled i.i.d. from a distribution 100 + Poi(100). reducing effect of training aligns with the expectations based on
For the untrained versions of the CB-Sketch and the CCB- the favorable distribution of the trace. We note that training the
Sketch, the priors are uninformed, i.e., 𝜒𝑝𝐶𝐶𝐵 = 𝜒𝑝𝐶𝐵 = ∞. For priors with respect to the average relative error also improves the

665
average absolute error and the 𝑟 2 score while leaving the correlation PR-Sketch in terms of average relative and absolute error. This sur-
unaffected. In fact, the correlation and ordering of 𝑎ˆ𝑋𝑓 is guaranteed prising result shows that the injection of a prior can vastly improve
to be invariant to the prior in all cases. the accuracy of the sketch, in some cases even outperforming global
With regards to the global sketches, we observe from Figure 3b sketches.
that the PR-Sketch dominates both trained and untrained versions Comparing the proposed sketches among each other, we find that
of the proposed sketches, although only by a minuscule margin for the CB-Sketch, CCA-Sketch and CCB-Sketch have very similar
certain memory constraints. This high accuracy of the PR-Sketch performances in this experiment, where the latter has marginally
is partially to be expected, especially for lax memory constraints better performance compared to the former two, again supporting
since the PR-Sketch approaches the interpolation regime, where Conjecture 3
estimates inevitably lie within machine precision. Surprisingly, the
Seq-Sketch is inferior to our proposed sketches in terms of errors 6.5 Training and Prior Tuning
and correlation. Finally, we observe that the CCB-Sketch and the From the previous sections, the value of learning suitable priors for
CCA-Sketch are qualitatively indistinguishable regarding their the Bayesian sketches becomes apparent. To illustrate the trade-
accuracy, which supports Conjecture 3. offs in tuning the prior parameters for the trainable sketches, we
present Figure 6, which shows data collected on a Poisson trace.
6.3 Geometric Trace In terms of all error statistics, we observe the following behaviors:
To explore the case in which the priors of our Bayesian sketches (i.e., The CCB-Sketch yields strictly better performance compared to
normal distributions) poorly match the distribution of aggregated the CB-Sketch, no matter which common prior is chosen. Sec-
key-volumes, we perform the experiments in the same way as in the ondly, we note that the CCB-Sketch attains its optimum at higher
preceding section, except that the Poisson distribution is substituted values of 𝜒𝑝𝑋 compared to the CB-Sketch. This higher reliance of
by a Geometric distribution and the prior training is conducted with the CB-Sketch on narrow localization of the prior (i.e., low prior
the negative 𝑟 2 score as optimization goal. variance 𝜒𝑝𝑋 ) is expected as the CCB-Sketch incorporates more
information into its estimates compared to the CB-Sketch. There-
6.3.1 Observations. Figure 4a suggests that the proposed sketches fore, the CCB-Sketch is in general less reliant on prior knowledge.
dominate both the C-Sketch and CM-Sketch even though the Thirdly, in all settings, the error function has a unique minimum,
prior distribution significantly differs from the actual distribution which is neither 𝜒𝑝𝑋 = 0 nor 𝜒𝑝𝑋 = ∞. Hence, 𝜒𝑝𝑋 can always be
of key sizes. The likely reason for such high accuracy is that the chosen such that the resulting estimates are better than both pre-
geometric distribution has a short tail and hence normal approxima- dicting the mean and predicting with uniform priors. The highest
tions still work out sufficiently well. As is displayed in Figure 4b, the accuracy is thus achieved through an appropriate combination of
PR-Sketch dominates all proposed algorithms in this experiment, prior knowledge and data stream information. In particular, when
whereas the Seq-Sketch is again consistently inferior. Also, we the prior mean is close to the average key size in the data stream,
once again observe qualitatively identical performance statistics for both the CCB-Sketch and the CB-Sketch allow a selection of 𝜒𝑝𝑋
the CCB- and CCA-Sketch, yielding additional empirical evidence
such that the 𝑟 2 score is higher than 0, i.e., the estimation is better
for Conjecture 3.
than simply taking the average key size as estimate. Lastly, we also
see that 𝜇𝑝𝑋 = 0 can be used effectively for error reduction, although
6.4 CAIDA Trace
Figure 6 suggests only slim gains for the CCB-Sketch.
To complement the synthetic-trace experiments in the preceding In all experiments presented in this paper, the training method
sections, we repeat the experiments with a CAIDA trace [9], which was a simple exponential grid search with 200 steps that minimized
involves network-packet data from a commercial backbone link a given error statistic averaged over all training keys in the stream.
and contains roughly 1 million unique flows (i.e., keys). For the sake Hence, the time required for training was roughly 200 times the
of keeping the scale of experiments manageable, we use a random time required for evaluating the sketch on all training keys in the
subset of 100k and 10k keys to compare the proposed algorithms to data stream. This training cost thus depends on how dense the
local and global sketches, respectively. The priors are determined grid search is, and how costly the error-statistic computation is.
with a training trace which was sampled randomly from the whole Moreover, we note that the training of our sketches is only rarely
CAIDA trace, and with average relative error as optimization goal. performed (potentially only once for certain use-cases), and the cost
6.4.1 Observations. Figure 5a shows that the proposed methods of training can therefore be amortized over many sketch evaluations.
dominate the CM-Sketch for all choices of data-structure size, Additionally, training is temporally and spatially detached from
which is in line with the results of the theoretical analysis and testing and can thus be scheduled at advantageous times on less
the synthetic-trace experiments. In turn, our proposed sketches are critical hardware.
dominated by the C-Sketch if uninformed priors are used. However,
the trained methods dominate all local sketches in terms of average 6.6 Noisy Cardinality Information
relative and average absolute error by a considerable margin. In practice, the full set of keys I (and thus the cardinality counters)
Regarding global sketches, shown in Figure 5b, we again observe cannot always be collected without error. Therefore, the effect of
that the PR-Sketch outperforms the untrained sketches, whereas faulty information on the accuracy of the CCB-Sketch deserves
the Seq-Sketch is inferior to them. In contrast to previous ex- further attention. To quantify this effect, we evaluate the accuracy
periments, however, the trained methods still manage to beat the of the CCB-Sketch on a Poisson trace for different noise levels,

666
1600 400
Nr. of Counters

Nr. of Counters
25600 1600

102400 6400

101 102 103 104 104 105 106 10−1100 101 102 103 104 100 102 103 104 105 100 101 102 103 100 101
Avg. Rel. Err. Avg. Abs. Err. Neg. r2 Inv. Pear. Corr. Avg. Rel. Err. Avg. Abs. Err. Neg. r2 Inv. Pear. Corr.
C Sketch Trained CB Sketch CCB Sketch CB Sketch CCB Sketch Seq Sketch
CM Sketch CCA Sketch Trained CCB Sketch Trained CB Sketch Trained CCB Sketch PR Sketch
CB Sketch CCA Sketch

(a) Comparison to local sketches (Subsample of 100k keys). (b) Comparison to global sketches (Subsample of 10k keys).
Figure 5: Comparison based on CAIDA trace.

Figure 6: Statistics of trained sketches as a function of the tuning parameter $\chi_p^X$. "Zero" refers to the setting $\mu_p^X = 0$, and "Perf." refers to the case where $\mu_p^X$ is set to the training-trace empirical mean. The triangles indicate the minima of the plotted curves.

Figure 7: The curves plotted are the minimum, 25th, 50th, and 75th percentiles and the maximum of the average relative errors recorded across 50 simulations for different noise levels, i.e., the fraction of keys missing during the computation of the cardinality information. The y-axis in the left plot is truncated.

Figure 7 shows how the performance of the CCB-Sketch degrades with increasing noise level. However, we also observe that the trained version suffers much less from increasing noise levels because it can compensate for the poor data-stream information by choosing an adequate prior (note that the training was also conducted on a training trace that is equally noisy as the test trace). Given high noise levels, the prior becomes dominant in the trained CCB-Sketch, making the sketch predict more or less the mean $\mu_p^{CCB}$. However, modern key-tracking mechanisms achieve noise levels below 5% [30], for which the CCB-Sketch still achieves high accuracy.
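A minimal sketch of how such noise can be injected, assuming the key set is available as a Python set (the function name is ours):

import random

def noisy_key_set(keys, noise_level, seed=0):
    # A noise level of 0.1 means a random 10% of keys are unknown
    # when the cardinality information is computed at query time.
    rng = random.Random(seed)
    return {k for k in keys if rng.random() >= noise_level}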
Figure 8: System simulations on various real datasets.
6.7 Stress-Testing Simulations
To further evaluate the CB- and CCB-Sketches in realistic settings, Figure 8 displays results of simulations performed in an environment with three challenging properties. First, we use the UNIV1 [5, 7], KOSARAK [6], and RETAIL [8] datasets, which in parts heavily conflict with the normal assumptions of the Bayesian sketches. Second, to account for concept drift, we introduce temporal distance between training and testing: for KOSARAK and RETAIL, training and testing are done on the first and last 10% of the datasets, respectively; and for UNIV1, the first part of the trace is used for training, while testing is performed on three increasingly distant parts (2, 5, and 20). Third, realistic noise in the cardinality information is induced by using a Bloom filter as key tracker (as in the Seq-Sketch).
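For illustration, the following is a minimal Bloom-filter key tracker of the kind this setup assumes; the sizing parameters and hash construction are our choices, not the implementation used in the experiments.

import hashlib

class BloomFilter:
    # Keys inserted during stream processing can later be tested for
    # membership; false positives make the recovered key set (and hence
    # the cardinality information) noisy, while true keys are never lost.
    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.bits = bytearray(num_bits // 8)
        self.num_bits = num_bits
        self.num_hashes = num_hashes

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(key))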
For UNIV1, RETAIL, and KOSARAK, the sample kurtosis of the total key volumes well exceeds 10k in all cases, and visual tests suggest that the variances are not necessarily well defined, which makes learning priors difficult. While the largest single key volume in parts 1 and 2 of UNIV1 makes up about 5% of the total volume, in parts 5 and 20 the ratio of the maximal key volume to the total trace volume increases to 20% and 85%, respectively, which breaks asymptotic approximations in the proposed sketches.
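These tail diagnostics can be reproduced with a short helper, assuming `volumes` holds the total per-key volumes of a trace (the helper and its name are illustrative):

import numpy as np
from scipy.stats import kurtosis

def tail_diagnostics(volumes):
    # Returns the sample kurtosis and the fraction of the total trace
    # volume contributed by the single largest key, the two heavy-tail
    # indicators discussed above.
    volumes = np.asarray(volumes, dtype=float)
    return kurtosis(volumes), volumes.max() / volumes.sum()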
6.7.1 Observations. Despite unfavorable trace characteristics, especially in KOSARAK, RETAIL, and UNIV1 Pt2, the learned priors give the CB- and CCB-Sketch a clear advantage over other local sketches, despite considerable temporal distance between training and testing. We also observe that the proposed sketches struggle with parts 5 and 20 of UNIV1. This is due to two factors: an increasing discrepancy between the training and testing statistics, and an increasingly extreme heaviness of the tail. It is worth noting that the PR-Sketch also appears to suffer considerably from increasing heaviness of the tail, which suggests the relevance of that distribution property. The performance difference between the CB- and CCB-Sketch is minuscule and slightly favors the former over the latter. Noisy cardinality information plays a role, but the bigger factor, considering simulations from previous sections, seems to be the heavy tail of the key-volume distribution, which leads to misinterpretation of measurements under the normal assumption of the sketches. This effect is stronger for the CCB-Sketch, which is more receptive, and hence more exposed, to measurement data.

Figure 9: Query time [ns] and peak memory usage [MB] of a query operation on all keys in the data stream, plotted against the number of counters.

6.8 Time and Space Complexity
The goal of this work is to devise sketches that combine the accuracy of global sketches with the query efficiency of local sketches. While the preceding sections demonstrate that the trained Bayesian sketches are competitive with local sketches, we focus on the query efficiency of our sketches in this section.
To demonstrate the differences in space and time complexity of the various algorithms, Figure 9 shows data collected from an experiment with 5k keys in the data stream and compares the time and peak memory necessary to query every $\hat{a}_f^X$ for $f \in \mathcal{I}$. All local sketches have approximately the same peak memory usage, since they all require information about only one flow $f$ in order to estimate $a_f$. The computation of $\hat{a}_f^X$ does not incur substantial additional memory allocation in the case of local sketches, since the computation only amounts to a form of weighted aggregation. Among the local sketches, only the C-Sketch stands out, as the computation of a median is more expensive than a simple cumulative aggregation over the array. Also note that the data structure itself consumes negligible memory compared to the memory required for query evaluation; hence, the CCA-Sketch and the CCB-Sketch are not noticeably more expensive than other local sketches, despite additionally keeping a cardinality table.
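To illustrate why the median-based query stands out, compare the following simplified query procedures; the hash and sign callables are assumed, and these are textbook formulations rather than the exact implementations benchmarked here.

import statistics

def cm_query(counters, rows, key, hash_fn):
    # Count-Min query: a minimum over one counter per row, O(rows).
    return min(counters[r][hash_fn(r, key)] for r in range(rows))

def c_query(counters, rows, key, hash_fn, sign_fn):
    # Count-Sketch query: a median of signed counters. Selecting the
    # median costs more than a simple min or weighted sum, which is
    # why the C-Sketch stands out among the local sketches in Figure 9.
    values = [sign_fn(r, key) * counters[r][hash_fn(r, key)]
              for r in range(rows)]
    return statistics.median(values)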
More importantly, we see that the global sketches are much more expensive than all local sketches regarding memory and runtime costs. The PR-Sketch is the most expensive evaluated sketch, as its peak memory consumption and execution time increase rapidly due to the SVD computation, despite being optimized for sparse matrices. Similarly, the Seq-Sketch suffers from rapidly increasing peak memory consumption and time complexity, although its memory consumption is asymptotically lower than that of the PR-Sketch. For real-world applications involving a large number of keys, querying the PR-Sketch would likely require several hundreds of GB and up to an hour of computation time on a high-end mainframe, whereas the CCB-Sketch can be queried in near-real time on relatively modest hardware.
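A minimal harness of the kind that could produce such measurements is sketched below; the `sketch.query` interface is an assumption, and `tracemalloc` only tracks allocations made by the Python process.

import time
import tracemalloc

def benchmark_queries(sketch, keys):
    # Measure total query time [ns] and peak memory [MB] for querying
    # the volume estimate of every key in the stream.
    tracemalloc.start()
    start = time.perf_counter_ns()
    estimates = [sketch.query(key) for key in keys]
    elapsed_ns = time.perf_counter_ns() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return estimates, elapsed_ns, peak_bytes / 1e6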

7 CONCLUSION
In this paper, we present the derivation, analysis, and evaluation of three novel stream-processing algorithms. Both our theoretical analysis (§5) and experimental results (§6) show that these sketches combine the strengths of lightweight but not very accurate sketches (e.g., the C-Sketch) and more accurate but significantly costlier methods (e.g., the PR-Sketch). Hence, our sketches enable high accuracy at low query cost: Typically, the proposed CCB-Sketch is orders of magnitude more accurate than the C-Sketch while being significantly cheaper than the PR-Sketch. This is also the case when the modeled prior and the actual data-stream statistics belong to different parametric families.
This reconciliation of estimation accuracy and query efficiency decisively propels real-world applications that rely on stream processing. For example, emerging QoS systems based on bandwidth reservation [17] must quickly identify flows that overuse their reservation, and they require the highly efficient and highly accurate flow-size estimation that only the sketches in this paper can guarantee.
Moreover, we emphasize that our paper is only an initial exploration of the research opportunities that are opened up by combining Bayesian techniques with sketching algorithms. In future work, it will be of interest to investigate the value of different priors (e.g., exponential distributions), other approximate-inference techniques (e.g., MCMC), and hybrid approaches based on constrained optimization (as in the PR-Sketch) and Bayesian techniques.
REFERENCES
[1] Rakesh Agrawal, Ramakrishnan Srikant, et al. 1994. Fast algorithms for mining association rules. In Proc. 20th int. conf. very large data bases, VLDB, Vol. 1215. Citeseer, 487–499.
[2] Pablo Basanta-Val, Norberto Fernandez-Garcia, Luis Sánchez-Fernández, and Jesus Arias-Fisteus. 2017. Patterns for distributed real-time stream processing. IEEE Transactions on Parallel and Distributed Systems 28, 11 (2017), 3243–3257.
[3] Ran Ben Basat, Gil Einziger, Roy Friedman, and Yaron Kassner. 2017. Randomized admission policy for efficient top-k and frequency estimation. In IEEE INFOCOM 2017-IEEE Conference on Computer Communications. IEEE, 1–9.
[4] Michael Bender and Slobodan Simonovic. 1994. Time-series modeling for long-range stream-flow forecasting. Journal of Water Resources Planning and Management 120, 6 (1994), 857–870.
[5] Theophilus Benson, Aditya Akella, and David Maltz. 2010. Network Traffic Characteristics of Data Centers in the Wild. Proceedings of the ACM SIGCOMM Internet Measurement Conference, IMC, 267–280. https://doi.org/10.1145/1879141.1879175
[6] Ferenc Bodon. 2003. KOSARAK dataset. http://fimi.uantwerpen.be/data/kosarak.dat.gz. last accessed 05.08.2022.
[7] Ferenc Bodon. 2010. UNIV1 dataset. https://pages.cs.wisc.edu/~tbenson/IMC10_Data.html. last accessed 05.08.2022.
[8] Tom Brijs. 2003. RETAIL dataset. http://fimi.uantwerpen.be/data/retail.dat.gz. last accessed 05.08.2022.
[9] CAIDA. 2018. The CAIDA UCSD Anonymized Internet Traces - Oct. 18th. http://www.caida.org/data/passive/passive_dataset.xml
[10] Moses Charikar, Kevin Chen, and Martin Farach-Colton. 2002. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming.
[11] Lior Cohen, Gil Avrahami-Bakish, Mark Last, Abraham Kandel, and Oscar Kipersztok. 2008. Real-time data mining of non-stationary data streams from sensor networks. Information Fusion 9, 3 (2008), 344–353.
[12] Graham Cormode and S. Muthukrishnan. 2005. An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. J. Algorithms 55, 1 (apr 2005), 58–75. https://doi.org/10.1016/j.jalgor.2003.12.001
[13] Ryohei Ebina, Kenji Nakamura, and Shigeru Oyanagi. 2011. A real-time burst detection method. In 2011 IEEE 23rd International Conference on Tools with Artificial Intelligence. IEEE, 1040–1046.
[14] Cristian Estan and George Varghese. 2003. New Directions in Traffic Measurement and Accounting: Focusing on the Elephants, Ignoring the Mice. ACM Transactions on Computer Systems 21, 3 (2003), 270–313.
[15] Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. 2007. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. In Discrete Mathematics and Theoretical Computer Science. Discrete Mathematics and Theoretical Computer Science, 137–156.
[16] René Gebel. 2012. KL1p - a portable C++ library for compressed sensing.
[17] Giacomo Giuliari, Dominik Roos, Marc Wyss, Juan A. Garcia-Pardo, Markus Legner, and Adrian Perrig. 2021. Colibri: A Cooperative Lightweight Inter-domain Bandwidth-Reservation Infrastructure. Proceedings of ACM CoNEXT (2021).
[18] Charles AR Hoare. 1961. Algorithm 65: find. Commun. ACM 4, 7 (1961), 321–322.
[19] Thomas Holterbach, Edgar Costa Molero, Maria Apostolaki, Alberto Dainotti, Stefano Vissicchio, and Laurent Vanbever. 2019. Blink: Fast connectivity recovery entirely in the data plane. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). 161–176.
[20] Qun Huang, Siyuan Sheng, Xiang Chen, Yungang Bao, Rui Zhang, Yanwei Xu, and Gong Zhang. 2021. Toward nearly-zero-error sketching via compressive sensing. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21). 1027–1044.
[21] Daniel M. Kane, Jelani Nelson, Ely Porat, and David P. Woodruff. 2010. Fast Moment Estimation in Data Streams in Optimal Space. CoRR abs/1007.4191 (2010). arXiv:1007.4191 http://arxiv.org/abs/1007.4191
[22] D. Katz, J. Baptista, S. P. Azen, and M. C. Pike. 1978. Obtaining Confidence Intervals for the Risk Ratio in Cohort Studies. Biometrics 34, 3 (1978), 469–474. http://www.jstor.org/stable/2530610
[23] Jon Kleinberg. 2003. Bursty and hierarchical structure in streams. Data mining and knowledge discovery 7, 4 (2003), 373–397.
[24] George Kollios, John W Byers, Jeffrey Considine, Marios Hadjieleftheriou, and Feifei Li. 2005. Robust Aggregation in Sensor Networks. IEEE Data Eng. Bull. 28, 1 (2005), 26–32.
[25] Zaoxing Liu, Ran Ben-Basat, Gil Einziger, Yaron Kassner, Vladimir Braverman, Roy Friedman, and Vyas Sekar. 2019. Nitrosketch: Robust and general sketch-based monitoring in software switches. In Proceedings of the ACM Special Interest Group on Data Communication. 334–350.
[26] Z. Liu, A. Manousis, G. Vorsanger, V. Sekar, and V. Braverman. 2016. One Sketch to Rule Them All: Rethinking Network Flow Monitoring with UnivMon. In ACM SIGCOMM. https://doi.org/10.1145/2934872.2934906
[27] Conrad Sanderson and Ryan Curtin. 2016. Armadillo: A template-based C++ library for linear algebra. Journal of Open Source Software 1 (07 2016), 26. https://doi.org/10.21105/joss.00026
[28] Conrad Sanderson and Ryan Curtin. 2020. An Adaptive Solver for Systems of Linear Equations. In 2020 14th International Conference on Signal Processing and Communication Systems (ICSPCS). IEEE. https://doi.org/10.1109/icspcs50536.2020.9309998
[29] Simon Scherrer, Che-Yu Wu, Yu-Hsi Chiang, Benjamin Rothenberger, Daniele E. Asoni, Arish Sateesan, Jo Vliegen, Nele Mentens, Hsu-Chun Hsiao, and Adrian Perrig. 2021. Low-Rate Overuse Flow Tracer (LOFT): An Efficient and Scalable Algorithm for Detecting Overuse Flows. arXiv:2102.01397 [cs.NI]
[30] Siyuan Sheng, Qun Huang, Sa Wang, and Yungang Bao. 2021. PR-Sketch: Monitoring per-Key Aggregation of Streaming Data with Nearly Full Accuracy. Proc. VLDB Endow. 14, 10 (jun 2021), 1783–1796. https://doi.org/10.14778/3467861.3467868
[31] Haibo Wang, Chaoyi Ma, Olufemi O Odegbile, Shigang Chen, and Jih-Kwon Peir. 2021. Randomized Error Removal for Online Spread Estimation in Data Streaming. Proc. VLDB Endow. 14, 6 (feb 2021), 1040–1052. https://doi.org/10.14778/3447689.3447707
[32] Chen-Chi Wu, Kuan-Ta Chen, Chun-Ying Huang, and Chin-Laung Lei. 2009. An empirical evaluation of VoIP playout buffer dimensioning in Skype, Google talk, and MSN Messenger. In Proceedings of the 18th international workshop on Network and operating systems support for digital audio and video. 97–102.
[33] Hao Wu, Hsu-Chun Hsiao, and Yih-Chun Hu. 2014. Efficient Large Flow Detection over Arbitrary Windows: An Algorithm Exact Outside an Ambiguity Region. In Proceedings of the 2014 Conference on Internet Measurement Conference (IMC). ACM, 209–222.
[34] Tong Yang, Jie Jiang, Peng Liu, Qun Huang, Junzhi Gong, Yang Zhou, Rui Miao, Xiaoming Li, and Steve Uhlig. 2018. Elastic sketch: Adaptive and fast network-wide measurements. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication. ACM, 561–575.
[35] Zheng Zhong, Shen Yan, Zikun Li, Decheng Tan, Tong Yang, and Bin Cui. 2021. BurstSketch: Finding Bursts in Data Streams. In Proceedings of the 2021 International Conference on Management of Data. 2375–2383.
