
Model Counting Meets F0 Estimation

Published: 09 August 2023

    Abstract

    Constraint satisfaction problems (CSPs) and data stream models are two powerful abstractions to capture a wide variety of problems arising in different domains of computer science. Developments in the two communities have mostly occurred independently and with little interaction between them. In this work, we seek to investigate whether bridging the seeming communication gap between the two communities may pave the way to richer fundamental insights. To this end, we focus on two foundational problems: model counting for CSPs and computation of zeroth frequency moments (F0) for data streams.
    Our investigations lead us to observe a striking similarity in the core techniques employed in the algorithmic frameworks that have evolved separately for model counting and F0 computation. We design a recipe for translating algorithms developed for F0 estimation to model counting, resulting in new algorithms for model counting. We also provide a recipe for transforming sampling algorithms over streams into constraint sampling algorithms. We then observe that algorithms in the context of distributed streaming can be transformed into distributed algorithms for model counting. We next turn our attention to viewing streaming from the lens of counting and show that framing F0 estimation as a special case of #DNF counting allows us to obtain a general recipe for a rich class of streaming problems, which had been subjected to case-specific analysis in prior works. In particular, our view yields an algorithm for multidimensional range efficient F0 estimation with a simpler analysis.

    1 Introduction

    Constraint Satisfaction Problems (CSPs) and the data stream model are two core themes in computer science with a diverse set of applications, including probabilistic reasoning, networks, databases, and verification. Model counting and computation of the zeroth frequency moment ( \(F_0\) ) are fundamental problems for CSPs and the data stream model, respectively. This article is motivated by our observation that despite the usage of similar algorithmic techniques for the two problems, the developments in the two communities have, surprisingly, evolved separately, and rarely has an article from one community been cited by the other.
    Given a set of constraints \(\varphi\) over a set of variables in a finite domain \(\mathcal {D}\) , the problem of model counting is to estimate the number of solutions of \(\varphi\) . We are often interested when \(\varphi\) is restricted to a special class of representations such as Conjunctive Normal Form (CNF) and Disjunctive Normal Form (DNF). A data stream over a domain \([N]\) is represented by \(\mathbf {a} = a_1, a_2, \ldots , a_m\) wherein each item \(a_i \subseteq [N]\) . The zeroth frequency moment, denoted as \(F_0\) , of \(\mathbf {a}\) is the number of distinct elements appearing in \(\mathbf {a}\) , i.e., \(|\cup _{i} a_i|\) (traditionally, \(a_i\) s are singletons; we will also be interested in the case when \(a_i\) s are sets). The fundamental nature of model counting and \(F_0\) computation over data streams has led to intense interest from theoreticians and practitioners alike in the respective communities for the past few decades.
    The starting point of this work is the confluence of two viewpoints. The first viewpoint contends that some of the algorithms for model counting can conceptually be thought of as operating on the stream of the solutions of the constraints. The second viewpoint contends that a stream can be viewed as a DNF formula, and the problem of \(F_0\) estimation is similar to model counting. These viewpoints make it natural to believe that algorithms developed in the streaming setting can be directly applied to model counting, and vice versa. We explore this connection and indeed, design new algorithms for model counting inspired by algorithms for estimating \(F_0\) in data streams. By exploring this connection further, we design new algorithms to estimate \(F_0\) for streaming sets that are succinctly represented by constraints. To put our contributions in context, we briefly survey the historical development of algorithmic frameworks in both model counting and \(F_0\) estimation and point out the similarities.

    Model Counting

    The complexity-theoretic study of model counting was initiated by Valiant who showed that this problem, in general, is #P-complete [66]. This motivated researchers to investigate approximate model counting and in particular achieving \((\varepsilon ,\delta)\) -approximation schemes. The complexity of approximate model counting depends on its representation. When \(\varphi\) is represented as a CNF formula, designing an efficient \((\varepsilon ,\delta)\) -approximation is NP-hard [62]. In contrast, when it is represented as a DNF formula, model counting admits an FPRAS (fully polynomial-time randomized approximation scheme) [43, 44]. We will use #CNF to refer to the case when \(\varphi\) is a CNF formula and #DNF to refer to the case when \(\varphi\) is a DNF formula.
    For #CNF, Stockmeyer [62] provided a hashing-based randomized procedure that can compute an \((\varepsilon ,\delta)\) -approximation within time polynomial in \(|\varphi |\) , \(1/\varepsilon\) , and \(\log (1/\delta)\) , given access to an NP oracle. Building on Stockmeyer’s approach and motivated by the unprecedented breakthroughs in the design of SAT solvers, researchers have proposed a series of algorithmic improvements that have allowed the hashing-based techniques for approximate model counting to scale to formulas involving hundreds of thousands of variables [2, 15, 16, 18, 26, 35, 39, 59, 60]. Practical implementations substitute the NP oracle with SAT solvers. In the context of model counting, we are primarily interested in time complexity and therefore, the number of NP queries is of key importance. The emphasis on the number of NP calls also stems from practice, as practical implementations of model counting algorithms have been shown to spend over 99% of their time in the underlying SAT calls [60].
    Karp and Luby [43] proposed the first FPRAS scheme for #DNF, which was subsequently improved in the follow-up works [25, 44]. Chakraborty, Meel, and Vardi [16] demonstrated that the hashing-based framework can be extended to #DNF, thereby providing a unified framework for both #CNF and #DNF. Meel, Shrotri, and Vardi [49, 50, 51] subsequently improved the complexity of the hashing-based approach for #DNF and observed that hashing-based techniques achieve better scalability than that of Monte Carlo techniques.

    Zeroth Frequency Moment Estimation

    Estimating \((\varepsilon ,\delta)\) -approximation of the \(k^{\rm th}\) frequency moments ( \(F_k\) ) is a central problem in the data streaming model [3]. In particular, considerable work has been done in designing algorithms for estimating the 0th frequency moment ( \(F_0\) ), the number of distinct elements in the stream. While designing streaming algorithms, the primary resource concerns are two-fold: space complexity and processing time per element. For an algorithm to be considered efficient, these should be \({\rm poly}(\log N, 1/\varepsilon)\) where N is the size of the universe.1
    The first algorithm for computing \(F_0\) with a constant factor approximation was proposed by Flajolet and Martin, who assumed the existence of hash functions with ideal properties resulting in an algorithm with undesirable space complexity [32]. In their seminal work, Alon, Matias, and Szegedy designed an \(O(\log N)\) space algorithm for \(F_0\) with a constant approximation ratio that employs 2-universal hash functions [3]. Subsequent investigations into hashing-based schemes by Gibbons and Tirthapura [34] and Bar-Yossef, Kumar, and Sivakumar [8] provided \((\varepsilon , \delta)\) -approximation algorithms with space and time complexity \(\log N \cdot {\rm poly} ({1\over \varepsilon })\) . Subsequently, Bar-Yossef et al. proposed three algorithms with improved space and time complexity [7]. While the three algorithms employ hash functions, they differ conceptually in the usage of relevant random variables for the estimation of \(F_0\) . This line of work resulted in the development of an algorithm with optimal space complexity \({O}(\log N + {1\over \varepsilon ^2})\) and \(O(\log N)\) update time [42].
    The above-mentioned works are in the setting where each data item \(a_i\) is an element of the universe. Subsequently, there has been a series of results of estimating \(F_0\) in rich scenarios with particular focus to handle the cases \(a_i \subseteq [N]\) such as a list or a multidimensional range [8, 53, 63, 65].

    The Road to a Unifying Framework

    As mentioned above, the algorithmic developments for model counting and \(F_0\) estimation have largely relied on the usage of hashing-based techniques, and yet these developments have, surprisingly, been separate, and rarely has a work from one community been cited by the other. In this context, we ask whether it is possible to bridge this gap and whether such an exercise would contribute new algorithms for model counting as well as for \(F_0\) estimation. The main conceptual contribution of this work is an affirmative answer to this question. First, we point out that two well-known algorithms, Stockmeyer’s #CNF algorithm [62] (further refined by Chakraborty et al. [16]) and Gibbons and Tirthapura’s \(F_0\) estimation algorithm [34], are essentially the same.
    The core idea of the hashing-based technique of Stockmeyer’s and Chakraborty et al.’s schemes is to use pairwise independent hash functions to partition the solution space (the satisfying assignments of a CNF formula) into roughly equal, small cells, where a cell is small if the number of solutions in it is less than a pre-computed threshold, denoted by \(\mathsf {Thresh}\) . A good estimate for the number of solutions is then the number of solutions in an arbitrary cell multiplied by the number of cells. To determine the appropriate number of cells, the solution space is iteratively partitioned as follows. At the \(m^{th}\) iteration, a hash function with range \(\lbrace 0,1\rbrace ^m\) is considered, resulting in cells \(h^{-1}(y)\) for each \(y\in \lbrace 0,1\rbrace ^m\) . An NP oracle can be employed to check whether a particular cell (for example, \(h^{-1}(0^m)\) ) is small by enumerating solutions one by one until either \(\mathsf {Thresh}+1\) solutions have been obtained or all the solutions have been exhaustively enumerated. If the cell \(h^{-1}(0^m)\) is small, then the algorithm outputs \(t\times 2^m\) as an estimate, where t is the number of solutions in the cell \(h^{-1}(0^m)\) . If the cell \(h^{-1}(0^m)\) is not small, then the algorithm moves on to the next iteration, where a hash function with range \(\lbrace 0,1\rbrace ^{m+1}\) is considered.
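The iterative cell-halving loop described above can be sketched as follows. Here an explicit solution set stands in for the NP-oracle enumeration of \(\mathsf {Sol}(\varphi)\), and a random XOR-style hash over GF(2) stands in for the pairwise independent family; both are simplifications for illustration, and the function names are ours.

```python
import random

def approx_count_core(solutions, n, thresh=8, seed=0):
    """Single run of the iterative partitioning loop (illustrative sketch).

    `solutions` is the explicit solution set over n-bit integers, standing
    in for NP-oracle enumeration of Sol(phi).
    """
    rng = random.Random(seed)
    # Random rows/offsets of an n-bit XOR hash h(x) = Ax + b over GF(2)
    # (an assumed stand-in for a pairwise independent family).
    A = [rng.getrandbits(n) for _ in range(n)]
    b = [rng.getrandbits(1) for _ in range(n)]

    def h_bit(x, i):
        # i-th output bit: parity of (A[i] AND x), offset by b[i]
        return (bin(A[i] & x).count("1") + b[i]) % 2

    def cell_members(m):
        # elements of the cell h^{-1}(0^m): first m hash bits all zero
        return [x for x in solutions if all(h_bit(x, i) == 0 for i in range(m))]

    m = 0
    while True:
        cell = cell_members(m)
        if len(cell) <= thresh or m == n:
            # cell is small: output |cell| * (number of cells)
            return len(cell) * (2 ** m)
        m += 1
```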
    We now describe Gibbons and Tirthapura’s algorithm for \(F_0\) estimation, which we call the \(\mathsf {Bucketing}\) algorithm. We will assume the universe \([N] = \lbrace 0,1\rbrace ^n\) . The algorithm maintains a bucket of size \(\mathsf {Thresh}\) and starts by picking a hash function \(h:\lbrace 0,1\rbrace ^n \rightarrow \lbrace 0,1\rbrace ^n\) . It iterates over sampling levels. At level m, when a data item x arrives, if \(h(x)\) starts with \(0^m\) , then x is added to the bucket. If the bucket overflows, then the sampling level is increased to \(m+1\) and all elements x in the bucket other than those for which \(h(x)\) starts with \(0^{m+1}\) are deleted. At the end of the stream, the value \(t\times 2^{m}\) is output as the estimate, where t is the number of elements in the bucket and m is the sampling level.
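A minimal sketch of a single run of the \(\mathsf {Bucketing}\) algorithm follows. For illustration we substitute a simple modular hash for a genuine 2-universal family (the analysis requires the latter), and the helper names are ours.

```python
import random

def bucketing_f0(stream, n=16, thresh=32, seed=0):
    """One run of the Bucketing F0 sketch over an n-bit universe."""
    rng = random.Random(seed)
    # h(x) = (a*x + b) mod 2^n -- a simple illustrative stand-in,
    # not a genuine 2-universal family.
    a = rng.randrange(1, 2 ** n, 2)
    b = rng.randrange(2 ** n)

    def h(x):
        return (a * x + b) % (2 ** n)

    def leading_zeros(v):
        # leading zeros in the n-bit representation of v
        return n - v.bit_length()

    m = 0          # current sampling level
    bucket = set()
    for x in stream:
        # keep x iff h(x) starts with 0^m
        if leading_zeros(h(x)) >= m:
            bucket.add(x)
            while len(bucket) > thresh:
                # overflow: raise the level, keep only survivors
                m += 1
                bucket = {y for y in bucket if leading_zeros(h(y)) >= m}
    return len(bucket) * (2 ** m)
```

Note that the bucket ignores duplicates (it is a set), which is exactly why the final count estimates the number of distinct elements rather than the stream length.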
    These two algorithms are conceptually the same. In the \(\mathsf {Bucketing}\) algorithm, at the sampling level m, it looks at only the first m bits of the hashed value; this is equivalent to considering a hash function with range \(\lbrace 0,1\rbrace ^m\) . Thus the bucket is nothing but all the elements in the stream that belong to the cell \(h^{-1}(0^m)\) . The final estimate is the number of elements in the bucket times the number of cells, identical to Chakraborty et al.’s algorithm. In both algorithms, to obtain an \((\varepsilon , \delta)\) approximation, the \(\mathsf {Thresh}\) value is chosen as \(O({1 \over \varepsilon ^2})\) and the median of \(O(\log {1\over \delta })\) independent estimations is output.

    Our Contributions

    Motivated by the conceptual identity between the two algorithms, we further explore the connections between algorithms for model counting and \(F_0\) estimation.
    (1)
    We formalize a recipe to transform streaming algorithms for \(F_0\) estimation to those for model counting. Such a transformation yields new \((\varepsilon , \delta)\) -approximate algorithms for model counting, which are different from currently known algorithms. We also establish a relationship between the space complexity of the streaming algorithms and the query complexity of the obtained model counting algorithms. Recent studies in the field of automated reasoning have highlighted the need for diverse approaches [69], and similar studies in the context of #DNF provided strong evidence of the power of diversity of approaches [50]. In this context, these newly obtained algorithms open up several new interesting directions of research, ranging from the development of MaxSAT solvers with native XOR support to open problems in designing FPRAS schemes.
    (2)
    The problems of counting and sampling are closely related. In particular, the seminal work of Jerrum, Valiant, and Vazirani [40] showed that the problems of approximate counting and almost-uniform sampling are inter-reducible for self-reducible NP problems. Concurrently with developments in approximate model counting, there has been significant interest in the design of efficient sampling algorithms. Building on the recipe to transform streaming algorithms into model counting algorithms, we obtain a recipe to transform \(L_0\) -sampling algorithms into constrained sampling algorithms.
    (3)
    Given the central importance of #DNF (and its weighted variant) due to a recent surge of interest in scalable techniques for provenance in probabilistic databases [56, 57], a natural question is whether one can design efficient techniques in the distributed setting. In this work, we initiate the study of distributed #DNF. We then show that the transformation recipe from \(F_0\) estimation to model counting allows us to view the problem of the design of distributed #DNF algorithms through the lens of distributed functional monitoring that is well studied in the data streaming literature.
    (4)
    Building upon the connection between model counting and \(F_0\) estimation, we design new algorithms to estimate \(F_0\) over structured set streams where each element of the stream is a (succinct representation of a) subset of the universe. Thus, the stream is \(S_1, S_2, \ldots\) where each \(S_i \subseteq [N]\) and the goal is to estimate the \(F_0\) of the stream, i.e., the size of \(\cup _{i} S_i\) . In this scenario, a traditional \(F_0\) streaming algorithm that processes each element of the set incurs high per-item processing time and is inefficient. Thus the goal is to design algorithms whose per-item time (time to process each \(S_i\) ) is poly-logarithmic in the size of the universe. Structured set streams that are considered in the literature include 1-dimensional and multidimensional ranges [53, 65]. Several interesting problems such as max-dominance norm [22], counting triangles in graphs [8], and distinct summation problem [19] can be reduced to computing \(F_0\) over such ranges.
    We observe that several structured sets can be represented as small DNF formulae and thus \(F_0\) counting over these structured set data streams can be viewed as a special case of #DNF. Using the hashing-based techniques for #DNF, we obtain a general recipe for a rich class of structured sets that include multidimensional ranges, multidimensional arithmetic progressions, and affine spaces. Prior work on single and multidimensional ranges2 had to rely on involved analysis for each of the specific instances, while our work provides a general recipe for both analysis and implementation.
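To illustrate how a structured set yields a small DNF formula, the following sketch covers a 1-dimensional range by maximal dyadic intervals: each interval fixes a prefix of the n-bit encoding and therefore corresponds to a single term (conjunction) over the bit variables. The helper name and the `*` wildcard notation for free bits are ours, for illustration only.

```python
def dyadic_cover(lo, hi, n):
    """Cover the integer range [lo, hi] within [0, 2^n - 1] by maximal
    dyadic intervals. Each returned string fixes a prefix of the n-bit
    representation ('*' marks a free bit), i.e., one DNF term.
    """
    terms = []
    while lo <= hi:
        # largest power-of-two block aligned at lo...
        size = lo & -lo if lo else 2 ** n
        # ...shrunk until it fits inside [lo, hi]
        while size > hi - lo + 1:
            size //= 2
        k = size.bit_length() - 1  # block leaves k low bits free
        prefix = format(lo >> k, "0" + str(n - k) + "b") if n > k else ""
        terms.append(prefix + "*" * k)
        lo += size
    return terms
```

For example, the range [2, 6] over 3-bit integers is covered by three terms, so a range of width w contributes only \(O(\log w)\) DNF terms.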
    Remark 1.
    This work is an extension of the work that appeared in PODS 2021 [54] and differs from it in the following ways. First, we establish, in Section 3.5, a new relationship between the space complexity of streaming algorithms and the query complexity of general model counting algorithms. Second, building on the close relationship between counting and sampling, we provide a recipe for the transformation of \(L_0\) sampling techniques to constrained sampling, thereby accomplishing the future direction stated in the conference version. Third, we provide detailed algorithmic descriptions of distributed DNF counting in Section 5.

    Organization

    We present notations and preliminaries in Section 2. We then present the transformation of \(F_0\) estimation to model counting in Section 3. In Section 4, we provide a recipe to transform \(L_0\) sampling algorithms into constrained sampling algorithms. We then focus on distributed #DNF in Section 5. In Section 6, we present the transformation of model counting algorithms to structured set streaming algorithms. We conclude in Section 7 with a discussion of future research directions.
    We would like to emphasize that the primary objective of this work is to provide a unifying framework for \(F_0\) estimation and model counting. Therefore, when designing new algorithms based on the transformation recipes, we intentionally focus on conceptually cleaner algorithms and leave potential improvements in time and space complexity for future work.

    2 Notation

    We will assume that the universe is \([N] = \lbrace 0,1\rbrace ^n\) . We write \(\Pr [\mathcal {Z}: {\Omega }]\) to denote the probability of outcome \(\mathcal {Z}\) when sampling from a probability space \({\Omega }\) . For brevity, we omit \({\Omega }\) when it is clear from the context.

    F0 Estimation.

    A data stream \(\mathbf {a}\) over domain \([N]\) can be represented as \(\mathbf {a} = a_1, a_2, \ldots , a_m\) wherein each item \(a_i \in [N]\) . Let \(\mathbf {a}_u = \cup _{i} \lbrace a_i\rbrace\) . \(F_0\) of the stream \(\mathbf {a}\) is \(|\mathbf {a}_u|\) . We are often interested in a probably approximately correct scheme that returns an \((\varepsilon ,\delta)\) -estimate c, i.e.,
    \(\begin{align*} \Pr \left[\frac{|\mathbf {a}_u|}{1+\varepsilon } \le c \le (1+\varepsilon) |\mathbf {a}_u| \right] \ge 1-\delta . \end{align*}\)

    Model Counting.

    Let \(\lbrace x_1, x_2, \ldots , x_n\rbrace\) be a set of Boolean variables. For a Boolean formula \(\varphi\) , let \(\mathsf {Vars}(\varphi)\) denote the set of variables appearing in \(\varphi\) . Throughout the article, unless otherwise stated, we will assume that the relationship \(n = |\mathsf {Vars}(\varphi)|\) holds. We denote the set of all satisfying assignments of \(\varphi\) by \(\mathsf {Sol}(\varphi)\) .
    The propositional model counting problem is to compute \(|\mathsf {Sol}(\varphi)|\) for a given formula \(\varphi\) . A probably approximately correct (or PAC) counter is a probabilistic algorithm \({\mathsf {ApproxCount}}(\cdot , \cdot ,\cdot)\) that takes as inputs a formula \(\varphi\) , a tolerance \(\varepsilon \gt 0\) , and a confidence \(\delta \in (0, 1]\) , and returns an \((\varepsilon ,\delta)\) -estimate c, i.e.,
    \(\begin{align*} \Pr \Big [\frac{|\mathsf {Sol}(\varphi)|}{1+\varepsilon } \le c \le (1+\varepsilon)|\mathsf {Sol}(\varphi)|\Big ] \ge 1-\delta . \end{align*}\)
    PAC guarantees are also sometimes referred to as \((\varepsilon ,\delta)\) -guarantees. We use #CNF (respectively, #DNF) to refer to the model counting problem when \(\varphi\) is represented as CNF (respectively, DNF).
    Given a formula \(\varphi\) , a tolerance parameter \(\varepsilon \gt 0\) , and a confidence parameter \(\delta \gt 0\) , a constrained sampler \(\mathsf {UnifSampler}\) returns \(\sigma \in \mathsf {Sol}(\varphi)\) such that
    \(\begin{align*} \forall \sigma \in \mathsf {Sol}(\varphi), \frac{(1-\varepsilon)}{|\mathsf {Sol}(\varphi)|} \le \Pr [\mathsf {UnifSampler}(\varphi ,\varepsilon ,\delta) = \sigma ] \le \frac{(1+\varepsilon)}{|\mathsf {Sol}(\varphi)|} \end{align*}\)
    The algorithm \(\mathsf {UnifSampler}\) succeeds with probability at least \(1-\delta\) .

    k-wise independent hash functions.

    Let \(n,m\in \mathbb {N}\) and \(\mathcal {H}(n,m) \triangleq \lbrace h:\lbrace 0,1\rbrace ^{n} \rightarrow \lbrace 0,1\rbrace ^m \rbrace\) be a family of hash functions mapping \(\lbrace 0,1\rbrace ^n\) to \(\lbrace 0,1\rbrace ^m\) . We use \(h \xleftarrow {R} \mathcal {H}(n,m)\) to denote the probability space obtained by choosing a function h uniformly at random from \(\mathcal {H}(n,m)\) .
    Definition 1.
    A family of hash functions \(\mathcal {H}(n,m)\) is \(k\) -wise independent if for all \(\alpha _1, \alpha _2, \ldots , \alpha _k \in \lbrace 0,1\rbrace ^m\) and all distinct \(x_1, x_2, \ldots , x_k \in \lbrace 0,1\rbrace ^n\) , for \(h \xleftarrow {R} \mathcal {H}(n,m)\) ,
    \(\begin{align} \Pr [(h(x_1) = \alpha _1) \wedge (h(x_2) = \alpha _2) \wedge \cdots \wedge (h(x_k) = \alpha _k) ] = \frac{1}{2^{km}} \end{align}\)
    (1)
    We will use \(\mathcal {H}_{\mathsf {k-wise}}(n,m)\) to refer to a \(k\) -wise independent family of hash functions mapping \(\lbrace 0,1\rbrace ^n\) to \(\lbrace 0,1\rbrace ^m\) .

    Explicit Families.

    In this work, one hash family of particular interest is \(\mathcal {H}_{\mathsf {Toeplitz}}(n,m)\) , which is known to be 2-wise independent [12]. The family is defined as follows: \(\mathcal {H}_{\mathsf {Toeplitz}}(n,m) \triangleq \lbrace h: \lbrace 0,1\rbrace ^n \rightarrow \lbrace 0,1\rbrace ^m \rbrace\) is the family of functions of the form \(h(x) = Ax+b\) , where A is a Toeplitz matrix in \(\mathbb {F}_{2}^{m \times n}\) and \(b \in \mathbb {F}_{2}^{m \times 1}\) . A matrix is Toeplitz if the entries along every diagonal (top-left to bottom-right) are the same. Another related hash family of interest is \(\mathcal {H}_{\mathsf {xor}}(n,m)\) , wherein \(h(x)\) is again of the form \(Ax+b\) , where \(A \in \mathbb {F}_{2}^{m \times n}\) and \(b \in \mathbb {F}_{2}^{m \times 1}\) . Both \(\mathcal {H}_{\mathsf {Toeplitz}}\) and \(\mathcal {H}_{\mathsf {xor}}\) are 2-wise independent, but it is worth noting that \(\mathcal {H}_{\mathsf {Toeplitz}}\) can be represented with \(\Theta (n)\) bits while \(\mathcal {H}_{\mathsf {xor}}\) requires \(\Theta (mn)\) bits. We use both families, as we rely on results from prior works that use each of them.
    For every \(\ell \in \lbrace 1, \ldots , n\rbrace\) , the \(\ell ^{th}\) prefix-slice of h, denoted \(h_{\ell }\) , is a map from \(\lbrace 0,1\rbrace ^{n}\) to \(\lbrace 0,1\rbrace ^\ell\) , where \(h_{\ell }(y)\) is the first \(\ell\) bits of \(h(y)\) . Observe that when \(h(x) = Ax+b\) , \(h_{\ell }(x) = A_{\ell }x+b_{\ell }\) , where \(A_{\ell }\) denotes the submatrix formed by the first \(\ell\) rows of A and \(b_{\ell }\) is the first \(\ell\) entries of the vector b.
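To illustrate why \(\mathcal {H}_{\mathsf {Toeplitz}}\) needs only \(\Theta (n)\) bits, the following sketch samples \(h(x) = Ax+b\) by storing just the \(m+n-1\) diagonal bits that determine the Toeplitz matrix A (plus the m bits of b), and implements the prefix-slice \(h_{\ell }\). Function names and the bit-list representation are ours, for illustration.

```python
import random

def random_toeplitz_hash(n, m, seed=0):
    """Sample h(x) = Ax + b over GF(2), with A an m x n Toeplitz matrix.

    A Toeplitz matrix is determined by its first row and first column,
    i.e., m + n - 1 bits, which is the source of the Theta(n) bound.
    """
    rng = random.Random(seed)
    diag = [rng.getrandbits(1) for _ in range(m + n - 1)]  # constant diagonals
    b = [rng.getrandbits(1) for _ in range(m)]

    def h(x_bits):
        # y_i = b_i + sum_j A[i][j] x_j, with A[i][j] = diag[i - j + n - 1]
        return [
            (b[i] + sum(diag[i - j + n - 1] & x_bits[j] for j in range(n))) % 2
            for i in range(m)
        ]

    return h

def prefix_slice(h, ell):
    """The ell-th prefix-slice h_ell: the first ell output bits of h."""
    return lambda x_bits: h(x_bits)[:ell]
```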

    3 From F0 Estimation to Counting

    As a first step, we present a unified view of the three hashing-based algorithms proposed in Bar-Yossef et al. [7]. The first algorithm is the \(\mathsf {Bucketing}\) algorithm discussed above, with the observation that instead of keeping the elements in the bucket, it suffices to keep their hashed values. Since in the context of model counting our primary concern is with time complexity, we will focus on Gibbons and Tirthapura’s \(\mathsf {Bucketing}\) algorithm in [34] rather than Bar-Yossef et al.’s modification. The second algorithm, which we call \(\mathsf {Minimum}\) , is based on the idea that if we hash all the items of the stream, then the \(\mathcal {O}(1/\varepsilon ^2)\) -th minimum of the hash values can be used to compute a good estimate of \(F_0\) . The third algorithm, which we call \(\mathsf {Estimation}\) , chooses a set of k functions, \(\lbrace h_1, h_2, \ldots , h_k\rbrace\) , such that each \(h_j\) is picked randomly from an \(\mathcal {O}(\log (1/\varepsilon))\) -independent hash family. For each hash function \(h_j\) , we say that \(h_j\) is not lonely if there exists \(a_i \in \mathbf {a}\) such that \(h_j(a_i) = 0\) . One can then estimate \(F_0\) of \(\mathbf {a}\) by estimating the number of hash functions that are not lonely.
    Algorithm 1, called \(\mathsf {ComputeF0}\) , presents the overarching architecture of the three proposed algorithms. Each of these algorithms first picks an appropriate set of hash functions H and initializes the sketch \(\mathcal {S}\) . The architecture of \(\mathsf {ComputeF0}\) is fairly simple: it chooses a collection of hash functions using \(\mathsf {ChooseHashFunctions}\) , calls the subroutine \(\mathsf {ProcessUpdate}\) for every incoming element of the stream, and invokes \(\mathsf {ComputeEst}\) at the end of the stream to return the \(F_0\) approximation.

    ChooseHashFunctions.

    As shown in Algorithm 2, the hash functions depend on the strategy being implemented. The subroutine \(\mathsf {PickHashFunctions}(\mathcal {H}, t)\) returns a collection of t independent hash functions from the family \(\mathcal {H}\) . We use H to denote the returned collection of hash functions; it is viewed as either a 1-dimensional or a 2-dimensional array. When H is a 1-dimensional array, \(H[i]\) denotes the ith hash function of the collection; when H is a 2-dimensional array, \(H[i][j]\) denotes the \((i, j)\) th hash function.

    Sketch Properties.

    For each of the three algorithms, their corresponding sketches can be viewed as arrays of size \(35\log (1/\delta)\) . The parameter \(\mathsf {Thresh}\) is set to \(96/\varepsilon ^2\) .
    \(\mathsf {Bucketing}\) The element \(\mathcal {S}[i]\) is a tuple \(\langle \ell _i, m_i\rangle\) where \(\ell _i\) is a list of size at most \(\mathsf {Thresh}\) , where \(\ell _i = \lbrace x \in \mathbf {a}\mid H[i]_{m_i}(x)= 0^{m_i}\rbrace\) . We use \(\mathcal {S}[i](0)\) to denote \(\ell _i\) and \(\mathcal {S}[i](1)\) to denote \(m_i\) .
    \(\mathsf {Minimum}\) The element \(\mathcal {S}[i]\) holds a set of size \(\mathsf {Thresh}\) . This set is the \(\mathsf {Thresh}\) many lexicographically smallest elements of \(\lbrace H[i](x)~|~x \in \mathbf {a}\rbrace\) . This sketch is also known as K-Minimum Value Sketch (KMV Sketch) [10].
    \(\mathsf {Estimation}\) The element \(\mathcal {S}[i]\) holds a tuple of size \(\mathsf {Thresh}\) . The \(j\) th entry of this tuple is the largest number of trailing zeros in any element of \(H[i,j](\mathbf {a})\) .

    ProcessUpdate.

    For a new item x, the update of \(\mathcal {S}\) , as shown in Algorithm 3, is as follows:
    Bucketing
    For a new item x, if \(H[i]_{m_i}(x) = 0^{m_i}\) , then we add it to \(\mathcal {S}[i]\) if x is not already present in \(\mathcal {S}[i]\) . If the size of \(\mathcal {S}[i]\) is greater than \(\mathsf {Thresh}\) (which is set to be \(\mathcal {O}(1/\varepsilon ^2)\) ), then we increment \(m_i\) as in line 8 of Algorithm 3.
    Minimum
    For a new item x, if \(H[i](x)\) is smaller than \(\max {\mathcal {S}[i]}\) , then we replace \(\max {\mathcal {S}[i]}\) with \(H[i](x)\) .
    Estimation
    For a new item x, compute \(z = {\mathsf {TrailZero}(H[i,j](x))}\) , i.e., the number of trailing zeros in \(H[i,j](x)\) , and replace \(\mathcal {S}[i,j]\) with z if z is larger than \(\mathcal {S}[i,j]\) .
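The \(\mathsf {Minimum}\) and \(\mathsf {Estimation}\) updates can be sketched directly from the descriptions above (the \(\mathsf {Bucketing}\) update mirrors the streaming description in the previous section). Helper names are ours, and a production KMV sketch would use a heap rather than recomputing the maximum on every update.

```python
def update_minimum(sketch, hx, thresh):
    """Minimum: keep the `thresh` lexicographically smallest hash values
    seen so far (the KMV sketch), stored here as a set of integers."""
    if len(sketch) < thresh:
        sketch.add(hx)
    elif hx < max(sketch) and hx not in sketch:
        sketch.remove(max(sketch))
        sketch.add(hx)
    return sketch

def trail_zero(v, width):
    """Number of trailing zeros in the width-bit representation of v."""
    if v == 0:
        return width
    z = 0
    while v % 2 == 0:
        v //= 2
        z += 1
    return z

def update_estimation(sketch, i, j, hx, width):
    """Estimation: track the largest trailing-zero count per hash function."""
    sketch[i][j] = max(sketch[i][j], trail_zero(hx, width))
    return sketch
```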

    ComputeEst.

    Finally, for each of the algorithms, we estimate \(F_0\) based on the sketch \(\mathcal {S}\) as described in the subroutine \(\mathsf {ComputeEst}\) presented as Algorithm 4. It is crucial to note that the estimation of \(F_0\) is performed solely using the sketch \(\mathcal {S}\) for the Bucketing and Minimum algorithms. The Estimation algorithm requires an additional parameter r that depends on a loose estimate of \(F_0\) ; we defer details to Section 3.4.

    3.1 A Recipe For Transformation

    Observe that for each of the algorithms, the final \(F_0\) estimate depends only on the sketch \(\mathcal {S}\) . Therefore, for two streams \(\mathbf {a}\) and \(\hat{\mathbf {a}}\) whose corresponding sketches \(\mathcal {S}\) and \(\hat{\mathcal {S}}\) are equivalent, the three schemes presented above return the same estimates. The recipe for transforming streaming algorithms into model counting algorithms is based on the following insight:
    (1)
    Capture the relationship \(\mathcal {P} (\mathcal {S}, H, \mathbf {a}_{u})\) between the sketch \(\mathcal {S}\) , set of hash functions H, and set \(\mathbf {a}_{u}\) at the end of stream. Recall that \(\mathbf {a}_{u}\) is the set of all distinct elements of the stream \(\mathbf {a}\) .
    (2)
    The formula \(\varphi\) is viewed as a symbolic representation of the unique set \(\mathbf {a}_{u}\) represented by the stream \(\mathbf {a}\) such that \(\mathsf {Sol}(\varphi) = \mathbf {a}_{u}\) .
    (3)
    Given a formula \(\varphi\) and set of hash functions H, design an algorithm to construct sketch \(\mathcal {S}\) such that \(\mathcal {P} (\mathcal {S}, H, \mathsf {Sol}(\varphi))\) holds. And now, we can estimate \(|\mathsf {Sol}(\varphi)|\) from \(\mathcal {S}\) .
    In the rest of this section, we will apply the above recipe to the three types of \(F_0\) estimation algorithms and derive corresponding model counting algorithms. In particular, we show how applying the above recipe to the \(\mathsf {Bucketing}\) algorithm leads us to reproduce the state-of-the-art hashing-based model counting algorithm, \(\mathsf {ApproxMC}\) , proposed by Chakraborty et al. [16]. Applying the above recipe to \(\mathsf {Minimum}\) and \(\mathsf {Estimation}\) allows us to obtain fundamentally different schemes. In particular, we observe that while model counting algorithms based on \(\mathsf {Bucketing}\) and \(\mathsf {Minimum}\) provide an FPRAS when \(\varphi\) is a DNF formula, such is not the case for the algorithm derived from \(\mathsf {Estimation}\) .

    3.2 Bucketing-based Algorithm

    The \(\mathsf {Bucketing}\) algorithm chooses a set H of pairwise independent hash functions and maintains a sketch \(\mathcal {S}\) that we describe below. Here we use \(\mathcal {H}_{\mathsf {Toeplitz}}\) as our choice of pairwise independent hash family. The sketch \(\mathcal {S}\) is an array where each \(\mathcal {S}[i]\) is of the form \(\langle c_i, m_i\rangle\) . We say that the relation \(\mathcal {P}_1 (\mathcal {S}, H, \mathbf {a}_u)\) holds if
    (1)
    \(|\mathbf {a}_{u} \cap \lbrace x~|~H[i]_{m_i-1}(x) = 0^{m_i-1}\rbrace | \ge \frac{96}{\varepsilon ^2}\)
    (2)
    \(c_i = |\mathbf {a}_{u} \cap \lbrace x~|~H[i]_{m_i}(x) = 0^{m_i}\rbrace | \lt \frac{96}{\varepsilon ^2}\)
    The following lemma, due to Bar-Yossef et al. [7] and Gibbons and Tirthapura [34], captures the relationship among the sketch \(\mathcal {S}\) , the relation \(\mathcal {P}_1\) , and the number of distinct elements of a multiset.
    Lemma 1 ([7, 34]).
    Let \(\mathbf {a} \subseteq \lbrace 0,1\rbrace ^n\) be a multiset and \(H \subseteq \mathcal {H}_{\mathsf {Toeplitz}}(n,3n)\) with \(|H|=O(\log 1/\delta)\) , where each \(H[i]\) is independently drawn from \(\mathcal {H}_{\mathsf {Toeplitz}}(n, 3n)\) , and let \(\mathcal {S}\) be such that \(\mathcal {P}_1 (\mathcal {S}, H, \mathbf {a}_u)\) holds. Let \(c = \mbox{Median }\lbrace c_i \times 2^{m_i}\rbrace _i\) . Then
    \(\begin{equation*} \Pr \left[ \frac{|\mathbf {a}_u|}{(1+\varepsilon)} \le c \le (1+\varepsilon)|\mathbf {a}_u|\right]\ge 1- \delta . \end{equation*}\)
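To make the \(\mathsf {Bucketing}\) strategy concrete, the following minimal Python sketch processes a multiset offline (rather than in one streaming pass) and uses a simple modular hash as a stand-in for the Toeplitz family; the function names and the hash construction are our illustrative choices, not part of [7, 34].

```python
import random
import statistics

def make_hash(n_bits, seed):
    # Stand-in for a pairwise independent hash: (a*x + b) mod p,
    # truncated to n_bits, where p is the Mersenne prime 2^61 - 1.
    rng = random.Random(seed)
    p = (1 << 61) - 1
    a, b = rng.randrange(1, p), rng.randrange(p)
    return lambda x: ((a * x + b) % p) & ((1 << n_bits) - 1)

def trailing_zeros(v, width):
    # Number of trailing zero bits of a width-bit value.
    t = 0
    while t < width and (v >> t) & 1 == 0:
        t += 1
    return t

def bucketing_f0(stream, n_bits=32, eps=0.5, t=5):
    # For each hash, raise the level m_i until fewer than Thresh = 96/eps^2
    # distinct items hash to a value ending in 0^{m_i} (property P_1),
    # then estimate c_i * 2^{m_i}; the final answer is the median.
    thresh = int(96 / eps ** 2)
    estimates = []
    for i in range(t):
        h = make_hash(n_bits, seed=i)
        m, bucket = 0, set(stream)
        while len(bucket) >= thresh and m < n_bits:
            m += 1
            bucket = {x for x in bucket if trailing_zeros(h(x), n_bits) >= m}
        estimates.append(len(bucket) * 2 ** m)
    return statistics.median(estimates)
```

When the number of distinct elements is below \(\mathsf {Thresh}\) , the loop never fires and the count is exact; for instance, `bucketing_f0([x % 100 for x in range(300)])` returns 100.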
    To design an algorithm for model counting, based on the bucketing strategy, we turn to the subroutine introduced by Chakraborty, Meel, and Vardi: \(\mathsf {BoundedSAT}\) , whose properties are formalized as follows:
    Proposition 1 ([15, 16]).
There is an algorithm \(\mathsf {BoundedSAT}\) that takes a formula \(\varphi\) over n variables, a hash function \(h \in \mathcal {H}_{\mathsf {Toeplitz}}(n, m)\) , and a number p as inputs, and returns \(\min (p, |\mathsf {Sol}(\varphi \wedge h(x) = {0^m})|)\) . If \(\varphi\) is a CNF formula, then \(\mathsf {BoundedSAT}\) makes \(\mathcal {O}(p)\) calls to an NP oracle. If \(\varphi\) is a DNF formula with k terms, then \(\mathsf {BoundedSAT}\) takes \(\mathcal {O}(n^3 \cdot k \cdot p)\) time.
Equipped with Proposition 1, we now turn to designing an algorithm for model counting based on the Bucketing strategy. The algorithm follows its streaming counterpart: \(m_i\) is iteratively incremented until the number of solutions of the formula \(\varphi \wedge H[i]_{m_i}(x) = 0^{m_i}\) is less than \(\mathsf {Thresh}\) . Interestingly, an approximate model counting algorithm based on the bucketing strategy, called \(\mathsf {ApproxMC}\) , was discovered independently by Chakraborty et al. [15] in 2013. We reproduce an adaptation of \(\mathsf {ApproxMC}\) in Algorithm 5 to showcase how \(\mathsf {ApproxMC}\) can be viewed as a transformation of the \(\mathsf {Bucketing}\) algorithm. In the spirit of \(\mathsf {Bucketing}\) , \(\mathsf {ApproxMC}\) seeks to construct a sketch \(\mathcal {S}\) of size \(t \in \mathcal {O}(\log (1/\delta))\) . To this end, in every iteration of the loop, we continue to increment the value of \(m_i\) until the conditions specified by the relation \(\mathcal {P}_1 (\mathcal {S}, H, \mathsf {Sol}(\varphi))\) are met. For every iteration i, the estimate of the model count is \(c_i \times 2^{m_i}\) . The final estimate of the model count is simply the median of the estimates over all iterations. Since in the context of model counting we are concerned with time complexity, the choice of hash family is immaterial here: both \(\mathcal {H}_{\mathsf {Toeplitz}}\) and \(\mathcal {H}_{\mathsf {xor}}\) lead to the same time complexity. Furthermore, Chakraborty et al. [14] observed no difference in empirical runtime behavior due to \(\mathcal {H}_{\mathsf {Toeplitz}}\) and \(\mathcal {H}_{\mathsf {xor}}\) .
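The transformation can be conveyed in a few lines of Python. The sketch below replaces the NP oracle with brute-force enumeration (feasible only for toy n) and uses random XOR constraints for the hash; `bounded_sat`, `xor_hash`, and `approx_mc` are our illustrative names, not the paper's implementation.

```python
import random
import statistics
from itertools import product

def xor_hash(n, m, seed):
    # m random parity constraints over n variables (H_xor-style);
    # returns an m-bit tuple. Row entry n is the affine constant.
    rng = random.Random(seed)
    rows = [[rng.randrange(2) for _ in range(n + 1)] for _ in range(m)]
    return lambda x: tuple(
        (row[n] + sum(r * b for r, b in zip(row, x))) % 2 for row in rows)

def bounded_sat(phi, n, h, m, p):
    # Brute-force stand-in for BoundedSAT: count, up to the cutoff p,
    # the assignments satisfying phi whose hash starts with 0^m.
    count = 0
    for x in product((0, 1), repeat=n):
        if phi(x) and not any(h(x)[:m]):
            count += 1
            if count == p:
                break
    return count

def approx_mc(phi, n, eps=0.8, t=5):
    thresh = int(96 / eps ** 2)
    estimates = []
    for i in range(t):
        h = xor_hash(n, n, seed=i)
        m = 0
        c = bounded_sat(phi, n, h, m, thresh)
        while c >= thresh and m < n:   # increment m_i until P_1 holds
            m += 1
            c = bounded_sat(phi, n, h, m, thresh)
        estimates.append(c * 2 ** m)
    return statistics.median(estimates)
```

For a formula with fewer than \(\mathsf {Thresh}\) solutions, no hashing is needed and the count is exact: `approx_mc(lambda x: x[0] == 1 and x[1] == 1, 6)` returns 16.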
    The following theorem establishes the correctness of \(\mathsf {ApproxMC}\) , and the proof follows from Lemma 1 and Proposition 1.
    Theorem 2.
    Given a formula \(\varphi\) , \(\varepsilon\) , and \(\delta\) , \(\mathsf {ApproxMC}\) returns Est such that \(\Pr [ \frac{|\mathsf {Sol}(\varphi)|}{1+\varepsilon } \le Est \le (1+\varepsilon)|\mathsf {Sol}(\varphi)|] \ge 1- \delta\) . If \(\varphi\) is a CNF formula, then this algorithm makes \(\mathcal {O}(n \cdot \frac{1}{\varepsilon ^2} \log (1/\delta))\) calls to NP oracle. If \(\varphi\) is a DNF formula then \(\mathsf {ApproxMC}\) is an FPRAS. In particular, for a DNF formula with k terms, \(\mathsf {ApproxMC}\) takes \(\mathcal {O}(n^4 \cdot k \cdot \frac{1}{\varepsilon ^2} \cdot \log (1/\delta))\) time.

    Further Optimizations.

We now discuss how the setting of model counting allows for further optimizations. Observe that for all i, \(\mathsf {Sol}(\varphi \wedge (H[i]_{m_i-1})(x) = 0^{m_i-1}) \supseteq \mathsf {Sol}(\varphi \wedge (H[i]_{m_i})(x) = 0^{m_i})\) . Since we seek the value of \(m_i\) such that \(|\mathsf {Sol}(\varphi \wedge (H[i]_{m_i-1})(x) = 0^{m_i-1}) | \ge \frac{96}{\varepsilon ^2}\) and \(|\mathsf {Sol}(\varphi \wedge (H[i]_{m_i})(x) = 0^{m_i}) | \lt \frac{96}{\varepsilon ^2}\) , we can perform a binary search for \(m_i\) instead of the linear search performed in lines 8–10. Indeed, this observation was at the core of Chakraborty et al.'s follow-up work [16], which proposed ApproxMC2, thereby reducing the number of calls to the NP oracle from \(\mathcal {O}(n \cdot \frac{1}{\varepsilon ^2} \log (1/\delta))\) to \(\mathcal {O}(\log n \cdot \frac{1}{\varepsilon ^2} \log (1/\delta))\) . Furthermore, the reduction in NP oracle calls led to significant runtime improvements in practice. It is worth noting that \(\mathsf {ApproxMC2}\) , used as an FPRAS for DNF, has been shown to be more runtime efficient than the alternatives based on Monte Carlo methods [49, 50, 51].

    3.3 Minimum-based Algorithm

For a given multiset \(\mathbf {a}\) (e.g., a data stream or the set of solutions of a formula), we now specify the property \(\mathcal {P}_2(\mathcal {S}, H, \mathbf {a}_{u})\) . The sketch \(\mathcal {S}\) is an array of sets indexed by members of H, where \(\mathcal {S}[i]\) holds the p lexicographically smallest elements of \(H[i](\mathbf {a}_u)\) , with \(p = \min (\frac{96}{\varepsilon ^2}, |\mathbf {a}_{u}|)\) . More formally, the relationship \(\mathcal {P}_2\) holds if the following conditions are met.
    (1)
    \(\forall i, |\mathcal {S}[i]| = \min (\frac{96}{\varepsilon ^2}, |\mathbf {a}_{u}|)\)
    (2)
    \(\forall i, \forall y \notin \mathcal {S}[i], \forall y^{\prime } \in \mathcal {S}[i] \text{ it holds that } H[i](y^{\prime }) \preceq H[i](y)\)
Here, \(\preceq\) is the natural lexicographic order among the strings. The following lemma due to Bar-Yossef et al. [7] establishes the relationship between the property \(\mathcal {P}_2\) and the number of distinct elements of a multiset. Let \(\max (S_i)\) denote the largest element of the set \(S_i\) .
    Lemma 2 ([7]).
Let \(\mathbf {a} \subseteq \lbrace 0,1\rbrace ^n\) be a multiset and \(H \subseteq \mathcal {H}_{\mathsf {Toeplitz}}(n,3n)\) , where each \(H[i]\) is independently drawn from \(\mathcal {H}_{\mathsf {Toeplitz}}(n, 3n)\) such that \(|H|=O(\log 1/\delta)\) . Let \(\mathcal {S}\) be such that \(\mathcal {P}_2 (\mathcal {S}, H, \mathbf {a}_u)\) holds. Let \(c = \mbox{Median }\lbrace {p\cdot 2^{3n} \over \max (S[i])}\rbrace _i\) . Then
    \(\begin{equation*} \Pr \left[ \frac{|\mathbf {a}_u|}{(1+\varepsilon)} \le c \le (1+\varepsilon)|\mathbf {a}_u|\right]\ge 1- \delta . \end{equation*}\)
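The Minimum strategy admits a similarly short Python sketch. A modular hash (injective modulo a prime) stands in for the Toeplitz family, and the prime modulus plays the role of the hash range size in the estimator; all names here are our illustrative choices.

```python
import random
import statistics

def minimum_f0(stream, eps=0.5, t=5, seed=0):
    # Per hash: keep the p smallest hash values of the distinct items
    # (property P_2) and estimate p * |range| / max(kept values).
    p_range = (1 << 61) - 1          # prime modulus; a*x + b is injective mod p
    p = int(96 / eps ** 2)
    distinct = set(stream)
    estimates = []
    rng = random.Random(seed)
    for _ in range(t):
        a, b = rng.randrange(1, p_range), rng.randrange(p_range)
        smallest = sorted((a * x + b) % p_range for x in distinct)[:p]
        if len(smallest) < p:        # sketch holds everything: exact count
            estimates.append(len(smallest))
        else:
            estimates.append(p * p_range // max(smallest))
    return statistics.median(estimates)
```

As in the lemma, when \(|\mathbf {a}_u| \lt p\) the sketch stores every hashed element and the count is exact; otherwise the p-th smallest hash value calibrates the estimate.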
    Therefore, we can transform the \(\mathsf {Minimum}\) algorithm for \(F_0\) estimation to that of model counting given access to a subroutine that can compute \(\mathcal {S}\) such that \(\mathcal {P}_2(\mathcal {S}, H, \mathsf {Sol}(\varphi))\) holds true. The following proposition establishes the existence and complexity of such a subroutine, called \(\mathsf {FindMin}\) :
    Proposition 2.
    There is an algorithm \(\mathsf {FindMin}\) that, given \(\varphi\) over n variables, \(h \in \mathcal {H}_{\mathsf {Toeplitz}}(n,m)\) , and p as input, returns a set, \(\mathcal {B} \subseteq h(\mathsf {Sol}(\varphi))\) so that if \(|h(\mathsf {Sol}(\varphi))| \le p\) , then \(\mathcal {B} =h(\mathsf {Sol}(\varphi))\) , otherwise \(\mathcal {B}\) is the p lexicographically minimum elements of \(h(\mathsf {Sol}(\varphi))\) . Moreover, if \(\varphi\) is a CNF formula, then \(\mathsf {FindMin}\) makes \(\mathcal {O}(p\cdot m)\) calls to an NP oracle, and if \(\varphi\) is a DNF formula with k terms, then \(\mathsf {FindMin}\) takes \(\mathcal {O}(m^3 \cdot n \cdot k \cdot p)\) time.
    Equipped with Proposition 2, we are now ready to present the algorithm for model counting, which we call \(\mathsf {ApproxModelCountMin}\) . Since the complexity of \(\mathsf {FindMin}\) is PTIME when \(\varphi\) is in DNF, we have \(\mathsf {ApproxModelCountMin}\) as an FPRAS for DNF formulas.
    Theorem 3.
    Given \(\varphi\) , \(\varepsilon , \delta\) , \(\mathsf {ApproxModelCountMin}\) returns c such that
\(\begin{equation*} \Pr \left(\frac{|\mathsf {Sol}(\varphi)|}{1+\varepsilon } \le c \le (1+\varepsilon)|\mathsf {Sol}(\varphi)|\right) \ge 1- \delta . \end{equation*}\)
    If \(\varphi\) is a CNF formula, then \(\mathsf {ApproxModelCountMin}\) is a polynomial-time algorithm that makes \(\mathcal {O}(\frac{1}{\varepsilon ^2} n \log (\frac{1}{\delta }))\) calls to NP oracle. If \(\varphi\) is a DNF formula, then \(\mathsf {ApproxModelCountMin}\) is an FPRAS.

    Implementing the Min-based Algorithm.

    We now give a proof of Proposition 2 by giving an implementation of \(\mathsf {FindMin}\) subroutine.
    Proof.
    We first present the algorithm when the formula \(\varphi\) is a DNF formula. Adapting the algorithm for the case of CNF can be done by using similar ideas.
Let \(\varphi = T_1\vee T_2\vee \cdots \vee T_k\) be a DNF formula over n variables, where each \(T_i\) is a term. Let \(h:\lbrace 0,1\rbrace ^n\rightarrow \lbrace 0,1\rbrace ^{m}\) be a linear hash function in \(\mathcal {H}_{\mathsf {Toeplitz}}(n,m)\) defined by an \(m\times n\) binary matrix A. Let \(\mathcal {C}\) be the set of hashed values of the satisfying assignments for \(\varphi\) : \(\mathcal {C} = \lbrace h(x) \mid x \models \varphi \rbrace \subseteq \lbrace 0,1\rbrace ^m\) . Let \(\mathcal {C}_{p}\) be the first p elements of \(\mathcal {C}\) in the lexicographic order. Our goal is to compute \(\mathcal {C}_{p}\) .
    We will give an algorithm with running time \(O(m^3np)\) to compute \(\mathcal {C}_p\) when the formula is just a term T. Using this algorithm we can compute \(\mathcal {C}_p\) for a formula with k terms by iteratively merging \(\mathcal {C}_p\) for each term. The time complexity increases by a factor of k, resulting in an \(O(m^3nkp)\) time algorithm.
Let T be a term with width w (number of literals) and \(\mathcal {C} = \lbrace Ax \mid x \models T\rbrace\) . By fixing the variables in T, we get a vector \(b_T\) and an \(m \times (n-w)\) matrix \(A_T\) so that \(\mathcal {C} = \lbrace A_Tx + b_T \mid x\in \lbrace 0,1\rbrace ^{(n-w)}\rbrace\) . Both \(A_T\) and \(b_T\) can be computed from A and T in linear time. Let \(h_T(x)\) denote the transformation \(A_Tx + b_T\) .
    We will compute \(\mathcal {C}_p\) (p lexicographically minimum elements in \(\mathcal {C}\) ) iteratively as follows: assuming we have computed \((q-1)^{th}\) minimum of \(\mathcal {C}\) , we will compute \(q^{th}\) minimum using a prefix-searching strategy. We will use a subroutine to solve the following basic prefix-searching primitive: Given any l bit string \(y_1\ldots y_l\) , is there an \(x\in \lbrace 0,1\rbrace ^{n-w}\) so that \(y_1\ldots y_l\) is a prefix for some string in \(\lbrace h_T(x)\rbrace\) ? This task can be performed using Gaussian elimination over an \((l+1)\times (n-w)\) binary matrix and can be implemented in time \(O(l^2(n-w))\) .
Let \(y=y_1\ldots y_m\) be the \((q-1)^{th}\) minimum in \(\mathcal {C}\) . Let r be the position of the rightmost 0 of y. Then, using the above-mentioned procedure, we can find the lexicographically smallest string in the range of \(h_T\) that extends \(y_1\ldots y_{(r-1)}1\) , if it exists. If no such string exists in \(\mathcal {C}\) , move to the position of the next 0 of y to the left and repeat the procedure. In this manner the \(q^{th}\) minimum can be computed using \(O(m)\) calls to the prefix-searching primitive, resulting in an \(O(m^3n)\) time algorithm. Invoking the above procedure p times results in an algorithm to compute \(\mathcal {C}_p\) in \(O(m^3np)\) time.
    If \(\varphi\) is a CNF formula, we can employ the same prefix-searching strategy. Consider the following NP oracle: \(O=\lbrace \langle \varphi , h, y, y^{\prime } \rangle \mid \exists x, \exists y^{\prime \prime }, \mbox{ so that } x \models \varphi , y^{\prime }y^{\prime \prime } \gt y, h(x) = y^{\prime }y^{\prime \prime } \rbrace\) . With m calls to O, we can compute the lexicographically smallest string in \(\mathcal {C}\) that is greater than y. So with \(p\cdot m\) calls to O, we can compute \(\mathcal {C}_p\) .□
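For the single-term case, the prefix-searching primitive and the successor computation can be sketched in Python. GF(2) equations are represented as `(mask, bit)` pairs over integer bitmasks; this representation and the helper names are our choices, and for clarity the feasibility check redoes Gaussian elimination from scratch on each call, matching the \(O(l^2(n-w))\) cost per query stated above.

```python
def solvable(eqs):
    # Gaussian elimination over GF(2); eqs are (mask, c) pairs meaning
    # parity(mask & x) == c for an unknown bit vector x.
    basis = {}
    for mask, c in eqs:
        while mask:
            piv = mask.bit_length() - 1
            if piv not in basis:
                basis[piv] = (mask, c)
                break
            bm, bc = basis[piv]
            mask, c = mask ^ bm, c ^ bc
        if mask == 0 and c == 1:
            return False           # inconsistent: derived 0 == 1
    return True

def feasible(A, b, prefix):
    # The prefix-searching primitive: does some A x + b start with prefix?
    return solvable([(A[j], b[j] ^ prefix[j]) for j in range(len(prefix))])

def lex_smallest_extension(A, b, prefix, m):
    # Greedily extend prefix to the lex-smallest m-bit string in {Ax + b}.
    if not feasible(A, b, prefix):
        return None
    bits = list(prefix)
    while len(bits) < m:           # prefer a 0 at each position
        bits.append(0 if feasible(A, b, bits + [0]) else 1)
    return bits

def next_min(A, b, y, m):
    # Successor of y in the range: flip each 0 of y, rightmost first.
    for r in range(m - 1, -1, -1):
        if y[r] == 0:
            ext = lex_smallest_extension(A, b, y[:r] + [1], m)
            if ext is not None:
                return ext
    return None

def find_min_p(A, b, m, p):
    # C_p for one term: the p lexicographically smallest strings in {Ax + b}.
    out, cur = [], lex_smallest_extension(A, b, [], m)
    while cur is not None and len(out) < p:
        out.append(cur)
        cur = next_min(A, b, cur, m)
    return out
```

For example, with rows `A = [0b01, 0b10, 0b11]` (picking \(x_0\) , \(x_1\) , and \(x_0 \oplus x_1\) ) and offset `b = [0, 0, 1]`, the range is \(\lbrace 001, 010, 100, 111\rbrace\) and `find_min_p(A, b, 3, 2)` returns the two smallest.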

    Further Optimizations.

As mentioned in Section 1, the problem of model counting has witnessed significant interest from practitioners owing to its practical applications. The recent developments have been fueled by breakthrough progress in the design of SAT solvers, which enables replacing calls to NP oracles with calls to SAT solvers in practice. Motivated by the progress in SAT solving, there has been significant interest in the design of efficient algorithmic frameworks for related problems such as MaxSAT and its variants. The state-of-the-art MaxSAT solvers are based on sophisticated strategies such as implicit hitting sets. Such solvers have been shown to significantly outperform algorithms based on merely invoking a SAT solver iteratively. Of particular interest to us is the recent progress in the design of MaxSAT solvers to handle lexicographic objective functions. In this context, it is worth remarking that we expect a practical implementation of \(\mathsf {FindMin}\) to invoke a MaxSAT solver \(\mathcal {O}(p)\) times, as practical solvers also provide a witness (i.e., an assignment to the variables) that achieves the optimal value.

    3.4 Estimation-based Algorithm

    We now adapt the \(\mathsf {Estimation}\) algorithm to model counting. For a given stream \(\mathbf {a}\) and chosen hash functions H, the sketch \(\mathcal {S}\) corresponding to the estimation-based algorithm satisfies the following relation \(\mathcal {P}_3(\mathcal {S}, H, \mathbf {a}_u)\) :
\(\begin{align*} \mathcal {P}_{3}(\mathcal {S}, H, \mathbf {a}_u) := (S[i,j] = \max _{x \in \mathbf {a}_{u}} \mathsf {TrailZero}(H[i,j](x))), \end{align*}\)
where \(\mathsf {TrailZero}(z)\) denotes the length of the longest all-zero suffix of z. Bar-Yossef et al. [7] show the following relationship between the property \(\mathcal {P}_3\) and \(F_0\) .
    Lemma 3 ([7]).
Let \(\mathbf {a} \subseteq \lbrace 0,1\rbrace ^n\) be a multiset. For \(i \in [T]\) and \(j \in [M]\) , suppose \(H[i,j]\) is drawn independently from \(\mathcal {H}_{s-{\rm wise}}(n,n)\) where \(s = O(\log (1/\varepsilon))\) , \(T = O(\log (1/\delta))\) , and \(M = O(1/\varepsilon ^2)\) . Let H denote the collection of these hash functions. Suppose \(\mathcal {S}\) satisfies \(\mathcal {P}_3(\mathcal {S}, H, \mathbf {a}_u)\) . For any integer r, define
\(\begin{equation*} c_r = \mbox{Median }\left\lbrace \frac{\ln (1-\rho _{i,r})}{\ln (1-2^{-r})}\right\rbrace _i, \quad \mbox{where } \rho _{i,r} = \frac{|\lbrace j \mid S[i,j] \ge r\rbrace |}{M}. \end{equation*}\)
Then, if \(2F_0 \le 2^r \le 50F_0\) :
    \(\begin{equation*} \Pr \left[(1-\varepsilon)F_0 \le c_r \le (1+\varepsilon)F_0\right] \ge 1-\delta . \end{equation*}\)
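A hedged sketch of the selection step: given a sketch satisfying \(\mathcal {P}_3\) , one standard estimator (consistent with [7], though the exact constants there may differ) inverts \(\Pr [S[i,j] \ge r] = 1 - (1-2^{-r})^{F_0}\) . The function names are our choices.

```python
import math
import statistics

def trail_zero(v, width):
    # Length of the longest all-zero suffix of a width-bit value.
    t = 0
    while t < width and (v >> t) & 1 == 0:
        t += 1
    return t

def estimate_f0(sketch, r):
    # sketch[i][j] = max_x TrailZero(H[i,j](x))  (property P_3).
    # rho estimates Pr[S[i,j] >= r] = 1 - (1 - 2^-r)^F0; solve for F0
    # within each row, then take the median across rows.
    per_row = []
    for row in sketch:
        rho = sum(1 for v in row if v >= r) / len(row)
        if rho < 1.0:              # a saturated row carries no information
            per_row.append(math.log(1 - rho) / math.log(1 - 2.0 ** (-r)))
    return statistics.median(per_row)
```

For instance, with \(r = 1\) a row in which exactly half the columns reach one trailing zero yields \(\ln (0.5)/\ln (0.5) = 1\) , i.e., a single distinct element.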
    Following the recipe outlined above, we can transform an \(F_0\) streaming algorithm to a model counting algorithm by designing a subroutine that can compute the sketch for the set of all solutions described by \(\varphi\) and a subroutine to find r. The following proposition achieves the first objective for CNF formulas using a small number of calls to an NP oracle:
    Proposition 3.
There is an algorithm \(\mathsf {FindMaxRange}\) that, given \(\varphi\) over n variables and a hash function \(h \in \mathcal {H}_{s-{\rm wise}}(n,n)\) , returns t such that
    (1)
    \(\exists z, z \models \varphi\) and \(h(z)\) has t least significant bits equal to zero.
    (2)
for every \(z \models \varphi\) , \(h(z)\) has at most t least significant bits equal to zero.
    If \(\varphi\) is a CNF formula, then \(\mathsf {FindMaxRange}\) makes \(\mathcal {O}(\log n)\) calls to an NP oracle.
    Proof.
Consider an NP oracle \(O= \lbrace \langle \varphi , h, t\rangle \mid \exists x, \exists y, x \models \varphi , h(x) = y0^t\rbrace\) . Note that h can be implemented as a degree-s polynomial \(h: \mathbb {F}_{2^n} \rightarrow \mathbb {F}_{2^n}\) , so that \(h(x)\) can be evaluated in polynomial time. A binary search, requiring \(O(\log n)\) calls to O, suffices to find the largest value of t for which \(\langle \varphi , h, t\rangle\) belongs to O.□
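The binary search in the proof can be written generically, with the NP oracle abstracted as a hypothetical monotone predicate `oracle(t)` ("some satisfying assignment's hash ends in \(0^t\) "):

```python
def find_max_range(oracle, n):
    # Largest t in [0, n] with oracle(t) true. Assumes monotonicity,
    # oracle(t) implies oracle(t-1), and that oracle(0) holds (it does
    # whenever phi is satisfiable). Makes O(log n) oracle calls.
    if oracle(n):
        return n
    lo, hi = 0, n            # invariant: oracle(lo) true, oracle(hi) false
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if oracle(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

For example, `find_max_range(lambda t: t <= 5, 32)` returns 5 after a handful of probes.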
    We note that unlike Propositions 1 and 2, we do not know whether \(\mathsf {FindMaxRange}\) can be implemented efficiently when \(\varphi\) is a DNF formula. For a degree-s polynomial \(h: \mathbb {F}_{2^n} \rightarrow \mathbb {F}_{2^n}\) , we can efficiently test whether h has a root by computing \(\mathsf {gcd}(h(x), x^{2^n}-x)\) , but it is not clear how to simultaneously constrain some variables according to a DNF term.
    Equipped with Proposition 3, we obtain \(\mathsf {ApproxModelCountEst}\) that takes in a formula \(\varphi\) and a suitable value of r and returns \(|\mathsf {Sol}(\varphi)|\) . The key idea of \(\mathsf {ApproxModelCountEst}\) is to repeatedly invoke \(\mathsf {FindMaxRange}\) for each of the chosen hash functions and compute the estimate based on the sketch \(\mathcal {S}\) and the value of r. The following theorem summarizes the time complexity and guarantees of \(\mathsf {ApproxModelCountEst}\) for CNF formulas.
    Theorem 4.
    Given a CNF formula \(\varphi\) , parameters \(\varepsilon\) and \(\delta\) , and r such that \(2F_0 \le 2^r \le 50F_0\) , the algorithm \(\mathsf {ApproxModelCountEst}\) returns c satisfying
    \(\begin{equation*} \Pr \left[ \frac{|\mathsf {Sol}(\varphi)}{1+\varepsilon } \le c \le (1+\varepsilon)|\mathsf {Sol}(\varphi)|\right] \ge 1- \delta . \end{equation*}\)
    \(\mathsf {ApproxModelCountEst}\) makes \(\mathcal {O}(\frac{1}{\varepsilon ^2} \log n \log (\frac{1}{\delta }))\) calls to an NP oracle.
In order to obtain r, we run in parallel another counting algorithm based on the simple \(F_0\) -estimation algorithm [3, 32], which we call \(\mathsf {FlajoletMartin}\) . Given a stream \(\mathbf {a}\) , the \(\mathsf {FlajoletMartin}\) algorithm chooses a random pairwise-independent hash function \(h \in \mathcal {H}_{\mathsf {xor}}(n,n)\) , computes the largest r such that for some \(x \in \mathbf {a}_u\) , the r least significant bits of \(h(x)\) are zero, and outputs r. Alon, Matias, and Szegedy [3] showed that \(2^r\) is a 5-factor approximation of \(F_0\) with probability \(3/5\) . Using our recipe, we can convert \(\mathsf {FlajoletMartin}\) into an algorithm that approximates the number of solutions to a CNF formula \(\varphi\) within a factor of 5 with probability 3/5. It is easy to check that, using the same idea as in Proposition 3, this algorithm requires \(O(\log n)\) calls to an NP oracle.

    3.5 Role of the Sketch Complexity

In the design of streaming algorithms, reducing the space complexity is of primary concern, whereas in model counting the goal is to minimize the runtime or the number of NP queries made. Having established a recipe to transform sketch-based streaming algorithms into model counting algorithms, a natural question that arises is the relationship between the space complexity of the streaming algorithm and the number of NP queries made by the model counting algorithm. In this section, we attempt to clarify this relationship. In the following, we fold the hash function h into the sketch S; with this simplification, instead of writing \(P(S,h,\mathsf {Sol}(\varphi))\) we write \(P(S,\mathsf {Sol}(\varphi))\) .
    We first introduce some complexity-theoretic notation. For a complexity class \(\mathcal {C}\) , a language L belongs to the complexity class \(\exists \cdot \mathcal {C}\) if there is a polynomial \(q(\cdot)\) and a language \(L^{\prime } \in \mathcal {C}\) such that for every x
    \(\begin{equation*} x \in L \Leftrightarrow \exists y, |y| \le q(|x|), \langle x, y\rangle \in L^{\prime }. \end{equation*}\)
Consider a streaming algorithm for \(F_0\) that constructs a sketch S such that \(P(S, \mathbf {a}_u)\) holds for some property P from which we can estimate \(|\mathbf {a}_u|\) , where the size of S is poly-logarithmic in the size of the universe and polynomial in \(1/\varepsilon\) . Now consider the following sketch language
    \(\begin{equation*} L_{sketch} = \lbrace \langle \varphi , {S}\rangle ~|~{P}(S, \mathsf {Sol}(\varphi)) \mbox{ holds}\rbrace . \end{equation*}\)
    Theorem 5.
    If \(L_{sketch}\) belongs to the complexity class \({\mathcal {C}}\) , then there exists a \({\rm FP}^{\exists \cdot \mathcal {C}}\) model counting algorithm that estimates the number of satisfying assignments of a given formula \(\varphi\) . The number of queries made by the algorithm is bounded by the sketch size.
    Proof.
The proof uses the standard prefix search. Consider the following prefix language
\(\begin{equation*} pre(L_{sketch}) = \lbrace \langle \varphi , S^{\prime }\rangle ~|~\exists S^{\prime \prime } \mbox{ such that } \langle \varphi , S^{\prime }S^{\prime \prime }\rangle \in L_{sketch}\rbrace . \end{equation*}\)
    It is easy to see that, using prefix search, there is an algorithm that makes queries to the language \(pre(L_{sketch})\) and constructs a sketch S such that \(P(S, \mathsf {Sol}(\varphi))\) holds. In this algorithm since each query reveals one bit of the sketch, the number of queries is bounded by the size of the sketch. Recall that the size of the sketch is poly-logarithmic in the size of the universe, which is \(2^n\) (where n is the number of variables of \(\varphi\) ), and polynomial in \(1/\varepsilon\) . Thus the number of calls made by the algorithm is polynomial in n and \(1/\varepsilon\) . Furthermore, note that \(pre(L_{sketch})\) belongs to the complexity class \(\exists \cdot \mathcal {C}\) .□
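The query pattern in the proof can be written generically; `prefix_oracle` abstracts membership in \(pre(L_{sketch})\) , and the bit-level encoding is our illustrative choice.

```python
def reconstruct_sketch(phi, prefix_oracle, size):
    # Recover a sketch bit by bit: one pre(L_sketch) query per bit, so the
    # number of queries equals the sketch size. Greedily preferring 0
    # yields the lexicographically smallest valid sketch.
    bits = []
    for _ in range(size):
        bits.append(0 if prefix_oracle(phi, bits + [0]) else 1)
    return bits
```

With a mock oracle that accepts exactly the prefixes of one valid sketch, the routine recovers that sketch using one query per bit.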
    The above theorem gives a general upper bound on the complexity of the model counting algorithm based on the complexity of the language \(L_{sketch}\) . In the specific instances that we illustrate (bucketing, minimum, and estimation), the sketch language is in \({\rm coNP}\) . This will lead to a \({\rm FP}^{\Sigma _2^{\mathrm{P}}}\) algorithm for model counting. For example, consider the minimum-based algorithm. The sketch language is the following:
    \(\begin{equation*} \lbrace \langle \varphi , \langle h, v_1, \ldots , v_t\rangle \rangle ~|~\lbrace v_1, \ldots , v_t\rbrace \mbox{ is the set of $t$ lex-smallest elements of } h(\mathsf {Sol}(\varphi))\rbrace . \end{equation*}\)
The above language is in the class \({\rm coNP}\) : If \(\langle \varphi , \langle h, v_1, \ldots , v_t\rangle \rangle\) does not belong to the sketch language, then there is a satisfying assignment a of \(\varphi\) such that there exists i, \(0 \le i \le t-1\) and \(v_i \lt h(a) \lt v_{i+1}\) (where \(v_0\) is the empty string). Thus an NP machine for the complement language works by guessing an assignment a and verifying that a satisfies \(\varphi\) and \(h(a)\) lies between \(v_i\) and \(v_{i+1}\) for some i, \(0 \le i \le t-1\) . Thus the sketch language is in coNP. Since \(\exists \cdot {\rm coNP}\) is the same as the class \(\Sigma ^{\mathrm{P}}_2\) , we obtain a \({\rm FP}^{\Sigma _2^{\mathrm{P}}}\) algorithm. Since \(t = O(1/\varepsilon ^2)\) and h maps from n-bit strings to 3n-bit strings, it follows that the size of the sketch is \(O(n/\varepsilon ^2)\) . Thus the number of queries made by the algorithm is \(O(n/\varepsilon ^2)\) .
Note that all three model counting algorithms obtained above are probabilistic polynomial-time algorithms that make queries to languages in NP. The above generic transformation gives a deterministic polynomial-time algorithm that makes queries to a \(\Sigma _2^{\mathrm{P}}\) oracle. Precisely characterizing the properties of the sketch that lead to probabilistic algorithms making only NP queries is an interesting direction to explore.

    3.6 The Opportunities Ahead

    As noted in Section 3.2, the algorithms based on Bucketing were already known and have witnessed a detailed technical development from both applied and algorithmic perspectives. The model counting algorithms based on Minimum and Estimation are new. We discuss some potential implications of these new algorithms to SAT solvers and other aspects.
MaxSAT solvers with native support for XOR constraints. When the input formula \(\varphi\) is represented as CNF, then \(\mathsf {ApproxMC}\) , the model counting algorithm based on the Bucketing strategy, invokes the NP oracle over CNF-XOR formulas, i.e., formulas expressed as a conjunction of CNF and XOR constraints. The XOR constraints appear because the hash functions are computed as XORs of the input variables. The significant improvement in the runtime performance of \(\mathsf {ApproxMC}\) owes to the design of SAT solvers with native support for CNF-XOR formulas [59, 60, 61]. Such solvers have now found applications in other domains such as cryptanalysis. It is perhaps worth emphasizing that the proposal of \(\mathsf {ApproxMC}\) was crucial to the renewed interest in the design of SAT solvers with native support for CNF-XOR formulas. As observed in Section 3.3, the algorithm based on the Minimum strategy would ideally invoke a MaxSAT solver that can handle XOR constraints natively. We believe that the Minimum-based algorithm will ignite interest in the design of MaxSAT solvers with native support for XOR constraints.
FPRAS for DNF based on Estimation. In Section 3.4, we were unable to show that the model counting algorithm obtained from Estimation is an FPRAS when \(\varphi\) is represented as DNF. Algorithms based on Estimation have been shown to achieve optimal space efficiency in the context of \(F_0\) estimation. In this context, an open problem is to investigate whether the Estimation-based strategy lends itself to an FPRAS for DNF counting.
Empirical Study of FPRAS for DNF Based on Minimum. Meel et al. [50, 51] observed that the FPRAS for DNF based on Bucketing has superior performance, in terms of the number of instances solved, to FPRAS schemes based on the Monte Carlo framework. In this context, a natural direction for future work is an empirical study of the behavior of the FPRAS scheme based on the Minimum strategy.

    4 From L0 Sampling to Constrained Sampling

There has been considerable work on sampling elements from data streams [20, 33, 41, 52]. In particular, for a data stream \(\mathbf {a}\) , one would like to generate a uniform sample from \(\mathbf {a}_u\) , the set of unique elements of the stream \(\mathbf {a}\) . This problem is known as \(L_0\) sampling. It is well known that counting and sampling are closely related problems. In particular, Jerrum, Valiant, and Vazirani [40] demonstrated that model counting and constrained sampling (for example, generating uniform samples from the set of satisfying assignments of a Boolean formula) are inter-reducible. Therefore, a natural question is whether known \(L_0\) sampling algorithms can be similarly transformed into constrained sampling algorithms. In this section, we answer this question affirmatively for a broad class of \(L_0\) sampling algorithms.
Our recipe for the transformation is based on the following unifying framework presented by Cormode and Firmani [20]. This framework involves three steps: sampling, recovery, and selection.
    Sampling
For a given stream \(\mathbf {a}\) and its corresponding unique set \(\mathbf {a}_u\) , the sampling process implicitly defines m subsets of \(\mathbf {a}\) , say \(\mathcal {S}[0], \mathcal {S}[1], \ldots , \mathcal {S}[m-1]\) . These subsets are not stored explicitly but are summarized implicitly.
    Recovery
The recovery step seeks to recover every subset \(\mathcal {S}[i]\) with \(|\mathcal {S}[i]|\lt s\) , for an appropriately chosen parameter s. We call such a set s-sparse.
    Selection
In order to draw a sample, the \(L_0\) sampler seeks to choose a level \(i \in [m]\) such that \(\mathcal {S}[i]\) is s-sparse but not empty. In such a case, the element \(y \in \mathcal {S}[i]\) for which \(h(y)\) is the smallest among all the recovered elements is chosen.
    Based on the above framework, Cormode and Firmani synthesized the known samplers into the algorithm presented in Algorithm 8.
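The three steps can be mocked up in Python. Level membership is abstracted behind `in_level`, so the sketch stays agnostic to how Algorithm 8 defines the implicit subsets; here the levels subsample geometrically (roughly a \(2^{-i}\) fraction survives at level i), a toy modular hash stands in for the sampler's pairwise independent hash, and all names are illustrative.

```python
import random

def l0_sample(distinct, n_levels=16, s=8, seed=0):
    # Toy pairwise-style hash: (a*x + b) mod a Mersenne prime.
    rng = random.Random(seed)
    p = (1 << 61) - 1
    a, b = rng.randrange(1, p), rng.randrange(p)
    h = lambda x: (a * x + b) % p

    def in_level(x, i):
        # Sampling: level i keeps items whose hash is divisible by 2^i,
        # i.e., roughly a 2^-i fraction of the distinct items.
        return h(x) % (1 << i) == 0

    for i in range(n_levels):
        # Recovery: a level is recoverable only if it is s-sparse.
        level = {x for x in distinct if in_level(x, i)}
        if 0 < len(level) <= s:
            # Selection: return the element with the smallest hash.
            return min(level, key=h)
    return None  # FAIL, as in the framework
```

When the unique set itself has at most s elements, level 0 is already recoverable and the sampler never fails.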

    4.1 A Recipe for Transformation

    Our recipe for the transformation of \(L_0\) sampling algorithms captured by the unified framework of Algorithm 8 to constrained sampling is based on two simple insights:
    (1)
Similar to the recipe for the transformation of \(F_0\) estimation to model counting, for each i, we capture the relationship \(\mathcal {P}(\mathcal {S}[i],h, \mathbf {a}_u)\) between the implicit subset \(\mathcal {S}[i]\) , the hash function h, and the set \(\mathbf {a}_u\) at the end of the stream. Again, we can view a formula \(\varphi\) as a symbolic representation of some unique set \(\mathbf {a}_u\) such that \(\mathsf {Sol}(\varphi) = \mathbf {a}_u\) .
    (2)
    The exact s-sparse recovery step can be simulated by a generalization of \(\mathsf {BoundedSAT}\) , i.e., given \(\varphi\) , the hash function h, and a number i, we can reconstruct \(\mathcal {S}[i]\) (if \(S[i]\) is small) such that \(\mathcal {P}(\mathcal {S}[i], h, \mathsf {Sol}(\varphi))\) holds.
As an example, consider Algorithm 8, for which we can formalize the property \(\mathcal {P}_{4}(\mathcal {S}[i],h, \mathbf {a}_u)\) as follows:
    \(\begin{align*} \mathcal {P}_{4}(\mathcal {S}[i],h, \mathbf {a}_u):= S[i]= \lbrace x \mid \mathsf {TrailZero}(h(x)) \le i \wedge x \in \mathbf {a}_u \rbrace . \end{align*}\)
    We apply the above recipe to translate the unified algorithm presented in Algorithm 8 to one for constrained sampling. To this end, we rely on the following generalization of \(\mathsf {BoundedSAT}\) that can simulate exact sparse recovery.
    Proposition 4 (Lemma 3.7 of [9]).
There is an algorithm \(\mathsf {GenBoundedSAT}\) that takes \(\varphi\) over n variables, a hash function \(h \in \mathcal {H}_{k\mbox{-}wise}(n,3n)\) , and numbers m and p as inputs, and returns \(\mathcal {L}\) such that \(\mathcal {L} \subseteq \mathsf {Sol}(\varphi \wedge \mathsf {TrailZero}(h(x)) \le m)\) and \(|\mathcal {L}| = \min (p, |\mathsf {Sol}(\varphi \wedge \mathsf {TrailZero}(h(x)) \le m)|)\) . It makes \(\mathcal {O}(p \cdot n)\) calls to an NP oracle.
    Equipped with \(\mathsf {GenBoundedSAT}\) , we present the algorithm \(\mathsf {UnifSampler}\) in Algorithm 9 that takes in a formula \(\varphi\) , tolerance parameter \(\varepsilon\) , and confidence parameter \(\delta\) , and returns a sample \(\sigma \in \mathsf {Sol}(\varphi)\) . Since \(\mathsf {GenBoundedSAT}\) implements exact sparse recovery, the algorithm \(\mathsf {UnifSampler}\) enjoys theoretical guarantees for the quality of its samples.
    Theorem 6.
    For a given formula \(\varphi\) , tolerance parameter \(\varepsilon\) , and confidence parameter \(\delta\) , \(\mathsf {UnifSampler}\) succeeds (i.e., does not return FAIL) with probability at least \(1-\delta\) , and conditioned on success, outputs \(\sigma \in \mathsf {Sol}(\varphi)\) with probability \(\frac{1 \pm \varepsilon }{|\mathsf {Sol}(\varphi)|} \pm \delta\) .

    5 Distributed DNF Counting

Consider the problem of distributed DNF counting. In this setting, there are k sites that can each communicate with a central coordinator. The input DNF formula \(\varphi\) is partitioned into k DNF subformulas \(\varphi _1, \ldots , \varphi _k\) , where each \(\varphi _i\) is a subset of the terms of the original \(\varphi\) , with the j’th site receiving only \(\varphi _j\) . The goal is for the coordinator to obtain an \((\epsilon ,\delta)\) -approximation of the number of solutions to \(\varphi\) , while minimizing the total number of bits communicated between the sites and the coordinator. Distributed algorithms for sampling and counting solutions to CSPs have been studied recently in other models of distributed computation [28, 29, 30, 31]. From a practical perspective, given the centrality of #DNF in the context of probabilistic databases [55, 56], a distributed DNF counting algorithm would have applications in distributed probabilistic databases.
    From our perspective, distributed DNF counting falls within the distributed functional monitoring framework formalized by Cormode et al. [23]. Here, the input is a stream \(\mathbf {a}\) which is partitioned arbitrarily into sub-streams \(\mathbf {a}_1, \ldots , \mathbf {a}_k\) that arrive at each of k sites. Each site can communicate with the central coordinator, and the goal is for the coordinator to compute a function of the joint stream \(\mathbf {a}\) while minimizing the total communication. This general framework has several direct applications and has been studied extensively [4, 6, 21, 24, 37, 45, 46, 47, 58, 67, 68, 70]. In distributed DNF counting, each sub-stream \(\mathbf {a}_i\) corresponds to the set of satisfying assignments to each subformula \(\varphi _i\) , while the function to be computed is \(F_0\) .
    The model counting algorithms discussed in Section 3 can be extended to the distributed setting, using the mergeability of the underlying sketches. We describe next the distributed implementations for each of the three algorithms. As earlier, we set the parameters \(\mathsf {Thresh}\) to \(O(1/\varepsilon ^2)\) and t to \(O(\log (1/\delta))\) . We use a variant of \(\mathsf {BoundedSAT}\) that takes in \(\varphi\) over n variables, a function \(h \in \mathcal {H}_{\mathsf {Toeplitz}}(n, m)\) , and a threshold t as inputs, and returns a set U of solutions such that \(|U| = \min (t, |\mathsf {Sol}(\varphi \wedge h(x) = {0^m})|)\) , instead of returning \(|U|\) itself.
    Bucketing. Setting \(\ell =\) \(O(\log (k/\delta \varepsilon ^2))\) , the coordinator chooses \(H[1], \ldots , H[t]\) from \(\mathcal {H}_{\mathsf {Toeplitz}}(n,n)\) and G from \(\mathcal {H}_{\mathsf {xor}}(n,\ell)\) . It then sends them to the k sites, along with the values of t and \(\mathsf {thresh}\) . Let \(m_{i,j}\) be the smallest m such that the size of the set \(\mathsf {BoundedSAT}(\varphi _j, H[i]_m, \mathsf {thresh})\) is smaller than \(\mathsf {thresh}\) . The j’th site sends to the coordinator the following tuples:
    \(\begin{equation*} \langle i, G(x), \mathsf {TrailZero}(H[i](x)), m_{i,j}\rangle \end{equation*}\)
    for each \(i \in [t]\) and for each x in \(\mathsf {BoundedSAT}(\varphi _j, H[i]_{m_{i,j}},\) \(\mathsf {thresh})\) .
    Each of the k sites only sends tuples for at most \(O(1/\varepsilon ^2)\) choices of x. By a standard union-bound argument, G hashes these x to distinct values with probability \(1-\delta /2\) . The coordinator can then execute the rest of the algorithm, as shown in the coordinator part of \(\mathsf {ApproxMCDis}\) . For each \(i = 1,\ldots , t\) , it merges the lists sent over by each of the k sites to get a final list consisting of the hashes of at most \(\mathsf {Thresh}\) elements that (i) have at least \(M[i]\) many trailing zeros when hashed by \(H[i]\) and (ii) satisfy the subformula for at least one of the sites. The communication cost is \(\tilde{O}(k(n+1/\varepsilon ^2) \cdot \log (1/\delta))\) , and the time complexity for each site is polynomial in n, \(\varepsilon ^{-1}\) , and \(\log (\delta ^{-1})\) .
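As an illustration, the site-side and coordinator-side steps of the Bucketing scheme can be sketched in Python as follows. This is a toy stand-in, not the protocol itself: explicit solution sets replace the \(\mathsf {BoundedSAT}\) calls, low-order bits of the hash play the role of the prefix slice \(H[i]_m\) , and elements are shipped directly rather than as fingerprints \(G(x)\) ; the hash is a GF(2) linear map given by its rows, and all helper names are ours.

```python
def gf2_matvec(rows, x):
    # Bit i of the output is the GF(2) inner product of rows[i] and x.
    y = 0
    for i, r in enumerate(rows):
        y |= (bin(r & x).count("1") & 1) << i
    return y

def trailing_zeros(y, n):
    # Number of trailing zeros of an n-bit hash value (n if y == 0).
    for m in range(n):
        if (y >> m) & 1:
            return m
    return n

def site_message(solutions, h_rows, thresh, n):
    # Brute-force stand-in for BoundedSAT: find the smallest m such that
    # fewer than thresh solutions have their m low-order hash bits all zero,
    # then report those solutions with their trailing-zero counts.
    for m in range(n + 1):
        cell = [x for x in solutions
                if (gf2_matvec(h_rows, x) & ((1 << m) - 1)) == 0]
        if len(cell) < thresh:
            return m, [(x, trailing_zeros(gf2_matvec(h_rows, x), n))
                       for x in cell]
    return n, []  # degenerate fallback for highly colliding hashes

def coordinator_estimate(messages):
    # Merge step: keep elements whose hash has at least M = max_j m_j
    # trailing zeros; each survivor stands for about 2^M distinct solutions.
    M = max(m for m, _ in messages)
    survivors = {x for _, cell in messages for (x, tz) in cell if tz >= M}
    return len(survivors) * (1 << M)
```

With the identity hash and a threshold larger than the set sizes, the level stays at 0 and the estimate is exact; the \((\varepsilon ,\delta)\) guarantees of the actual protocol come from random Toeplitz hashes and the parameter settings in the text.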
Minimum. The coordinator chooses hash functions \(H[1],\ldots ,H[t]\) from \(\mathcal {H}_{\mathsf {Toeplitz}}(n,3n)\) and sends them to the k sites. Each site runs the \(\mathsf {FindMin}\) algorithm for each hash function and sends the outputs to the coordinator. So, the coordinator receives sets \(S[i,j]\) , consisting of the \(\mathsf {Thresh}\) lexicographically smallest hash values, under \(H[i]\) , of the solutions to \(\varphi _j\) . The coordinator then extracts \(S[i]\) , the \(\mathsf {Thresh}\) lexicographically smallest elements of \(S[i,1] \cup \cdots \cup S[i,k]\) , and proceeds with the rest of the algorithm \(\mathsf {ApproxModelCountMin}\) . The communication cost is \(O(kn/\varepsilon ^2 \cdot \log (1/\delta))\) to account for the k sites sending the outputs of their \(\mathsf {FindMin}\) invocations. The time complexity for each site is polynomial in n, \(\varepsilon ^{-1}\) , and \(\log (\delta ^{-1})\) .
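The coordinator's merge step in the Minimum-based scheme is simple enough to state directly. The Python sketch below (function names are ours) keeps the \(\mathsf {Thresh}\) smallest hash values over all sites and applies the standard minimum-value estimator, where out_bits is the output length of the hash (3n in the text).

```python
import heapq

def merge_site_minima(site_outputs, thresh):
    # The thresh smallest hash values over all sites, de-duplicated, are
    # exactly the thresh smallest hash values of Sol(phi_1 v ... v phi_k)
    # under the shared hash function.
    return heapq.nsmallest(thresh, set().union(*site_outputs))

def estimate_f0(minima, thresh, out_bits):
    # Standard minimum-value estimator: if fewer than thresh hash values
    # exist, every solution was seen; otherwise scale the hash range by
    # the thresh-th smallest value.
    if len(minima) < thresh:
        return len(minima)
    return (thresh * (1 << out_bits)) // minima[-1]
```

Here numeric order on hash values stands in for the lexicographic order on bit strings used in the article.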
Estimation. For each \(i \in [t]\) , the coordinator chooses \(\mathsf {Thresh}\) hash functions \(H[i,1], \ldots , H[i,\mathsf {Thresh}]\) , drawn pairwise independently from \(\mathcal {H}_{s-{\rm wise}}(n,n)\) (for \(s = O(\log (1/\varepsilon))\) ), and sends them to the k sites. Each site runs the \(\mathsf {FindMaxRange}\) algorithm for each hash function and sends the output to the coordinator. Suppose the coordinator receives \(S[i,j, \ell ] \in [n]\) for each \(i \in [t], j \in [\mathsf {Thresh}]\) , and \(\ell \in [k]\) . It computes \(S[i,j] = \max _\ell S[i,j,\ell ]\) . The rest of \(\mathsf {ApproxModelCountEst}\) is then executed by the coordinator. The communication cost is \(\tilde{O}(k(n+1/\varepsilon ^2)\log (1/\delta))\) .

    Lower Bound

The communication cost for the Bucketing and Estimation-based algorithms is nearly optimal in their dependence on k and \(\varepsilon\) . Woodruff and Zhang [67] showed that the randomized communication complexity of estimating \(F_0\) up to a \(1+\varepsilon\) factor in the distributed functional monitoring setting is \(\Omega (k/\varepsilon ^2)\) . We can reduce the \(F_0\) estimation problem to distributed DNF counting. Namely, if for the \(F_0\) estimation problem, the j’th site receives items \(a_1, \ldots , a_m \in [N]\) , then for the distributed DNF counting problem, \(\varphi _j\) is a DNF formula on \(\lceil \log _2 N \rceil\) variables whose solutions are exactly \(a_1, \ldots , a_m\) in their binary encoding. Thus, we immediately get an \(\Omega (k/\varepsilon ^2)\) lower bound for the distributed DNF counting problem. Finding the optimal dependence on N for \(k\gt 1\) remains an interesting open question.3
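The reduction in the lower-bound argument is constructive. A minimal Python sketch (assuming items are drawn from \(\lbrace 0, \ldots , N-1\rbrace\) ; helper names are ours) is:

```python
def items_to_dnf(items, N):
    # Each item a becomes one term over ceil(log2 N) variables that fixes
    # every bit of a's binary encoding, so Sol(phi_j) is exactly `items`.
    n = (N - 1).bit_length()
    terms = [{i: (a >> i) & 1 for i in range(n)} for a in items]
    return n, terms

def dnf_solutions(n, terms):
    # Brute-force satisfying assignments (used here only to check the
    # reduction on small instances).
    return {x for x in range(1 << n)
            if any(all(((x >> i) & 1) == v for i, v in t.items())
                   for t in terms)}
```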

    6 From Counting to Streaming: Structured Set Streaming

In this section, we consider the structured set streaming model, where each item \(S_i\) of the stream is a succinct representation of a set over the universe \(U = \lbrace 0,1\rbrace ^n\) . Our goal is to design efficient algorithms (both in terms of memory and processing time per item) for computing \(|\cup _i S_i|\) — the number of distinct elements in the union of all the sets in the stream. We call this problem \(F_0\) computation over structured set streams.

    DNF Sets

A particular representation we are interested in is one in which each set is presented as the set of satisfying assignments to a DNF formula. Let \(\varphi\) be a DNF formula over n variables. Then the DNF Set corresponding to \(\varphi\) is the set of satisfying assignments of \(\varphi\) . The size of this representation is the number of terms in the formula \(\varphi\) .
A stream over DNF sets is a stream of DNF formulas \(\varphi _1, \varphi _2, \ldots\) . Given such a DNF stream, the goal is to estimate \(|\bigcup _{i} S_i|\) , where \(S_i\) is the DNF set represented by \(\varphi _i\) . This quantity is the same as the number of satisfying assignments of the formula \(\vee _i \varphi _i\) . We show that the algorithms described in the previous section carry over to obtain \((\epsilon , \delta)\) estimation algorithms for this problem with space and per-item time \(\mathrm{poly}(1/\epsilon , n, k, \log (1/\delta))\) , where k is the size of the formula.
Notice that this model generalizes the traditional streaming model, where each item of the stream is an element \(x\in U\) , since such an item can be represented as a single-term DNF formula \(\phi _x\) whose only satisfying assignment is x. This model also generalizes certain other models considered in the streaming literature that we discuss later.
    Theorem 7.
    There is a streaming algorithm to compute an \((\epsilon , \delta)\) approximation of \(F_0\) over DNF sets. This algorithm takes space \(O({n\over \varepsilon ^2}\cdot \log {1\over \delta })\) and processing time \(O(n^4\cdot k\cdot {1\over \varepsilon ^2}\cdot \log {1\over \delta })\) per item where k is the size (number of terms) of the corresponding DNF formula.
    Proof.
We show how to adapt the Minimum-value based algorithm from Section 3.3 to this setting. The algorithm picks a hash function \(h \in \mathcal {H}_{\mathsf {Toeplitz}}(n,3n)\) and maintains the set \(\mathcal {B}\) consisting of the t lexicographically minimum elements of the set \(\lbrace h(\mathsf {Sol}(\varphi _1 \vee \cdots \vee \varphi _{i-1}))\rbrace\) after processing \(i-1\) items. When \(\varphi _i\) arrives, it computes the set \(\mathcal {B^{\prime }}\) consisting of the t lexicographically minimum values of the set \(\lbrace h(\mathsf {Sol}(\varphi _i))\rbrace\) and subsequently updates \(\mathcal {B}\) by computing the t lexicographically smallest elements of \(\mathcal {B}\cup \mathcal {B^{\prime }}\) . By Proposition 2, the computation of \(\mathcal {B^{\prime }}\) can be done in time \(O(n^4\cdot k \cdot t)\) , where k is the number of terms in \(\varphi _i\) . Updating \(\mathcal {B}\) can be done in \(O(t\cdot n)\) time. Thus, the update time for the item \(\varphi _i\) is \(O(n^4 \cdot k \cdot t)\) . To obtain an \((\varepsilon , \delta)\) approximation, we set \(t = O({1\over \varepsilon ^2})\) , repeat the procedure \(O(\log {1\over \delta })\) times, and take the median value. Thus, the update time for item \(\varphi _i\) is \(O(n^4\cdot k\cdot {1\over \varepsilon ^2}\cdot \log {1\over \delta })\) . For analyzing space, each hash function uses \(O(n)\) bits, and storing the \(O({1 \over \varepsilon ^2})\) minimums requires \(O({n \over \varepsilon ^2})\) space, resulting in an overall space usage of \(O({n\over \varepsilon ^2}\cdot \log {1\over \delta })\) . The proof of correctness follows from Lemma 2.□
Instead of the Minimum-value based algorithm, we could adapt the Bucketing-based algorithm to obtain an algorithm with similar space and time complexities. As noted earlier, some of the set streaming models considered in the literature can be reduced to DNF set streaming. We discuss them next.
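The bookkeeping in the proof above can be rendered as a toy Python sketch. It brute-forces \(\mathsf {Sol}(\varphi _i)\) term by term in place of the NP-oracle-based \(\mathsf {FindMin}\) of Proposition 2, and uses the identity hash for determinism where the algorithm would draw \(h\) from \(\mathcal {H}_{\mathsf {Toeplitz}}(n,3n)\) ; terms are dictionaries mapping a variable index to its forced bit, and all names are ours.

```python
import heapq

def term_solutions(term, n):
    # Enumerate satisfying assignments of one DNF term (var -> forced bit).
    # Brute-force stand-in for the NP-oracle prefix search of Proposition 2.
    free = [i for i in range(n) if i not in term]
    base = sum(v << i for i, v in term.items())
    out = []
    for mask in range(1 << len(free)):
        x = base
        for j, i in enumerate(free):
            x |= ((mask >> j) & 1) << i
        out.append(x)
    return out

class MinSketch:
    # Maintains the t smallest hash values of h(Sol(phi_1 v ... v phi_i)).
    def __init__(self, h, t):
        self.h, self.t, self.B = h, t, []

    def process(self, dnf, n):
        hashes = {self.h(x) for term in dnf for x in term_solutions(term, n)}
        self.B = heapq.nsmallest(self.t, set(self.B) | hashes)
```

When fewer than t hash values have been collected, \(|\mathcal {B}|\) is the exact count of distinct solutions seen so far; the scaling estimator of Section 3.3 takes over once \(\mathcal {B}\) is full.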

    Multidimensional Ranges

A d-dimensional range over a universe \(U = \lbrace 0, \ldots , 2^n-1\rbrace\) is defined as \([a_1,b_1] \times [a_2,b_2] \times \cdots \times [a_d, b_d]\) . Such a range represents the set of tuples \((x_1,\ldots ,x_d)\) where \(a_i \le x_i \le b_i\) and each \(x_i\) is an integer. Note that every d-dimensional range can be succinctly represented by the tuple \(\langle a_1, b_1, \ldots , a_d, b_d\rangle\) . A multi-dimensional stream is a stream where each item is a d-dimensional range. The goal is to compute \(F_0\) of the union of the d-dimensional ranges efficiently. We will show that \(F_0\) computation over multi-dimensional ranges can be reduced to \(F_0\) computation over DNF sets. Using this reduction, we arrive at a simple algorithm to compute \(F_0\) over multi-dimensional ranges.
    Lemma 4.
    Any d-dimensional range R over U can be represented as a DNF formula \(\varphi _R\) over nd variables whose size is at most \((2n)^d\) . There is an algorithm that takes R as input and outputs the \(i^{th}\) term of \(\varphi _R\) using \(O(nd)\) space, for \(1 \le i \le (2n)^d\) .
    Proof.
Let \(R = [a_1,b_1] \times [a_2,b_2] \times \cdots \times [a_d, b_d]\) be a d-dimensional range over \(U^{d}\) . We first describe how to represent the multi-dimensional range as a conjunction of d DNF formulae \(\varphi _1, \ldots , \varphi _d\) , each with at most 2n terms, where \(\varphi _i\) represents \([a_i,b_i]\) , the range in the \(i^{th}\) dimension. Converting this conjunction into a DNF formula results in the formula \(\varphi _R\) with at most \((2n)^d\) terms.
For any \(\ell\) -bit number c, \(0 \le c \le 2^{\ell }-1\) , it is straightforward to write a DNF formula \(\varphi _{\le c}\) , of size at most \(\ell\) , that represents the range \([0,c]\) (or equivalently the set \(\lbrace x\mid 0\le x \le c\rbrace\) ). Similarly, we can write a DNF formula \(\varphi _{\ge c}\) , of size at most \(\ell\) , for the range \([c,2^{\ell }-1]\) . Now we construct a formula representing the range \([a,b]\) over U as follows. Let \(a_1a_{2}\ldots a_n\) and \(b_1b_{2}\ldots b_n\) be the binary representations of a and b, respectively. Let \(\ell\) be the largest integer such that \(a_1a_2\ldots a_\ell = b_1b_2\ldots b_\ell\) . Hence \(a_{\ell +1} = 0\) and \(b_{\ell +1} = 1\) . Let \(a^{\prime }\) and \(b^{\prime }\) denote the integers represented by \(a_{\ell +2}\ldots a_n\) and \(b_{\ell +2} \ldots b_n\) , respectively. Also, let \(\psi\) denote the formula (a single term) that represents the string \(a_1\ldots a_\ell\) . Then the formula representing \([a,b]\) is \(\psi \wedge (\overline{x_{\ell +1}}\, \varphi _{\ge a^{\prime }} \vee x_{\ell +1}\varphi _{\le b^{\prime }})\) . This can be written as a DNF formula by distributing \(\psi\) ; the resulting formula has at most 2n terms and n variables. Note that each \(\varphi _i\) can be constructed using \(O(n)\) space. To obtain the final DNF representing the range R, we need to convert \(\varphi _1 \wedge \cdots \wedge \varphi _d\) into a DNF formula. It is easy to see that, for any i, the \(i\) th term of this DNF can be computed using space \(O(nd)\) . Note that this formula has nd variables, n variables per dimension.□
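The prefix decomposition in the proof can be made concrete. The following Python sketch is our own rendering of the single-dimensional construction (it includes an exact-match term for \(\varphi _{\le c}\) and \(\varphi _{\ge c}\) , which keeps the total within the 2n bound); terms are dictionaries mapping an MSB-first bit position to its forced value, and correctness can be checked by brute force.

```python
def to_bits(x, n):
    # MSB-first bit list, matching the a_1 ... a_n notation of the proof.
    return [(x >> (n - 1 - i)) & 1 for i in range(n)]

def le_terms(c):
    # DNF for {x : x <= c}: the exact term x = c, plus, for each 1-bit of c,
    # "match c's prefix, then a 0 where c has a 1, rest free".
    terms = [dict(enumerate(c))]
    for i, bit in enumerate(c):
        if bit == 1:
            t = {j: c[j] for j in range(i)}
            t[i] = 0
            terms.append(t)
    return terms

def ge_terms(c):
    # Dual construction for {x : x >= c}.
    terms = [dict(enumerate(c))]
    for i, bit in enumerate(c):
        if bit == 0:
            t = {j: c[j] for j in range(i)}
            t[i] = 1
            terms.append(t)
    return terms

def range_terms(a, b, n):
    # DNF terms whose union of solutions is exactly [a, b].
    A, B = to_bits(a, n), to_bits(b, n)
    if a == b:
        return [dict(enumerate(A))]
    l = 0
    while A[l] == B[l]:
        l += 1                       # A[l] = 0 and B[l] = 1 at the split
    prefix = {j: A[j] for j in range(l)}
    out = []
    for branch, sub in ((0, ge_terms(A[l + 1:])), (1, le_terms(B[l + 1:]))):
        for t in sub:
            nt = dict(prefix)
            nt[l] = branch
            nt.update({l + 1 + j: v for j, v in t.items()})
            out.append(nt)
    return out

def matches(x, term, n):
    xb = to_bits(x, n)
    return all(xb[i] == v for i, v in term.items())
```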
    Using the above reduction and Theorem 7, we obtain an algorithm for estimating \(F_0\) over multidimensional ranges in a range-efficient manner.
    Theorem 8.
There is a streaming algorithm to compute an \((\epsilon , \delta)\) approximation of \(F_0\) over d-dimensional ranges that takes space \(O(\frac{nd}{\varepsilon ^2}\cdot \log (1/\delta))\) and processing time \(O((nd)^4\cdot n^d \cdot \frac{1}{\varepsilon ^2}\cdot \log (1/\delta))\) per item.
    Remark 2.
Tirthapura and Woodruff [65] studied the problem of range-efficient estimation of \(F_k\) ( \(k^{th}\) frequency moments) over d-dimensional ranges. They claimed an algorithm to estimate \(F_0\) with space and per-item time complexity \(\mathrm{poly}(n, d, 1/\epsilon ,\log 1/\delta)\) . However, they have retracted their claim (Woodruff, Personal Communication, June 16, 2020). Their method only yields \(\mathrm{poly}(n^d,1/\epsilon ,\log 1/\delta)\) time per item. Their proof is based on recursive sketches [11, 38] as well as a range-efficient implementation of the count sketch algorithm [17]. We obtain the same complexity bounds with a much simpler analysis and a practically efficient algorithm that can use off-the-shelf implementations [50].
    Remark 3.
Subsequent to the present work, an improved algorithm for \(F_0\) over structured sets is presented in [64]. In particular, the article presents an \(F_0\) estimation algorithm, called \(\mathsf {APS\mbox{-}Estimator}\) , for streams over Delphic sets. A set \(S \subseteq \lbrace 0,1\rbrace ^n\) belongs to the Delphic family if the following queries can be done in \(O(n)\) time: (1) compute the size of the set S, (2) draw a uniform random sample from S, and (3) given any x, check if \(x\in S\) . The authors design a streaming algorithm that, given a stream \(\mathcal {S}= \langle S_1, S_2, \ldots , S_M \rangle\) wherein each \(S_i \subseteq \lbrace 0,1\rbrace ^n\) belongs to the Delphic family, computes an \((\varepsilon ,\delta)\) -approximation of \(| \bigcup _{i=1}^{M} S_i|\) with worst-case space complexity \(O(n\cdot \log (M/\delta)\cdot \varepsilon ^{-2})\) and per-item time \(\widetilde{O}(n \cdot \log (M/\delta)\cdot \varepsilon ^{-2})\) . The algorithm \(\mathsf {APS\mbox{-}Estimator}\) , when applied to d-dimensional ranges, gives per-item time and space complexity bounds that are \(\mathrm{poly}(n, d, \log M, 1/\varepsilon , \log 1/\delta)\) . While \(\mathsf {APS\mbox{-}Estimator}\) brings down the dependency on d from exponential to polynomial, it works under the assumption that the length of the stream M is known. The general setup presented in [64], however, can be applied to other structured sets considered in this article, including multidimensional arithmetic progressions.
Representing Multidimensional Ranges as CNF Formulas. Since the algorithm \(\mathsf {APS\mbox{-}Estimator}\) presented in [64] employs a sampling-based technique, a natural question is whether there exists a hashing-based technique that achieves per-item time polynomial in n and d. We note that the above approach of representing a multi-dimensional range as a DNF formula does not yield such an algorithm. This is because there exist d-dimensional ranges whose DNF representation requires \(\Omega (n^d)\) size.
    Observation 1.
    There exist d-dimensional ranges whose DNF representation has size \(\ge n^d\) .
    Proof.
The observation follows by considering the range \(R = [1,2^n-1]^d\) (only 0 is missing from the interval in each dimension). We argue that any DNF formula \(\varphi\) for this range has size (number of terms) \(\ge n^d\) . For any \(1\le j \le d\) , we use the set of variables \(X^{j}=\lbrace x^j_1,x^j_2,\ldots ,x^j_n\rbrace\) to represent the \(j^{th}\) coordinate of R. Then R can be represented as the formula \(\varphi _{R} = \vee _{(i_1,i_2,\ldots ,i_d)} x_{i_1}^1x_{i_2}^2\ldots x_{i_d}^d\) , where \(1\le i_j \le n\) . This formula has \(n^d\) terms. Let \(\varphi\) be any other DNF formula representing R. The main observation is that any term T of \(\varphi\) is completely contained (in terms of the set of solutions) in one of the terms of \(\varphi _{R}\) . Since each of the \(n^d\) terms of \(\varphi _{R}\) contains a solution that lies in no other term of \(\varphi _{R}\) , this implies that \(\varphi\) should have at least \(n^d\) terms. Now we argue that T is contained in one of the terms of \(\varphi _{R}\) . T must have at least one variable from each \(X^j\) appearing as a positive literal. Suppose T has no positive literal from \(X^j\) for some j. Then T contains a solution with all the variables in \(X^j\) set to 0, which does not belong to R. Now let \(x^j_{i_j}\) be a variable from \(X^j\) that appears positively in T. Then clearly T is contained in the term \(x^1_{i_1}x^2_{i_2}\ldots x^d_{i_d}\) of \(\varphi _{R}\) .□
    This leads to the question of whether we can obtain a super-polynomial lower bound on the time per item. We observe that such a lower bound would imply \(\mathrm{P} \ne \mathrm{NP}\) . For this, we note the following.
    Observation 2.
    Any d-dimensional range R can be represented as a CNF formula of size \(O(nd)\) over nd variables.
This is because a single-dimensional range \([a, b]\) can also be represented as a CNF formula of size \(O(n)\) [13], and thus the CNF formula for R is a conjunction of the formulas along each dimension. Thus, the problem of computing \(F_0\) over d-dimensional ranges reduces to computing \(F_0\) over a stream where each item of the stream is a CNF formula. As in the proof of Theorem 7, we can adapt the Minimum-value based algorithm to CNF streams. When a CNF formula \(\varphi _i\) arrives, we need to compute the t lexicographically smallest elements of \(h(\mathsf {Sol}(\varphi _i))\) , where \(h \in \mathcal {H}_{\mathsf {Toeplitz}}(n,3n)\) . By Proposition 2, this can be done in polynomial time by making \(O(tnd)\) calls to an NP oracle, since \(\varphi _i\) is a CNF formula over nd variables. Thus, if P equals NP, then the time taken per range is polynomial in n, d, and \(1/\varepsilon ^2\) . Hence, a super-polynomial lower bound on the time per item would imply that P differs from NP.
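For completeness, here is one standard CNF encoding of a single-dimensional range, sketched in Python (this is our own rendering in the spirit of the construction cited from [13], not necessarily the same one): \(x \ge a\) contributes, for every position where a has a 1, a clause forcing either that bit of x to be 1 or an earlier strict exceedance; \(x \le b\) is the dual.

```python
def to_bits(x, n):
    # MSB-first bit list of an n-bit number.
    return [(x >> (n - 1 - i)) & 1 for i in range(n)]

def range_cnf(a, b, n):
    # Clauses are lists of literals (position, required bit); a clause is
    # satisfied when x agrees with at least one of its literals.
    A, B = to_bits(a, n), to_bits(b, n)
    clauses = []
    for i in range(n):
        if A[i] == 1:   # enforce x >= a: x_i = 1, or x beats a earlier
            clauses.append([(j, 1) for j in range(i) if A[j] == 0] + [(i, 1)])
        if B[i] == 0:   # enforce x <= b: x_i = 0, or x drops below b earlier
            clauses.append([(j, 0) for j in range(i) if B[j] == 1] + [(i, 0)])
    return clauses

def satisfies(x, clauses, n):
    xb = to_bits(x, n)
    return all(any(xb[i] == v for i, v in c) for c in clauses)
```

The encoding produces at most 2n clauses of width at most n each, i.e., size \(O(n)\) as claimed, with no auxiliary variables.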

    From Weighted #DNF to d-Dimensional Ranges.

Designing a hashing-based streaming algorithm with a per-item update time polynomial in n and d is a very interesting open problem, with implications for weighted DNF counting. Consider a formula \(\varphi\) defined on the set of variables \(x = \lbrace x_1, x_2, \ldots , x_n \rbrace\) and a weight function \(\rho : x \mapsto (0,1)\) . The weight of an assignment \(\sigma\) is defined as follows:
    \(\begin{align*} W(\sigma) = \prod _{x_i: \sigma (x_i) = 1} \rho (x_i) \prod _{x_i:\sigma (x_i) = 0} (1-\rho (x_i)). \end{align*}\)
    Furthermore, we define the weight of a formula \(\varphi\) as
    \(\begin{align*} W(\varphi) = \sum _{\sigma \models \varphi } W(\sigma). \end{align*}\)
Given \(\varphi\) and \(\rho\) , the problem of weighted counting is to compute \(W(\varphi)\) . We consider the case where, for each \(x_i\) , \(\rho (x_i)\) is represented using \(m_i\) bits in binary, i.e., \(\rho (x_i) = \frac{k_i}{2^{m_i}}\) . Inspired by the key idea of the weighted-to-unweighted reduction due to Chakraborty et al. [13], we show how the problem of weighted DNF counting can be reduced to that of estimating \(F_0\) over n-dimensional ranges. The reduction is as follows: we transform every term of \(\varphi\) into a product of multidimensional ranges, where every variable \(x_i\) is replaced with the interval \([1,k_i]\) , every \(\lnot x_i\) is replaced with \([k_i+1, 2^{m_i}]\) , every variable not appearing in the term is replaced with the full interval \([1, 2^{m_i}]\) , and every \(\wedge\) is replaced with \(\times .\) For example, the term \((x_1 \wedge \lnot x_2 \wedge \lnot x_3)\) is replaced with \([1,k_1]\times [k_2+1, 2^{m_2}]\times [k_3+1,2^{m_3}]\) . Given the \(F_0\) of the resulting stream, we can compute the weight of \(\varphi\) simply as \(W(\varphi) = \frac{F_0}{2^{\sum _i m_i} }\) . Thus, a hashing-based streaming algorithm with \(poly(n, d)\) time per item yields a hashing-based FPRAS for weighted DNF counting, solving an open problem from [1].
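The reduction can be checked on small instances. In the Python sketch below (helper names are ours; terms are dictionaries mapping a variable index to its polarity), each term becomes a box of integer points, the size of the union of the boxes is the \(F_0\) value, and dividing by \(2^{\sum _i m_i}\) recovers \(W(\varphi)\) , matching a direct weighted sum over assignments.

```python
import itertools

def term_box(term, params):
    # params[i] = (k_i, m_i) encodes rho(x_i) = k_i / 2^{m_i}.
    # Positive literal -> [1, k_i]; negative -> [k_i + 1, 2^{m_i}];
    # a variable absent from the term spans its whole axis.
    box = []
    for i, (k, m) in enumerate(params):
        if term.get(i) == 1:
            box.append(range(1, k + 1))
        elif term.get(i) == 0:
            box.append(range(k + 1, (1 << m) + 1))
        else:
            box.append(range(1, (1 << m) + 1))
    return box

def weight_via_f0(terms, params):
    # W(phi) = F0(union of boxes) / 2^{sum_i m_i}
    points = set()
    for t in terms:
        points.update(itertools.product(*term_box(t, params)))
    return len(points) / (1 << sum(m for _, m in params))

def weight_direct(terms, params):
    # Reference computation: sum W(sigma) over satisfying assignments.
    total = 0.0
    for sigma in itertools.product([0, 1], repeat=len(params)):
        if any(all(sigma[i] == v for i, v in t.items()) for t in terms):
            w = 1.0
            for i, (k, m) in enumerate(params):
                p = k / (1 << m)
                w *= p if sigma[i] else 1 - p
            total += w
    return total
```

The set union handles overlapping terms exactly, which is why \(F_0\) (rather than a sum of box volumes) is the right quantity.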

    Multidimensional Dyadic Arithmetic Progressions.

We will now generalize Theorem 8 to handle arithmetic progressions instead of ranges. Let \([a, b, c]\) represent the arithmetic progression with common difference c in the range \([a, b]\) , i.e., \(a, a+c, a+2c, \ldots , a+ic\) , where i is the largest integer such that \(a+ic \le b\) . Here, we consider d-dimensional arithmetic progressions \(R = [a_1, b_1, c_1] \times \cdots \times [a_d, b_d, c_d]\) where each \(c_i\) is a power of two. We first observe that the set represented by \([a, b, 2^\ell ]\) can be expressed as a DNF formula as follows: Let \(\varphi\) be the DNF formula representing the range \([a, b]\) , and let \(\psi\) be the term that fixes the \(\ell\) least significant bits to those of a. Now the formula representing the arithmetic progression \([a, b, 2^\ell ]\) is \(\varphi \wedge \psi\) , which can be converted to a DNF formula of size \(O(n)\) . Thus, the multi-dimensional arithmetic progression R can be represented as a DNF formula of size \(O(n)^d\) . The time and space required to convert R into a DNF formula are as before, i.e., \(O(n^d)\) time and \(O(nd)\) space. This leads us to the following corollary.
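The set identity underlying this construction — the progression is the range \([a,b]\) intersected with the assignments fixing the \(\ell\) low bits to those of a — can be sanity-checked directly (helper name is ours):

```python
def ap_set(a, b, l):
    # [a, b, 2^l]: members of [a, b] whose l least-significant bits
    # agree with those of a (the role played by the term psi above).
    return {x for x in range(a, b + 1)
            if ((x ^ a) & ((1 << l) - 1)) == 0}
```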
    Corollary 1.
There is a streaming algorithm to compute an \((\epsilon , \delta)\) approximation of \(F_0\) over d-dimensional arithmetic progressions, whose common differences are powers of two, that takes space \(O(nd/\varepsilon ^2\cdot \log 1/\delta)\) and processing time \(O((nd)^4\cdot n^d \cdot \frac{1}{\varepsilon ^2}\cdot \log (1/\delta))\) per item.

    Affine Spaces

Another example of a structured stream is one where each item of the stream is an affine space represented by \(Ax = B\) , where A is a Boolean matrix and B is a zero-one vector. Without loss of generality, we may assume that A is an \(n \times n\) matrix. Thus, an affine stream consists of \(\langle A_1, B_1\rangle , \langle A_2, B_2\rangle , \ldots\) , where each \(\langle A_i, B_i\rangle\) succinctly represents the set \(\lbrace x \in \lbrace 0,1 \rbrace ^n\mid A_ix= B_i\rbrace\) .
    For an \(n \times n\) Boolean matrix A and a zero-one vector B, let \(\mathsf {Sol}(\langle A, B\rangle)\) denote the set of all x that satisfy \(Ax = B\) .
    Proposition 5.
    Given \((A, B)\) , \(h \in \mathcal {H}_{\mathsf {Toeplitz}}(n,3n)\) , and t as input, there is an algorithm, \(\mathsf {AffineFindMin}\) , that returns a set, \(\mathcal {B} \subseteq h(\mathsf {Sol}(\langle A, B\rangle))\) so that if \(|h(\mathsf {Sol}(\langle A, B\rangle))| \le t\) , then \(\mathcal {B} =h(\mathsf {Sol}(\langle A, B\rangle))\) , otherwise \(\mathcal {B}\) is the t lexicographically minimum elements of \(h(\mathsf {Sol}(\langle A, B\rangle))\) . Time taken by this algorithm is \(O(n^4t)\) and the space taken by the algorithm is \(O(tn)\) .
    Proof.
Let D be the matrix that specifies the hash function h. Let \(\mathcal {C} = \lbrace Dx~|~ Ax =B\rbrace\) ; the goal is to compute the t smallest elements of \(\mathcal {C}\) . Note that if \(y \in \mathcal {C}\) , then it must be the case that \((D|A)x = y|B\) for some x, where \(D|A\) is the matrix obtained by appending the rows of A to the rows of D (at the end), and \(y|B\) is the vector obtained by appending B to y. Note that \(D|A\) is a matrix with 4n rows. Now the proof is very similar to the proof of Proposition 2: we can do a prefix search as before, which involves Gaussian elimination using submatrices of \(D|A\) .□
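The semantics of \(\mathsf {AffineFindMin}\) can be illustrated by a brute-force Python stand-in (the actual algorithm avoids enumerating \(\lbrace 0,1\rbrace ^n\) via the prefix search with Gaussian elimination described above; numeric order on hash values stands in for lexicographic order on bit strings, and all names are ours):

```python
import heapq

def gf2_matvec(rows, x):
    # Bit i of the output is the GF(2) inner product of rows[i] and x.
    y = 0
    for i, r in enumerate(rows):
        y |= (bin(r & x).count("1") & 1) << i
    return y

def affine_find_min(a_rows, b_vec, d_rows, t, n):
    # Brute-force stand-in for AffineFindMin: enumerate {0,1}^n, keep the
    # solutions of Ax = B, and return the t smallest hash values under D.
    sols = [x for x in range(1 << n) if gf2_matvec(a_rows, x) == b_vec]
    return heapq.nsmallest(t, {gf2_matvec(d_rows, x) for x in sols})
```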
    Theorem 9.
There is a streaming algorithm that computes an \((\epsilon , \delta)\) approximation of \(F_0\) over affine spaces. This algorithm takes space \(O(\frac{n}{\epsilon ^2}\cdot \log (1/\delta))\) and processing time \(O(n^4\cdot \frac{1}{\epsilon ^2}\cdot \log (1/\delta))\) per item.

    7 Conclusion and Future Outlook

    To summarize, our investigation led to a diverse set of results that unify over two decades of work in model counting and \(F_0\) estimation. We believe that the viewpoint presented in this work has the potential to spur several new interesting research directions. We sketch some of these directions below:
Faster Model Counting Algorithms. In this article, we considered three \(F_0\) estimation algorithms and showed that they can be transformed into model counting algorithms. An exciting research direction is to explore the possibility of transforming other \(F_0\) estimation algorithms into model counting algorithms. For example, can we transform the optimal (in terms of space and time) \(F_0\) estimation algorithm due to Kane, Nelson, and Woodruff [42] into a model counting algorithm with better runtime than the currently known algorithms? Another \(F_0\) estimation algorithm that is of interest is the HyperLogLog algorithm. In practice, HyperLogLog has been shown to be one of the most efficient \(F_0\) estimation algorithms. Thus, transforming this algorithm could potentially yield a new model counting algorithm that is faster in practice.
Higher Moments. There has been a long line of work on the estimation of higher moments, i.e., \(F_k\) , in the streaming context. A natural direction of future research is to adapt the notion of \(F_k\) to the context of CSPs. For example, in the context of DNF, one can view \(F_1\) as simply the sum of the sizes of the terms, but the relevance and potential applications of higher moments such as \(F_2\) in the context of CSPs remain to be understood. Given the similarity of the core algorithmic frameworks for higher moments, we expect the framework and recipe presented in this article to extend to algorithms for higher moments in the context of CSPs.
Sparse XORs. In the context of model counting, the performance of the underlying SAT solvers strongly depends on the size of the XORs. The standard constructions of \(\mathcal {H}_{\mathsf {Toeplitz}}\) and \(\mathcal {H}_{\mathsf {xor}}\) lead to XORs of size \(\Theta (n/2)\) , and an interesting line of research has focused on the design of sparse XOR-based hash functions [2, 5, 27, 36, 39], culminating in showing that one can use hash functions of the form \(h(x) = Ax+b\) wherein each entry of the mth row of A is 1 with probability \(\mathcal {O}(\frac{\log m}{m})\) [48]. Such XORs were shown to improve runtime efficiency. In this context, a natural direction would be to explore the usage of sparse XORs in the context of \(F_0\) estimation.

    Acknowledgements

    We thank the anonymous PODS 21 and TODS reviewers for their valuable comments. We are grateful to Phokion Kolaitis for suggesting exploration beyond the transformation recipe that led to results in Section 3.5.

    Footnotes

    1
    We ignore \(O(\log {1 \over \delta })\) factor in this discussion.
    2
    Please refer to Remark 2 in Section 6 for a discussion on the earlier work on multidimensional ranges [65].
    3
Note that if \(k=1\) , then \(\log (n/\varepsilon)\) bits suffice, as the site can solve the problem on its own and send to the coordinator the binary encoding of a \((1+\varepsilon)\) -approximation of \(F_0\) .

    References

    [1]
    Ralph Abboud, Ismail Ilkan Ceylan, and Thomas Lukasiewicz. 2019. Learning to reason: Leveraging neural networks for approximate DNF counting. arXiv:1904.02688. Retrieved from https://arxiv.org/abs/1904.02688.
    [2]
    Dimitris Achlioptas and Panos Theodoropoulos. 2017. Probabilistic model counting with short XORs. In Proceedings of the SAT. Springer, 3–19.
    [3]
    Noga Alon, Yossi Matias, and Mario Szegedy. 1999. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences 58, 1 (1999), 137–147.
    [4]
    Chrisil Arackaparambil, Joshua Brody, and Amit Chakrabarti. 2009. Functional monitoring without monotonicity. In Proceedings of the International Colloquium on Automata, Languages, and Programming. Springer, 95–106.
    [5]
    Megasthenis Asteris and Alexandros G. Dimakis. 2016. LDPC Codes for Discrete Integration. Technical Report. UT Austin.
    [6]
    Brian Babcock and Chris Olston. 2003. Distributed top-k monitoring. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. 28–39.
    [7]
Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. 2002. Counting distinct elements in a data stream. In Proceedings of the Randomization and Approximation Techniques, 6th International Workshop, RANDOM. José D. P. Rolim and Salil P. Vadhan (Eds.), Lecture Notes in Computer Science, Vol. 2483, Springer, 1–10.
    [8]
    Ziv Bar-Yossef, Ravi Kumar, and D. Sivakumar. 2002. Reductions in streaming algorithms, with an application to counting triangles in graphs. In Proceedings of the SODA. ACM/SIAM, 623–632.
    [9]
    M. Bellare, O. Goldreich, and E. Petrank. 2000. Uniform generation of NP-witnesses using an NP-oracle. Information and Computation 163, 2 (2000), 510–526.
    [10]
Kevin S. Beyer, Peter J. Haas, Berthold Reinwald, Yannis Sismanis, and Rainer Gemulla. 2007. On synopses for distinct-value estimation under multiset operations. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Chee Yong Chan, Beng Chin Ooi, and Aoying Zhou (Eds.), ACM, 199–210.
    [11]
    Vladimir Braverman and Rafail Ostrovsky. 2010. Recursive sketching for frequency moments. arXiv:1011.2571. Retrieved from https://arxiv.org/abs/1011.2571.
    [12]
    J. Lawrence Carter and Mark N. Wegman. 1977. Universal classes of hash functions. In Proceedings of the 9th Annual ACM Symposium on Theory of Computing. ACM, 106–112.
    [13]
    Supratik Chakraborty, Dror Fried, Kuldeep S. Meel, and Moshe Y. Vardi. 2015. From weighted to unweighted model counting. In Proceedings of the AAAI. 689–695.
    [14]
    S. Chakraborty, K. S. Meel, and M. Y. Vardi. 2013. A scalable and nearly uniform generator of SAT witnesses. In Proceedings of the CAV. 608–623.
    [15]
    S. Chakraborty, K. S. Meel, and M. Y. Vardi. 2013. A scalable approximate model counter. In Proceedings of the CP. 200–216.
    [16]
    S. Chakraborty, K. S. Meel, and M. Y. Vardi. 2016. Algorithmic improvements in approximate counting for probabilistic inference: From linear to logarithmic SAT calls. In Proceedings of the IJCAI.
    [17]
    Moses Charikar, Kevin C. Chen, and Martin Farach-Colton. 2004. Finding frequent items in data streams. Theoretical Computer Science 312, 1 (2004), 3–15.
    [18]
    Dmitry Chistikov, Rayna Dimitrova, and Rupak Majumdar. 2015. Approximate counting in SMT and value estimation for probabilistic programs. In Proceedings of the TACAS. Springer, 320–334.
    [19]
    Jeffrey Considine, Feifei Li, George Kollios, and John W. Byers. 2004. Approximate aggregation techniques for sensor databases. In Proceedings of the ICDE. Z. Meral Özsoyoglu and Stanley B. Zdonik (Eds.), IEEE Computer Society, 449–460.
    [20]
Graham Cormode and Donatella Firmani. 2014. A unifying framework for \(\ell _0\) -sampling algorithms. Distributed and Parallel Databases 32, 3 (2014), 315–335.
    [21]
    Graham Cormode, Minos Garofalakis, Shanmugavelayutham Muthukrishnan, and Rajeev Rastogi. 2005. Holistic aggregates in a networked world: Distributed tracking of approximate quantiles. In Proceedings of the SIGMOD. 25–36.
    [22]
Graham Cormode and S. Muthukrishnan. 2003. Estimating dominance norms of multiple data streams. In Proceedings of the ESA. Giuseppe Di Battista and Uri Zwick (Eds.), Lecture Notes in Computer Science, Vol. 2832, Springer, 148–160.
    [23]
    Graham Cormode, Shanmugavelayutham Muthukrishnan, and Ke Yi. 2011. Algorithms for distributed functional monitoring. ACM Transactions on Algorithms 7, 2 (2011), 1–20.
    [24]
    Graham Cormode, Shanmugavelayutham Muthukrishnan, Ke Yi, and Qin Zhang. 2012. Continuous sampling from distributed streams. Journal of the ACM 59, 2 (2012), 1–25.
    [25]
    P. Dagum, R. Karp, M. Luby, and S. Ross. 2000. An optimal algorithm for Monte Carlo estimation. SIAM Journal on Computing 29, 5 (2000), 1484–1496.
    [26]
    Stefano Ermon, Carla P. Gomes, Ashish Sabharwal, and Bart Selman. 2013. Taming the curse of dimensionality: Discrete integration by hashing and optimization. In Proceedings of the ICML. 334–342.
    [27]
    S. Ermon, C. P. Gomes, A. Sabharwal, and B. Selman. 2014. Low-density parity constraints for hashing-based discrete integration. In Proceedings of the ICML. 271–279.
    [28]
    Weiming Feng, Thomas P. Hayes, and Yitong Yin. 2018. Distributed symmetry breaking in sampling (optimal distributed randomly coloring with fewer colors). arXiv:1802.06953. Retrieved from https://arxiv.org/abs/1802.06953.
    [29]
    Weiming Feng, Yuxin Sun, and Yitong Yin. 2020. What can be sampled locally? Distributed Computing 33, 3 (2020), 227–253.
    [30]
    Weiming Feng and Yitong Yin. 2018. On local distributed sampling and counting. In Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing. 189–198.
    [31]
    Manuela Fischer and Mohsen Ghaffari. 2018. A simple parallel and distributed sampling technique: Local glauber dynamics. In Proceedings of the 32nd International Symposium on Distributed Computing.
    [32]
    Philippe Flajolet and G. Nigel Martin. 1985. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences 31, 2 (1985), 182–209.
    [33]
    Gereon Frahling, Piotr Indyk, and Christian Sohler. 2008. Sampling in dynamic data streams and applications. International Journal of Computational Geometry and Applications 18, 1/2 (2008), 3–28.
    [34]
    Phillip B. Gibbons and Srikanta Tirthapura. 2001. Estimating simple functions on the union of data streams. In Proceedings of the SPAA. Arnold L. Rosenberg (Ed.), ACM, 281–291.
    [35]
    C. P. Gomes, A. Sabharwal, and B. Selman. 2007. Near-uniform sampling of combinatorial spaces using XOR constraints. In Proceedings of the NIPS. 670–676.
    [36]
    Carla P. Gomes, Joerg Hoffmann, Ashish Sabharwal, and Bart Selman. 2007. From sampling to model counting. In Proceedings of the IJCAI. 2293–2299.
    [37]
    Zengfeng Huang, Ke Yi, and Qin Zhang. 2012. Randomized algorithms for tracking distributed count, frequencies, and ranks. In Proceedings of the PODS. 295–306.
    [38]
    Piotr Indyk and David P. Woodruff. 2005. Optimal approximations of the frequency moments of data streams. In Proceedings of the STOC. ACM, 202–208.
    [39]
    Alexander Ivrii, Sharad Malik, Kuldeep S. Meel, and Moshe Y. Vardi. 2016. On computing minimal independent support and its applications to sampling and counting. Constraints 21, 1 (2016), 41–58.
    [40]
    M. R. Jerrum, L. G. Valiant, and V. V. Vazirani. 1986. Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science 43, 2-3 (1986), 169–188.
    [41]
    Hossein Jowhari, Mert Saglam, and Gábor Tardos. 2011. Tight bounds for Lp samplers, finding duplicates in streams, and related problems. In Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2011. Maurizio Lenzerini and Thomas Schwentick (Eds.), ACM, 49–58.
    [42]
    Daniel M. Kane, Jelani Nelson, and David P. Woodruff. 2010. An optimal algorithm for the distinct elements problem. In Proceedings of the PODS. ACM, 41–52.
    [43]
    R. M. Karp and M. Luby. 1983. Monte-Carlo algorithms for enumeration and reliability problems. In Proceedings of the 24th Annual Symposium on Foundations of Computer Science (FOCS). 55–64.
    [44]
    Richard M. Karp, Michael Luby, and Neal Madras. 1989. Monte-Carlo approximation algorithms for enumeration problems. Journal of Algorithms 10, 3 (1989), 429–448.
    [45]
    Ram Keralapura, Graham Cormode, and Jeyashankher Ramamirtham. 2006. Communication-efficient distributed monitoring of thresholded counts. In Proceedings of the SIGMOD. 289–300.
    [46]
    Daniel Keren, Izchak Sharfman, Assaf Schuster, and Avishay Livne. 2011. Shape sensitive geometric monitoring. IEEE Transactions on Knowledge and Data Engineering 24, 8 (2011), 1520–1535.
    [47]
    Amit Manjhi, Vladislav Shkapenyuk, Kedar Dhamdhere, and Christopher Olston. 2005. Finding (recently) frequent items in distributed data streams. In Proceedings of the ICDE. IEEE, 767–778.
    [48]
    Kuldeep S. Meel and S. Akshay. 2020. Sparse hashing for scalable approximate model counting: Theory and practice. In Proceedings of the LICS.
    [49]
    Kuldeep S. Meel, Aditya A. Shrotri, and Moshe Y. Vardi. 2017. On hashing-based approaches to approximate DNF-counting. In Proceedings of the FSTTCS.
    [50]
    Kuldeep S. Meel, Aditya A. Shrotri, and Moshe Y. Vardi. 2019. Not all FPRASs are equal: Demystifying FPRASs for DNF-counting. Constraints 24, 3–4 (2019), 211–233.
    [51]
    Kuldeep S. Meel, Aditya A. Shrotri, and Moshe Y. Vardi. 2019. Not all FPRASs are equal: Demystifying FPRASs for DNF-counting (extended abstract). In Proceedings of the IJCAI.
    [52]
    Morteza Monemizadeh and David P. Woodruff. 2010. 1-pass relative-error \(L_p\)-sampling with applications. In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010. Moses Charikar (Ed.), SIAM, 1143–1160.
    [53]
    A. Pavan and Srikanta Tirthapura. 2007. Range-efficient counting of distinct elements in a massive data stream. SIAM Journal on Computing 37, 2 (2007), 359–379.
    [54]
    A. Pavan, N. V. Vinodchandran, Arnab Bhattacharya, and Kuldeep S. Meel. 2021. Model counting meets \(F_0\) estimation. In PODS’21: Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Virtual Event. Leonid Libkin, Reinhard Pichler, and Paolo Guagliardo (Eds.), ACM, 299–311.
    [55]
    Christopher Ré and Dan Suciu. 2008. Approximate lineage for probabilistic databases. Proceedings of the VLDB Endowment 1, 1 (2008), 797–808.
    [56]
    Pierre Senellart. 2018. Provenance and probabilities in relational databases. ACM SIGMOD Record 46, 4 (2018), 5–15.
    [57]
    Pierre Senellart. 2019. Provenance in databases: Principles and applications. In Proceedings of the Reasoning Web. Explainable Artificial Intelligence. Springer, 104–109.
    [58]
    Izchak Sharfman, Assaf Schuster, and Daniel Keren. 2010. A geometric approach to monitoring threshold functions over distributed data streams. In Proceedings of the Ubiquitous Knowledge Discovery. Springer, 163–186.
    [59]
    Mate Soos, Stephan Gocht, and Kuldeep S. Meel. 2020. Tinted, detached, and lazy CNF-XOR solving and its applications to counting and sampling. In Proceedings of the International Conference on Computer-Aided Verification (CAV).
    [60]
    Mate Soos and Kuldeep S. Meel. 2019. BIRD: Engineering an efficient CNF-XOR SAT solver and its applications to approximate model counting. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).
    [61]
    M. Soos, K. Nohl, and C. Castelluccia. 2009. Extending SAT solvers to cryptographic problems. In Proceedings of the SAT. Springer-Verlag, 14.
    [62]
    L. Stockmeyer. 1983. The complexity of approximate counting. In Proceedings of the STOC. 118–126.
    [63]
    He Sun and Chung Keung Poon. 2009. Two improved range-efficient algorithms for \(F_0\) estimation. Theoretical Computer Science 410, 11 (2009), 1073–1080.
    [64]
    Kuldeep S. Meel, N. V. Vinodchandran, and Sourav Chakraborty. 2021. Estimating size of union of sets in streaming model. In Proceedings of the PODS 2021.
    [65]
    Srikanta Tirthapura and David P. Woodruff. 2012. Rectangle-efficient aggregation in spatial data streams. In Proceedings of the PODS. ACM, 283–294.
    [66]
    L. G. Valiant. 1979. The complexity of enumeration and reliability problems. SIAM Journal on Computing 8, 3 (1979), 410–421.
    [67]
    David P. Woodruff and Qin Zhang. 2012. Tight bounds for distributed functional monitoring. In Proceedings of the 44th Annual ACM Symposium on Theory of Computing. 941–960.
    [68]
    David P. Woodruff and Qin Zhang. 2017. When distributed computation is communication expensive. Distributed Computing 30, 5 (2017), 309–323.
    [69]
    Lin Xu, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. 2008. SATzilla: Portfolio-based algorithm selection for SAT. Journal of Artificial Intelligence Research 32 (2008), 565–606.
    [70]
    Ke Yi and Qin Zhang. 2013. Optimal tracking of distributed heavy hitters and quantiles. Algorithmica 65, 1 (2013), 206–223.

    Published In

    ACM Transactions on Database Systems  Volume 48, Issue 3
    September 2023
    108 pages
    ISSN:0362-5915
    EISSN:1557-4644
    DOI:10.1145/3615349

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 09 August 2023
    Online AM: 20 June 2023
    Accepted: 28 April 2023
    Revised: 24 November 2022
    Received: 09 February 2022
    Published in TODS Volume 48, Issue 3

    Author Tags

    1. Model counting
    2. streaming algorithms
    3. F0-computation
    4. DNF counting

    Qualifiers

    • Research-article

    Funding Sources

    • NSF
    • National Research Foundation Singapore
    • Amazon Research Award
    • NUS ODPRT Grant
    • NSF awards
