vRAM: Faster Verifiable RAM With

Program-Independent Preprocessing
Yupeng Zhang∗, Daniel Genkin†,∗, Jonathan Katz∗, Dimitrios Papadopoulos‡,∗ and Charalampos Papamanthou∗
∗University of Maryland   †University of Pennsylvania   ‡Hong Kong University of Science and Technology

Email: {zhangyp,cpap}@umd.edu, danielg3@cis.upenn.edu, jkatz@cs.umd.edu, dipapado@cse.ust.hk

Abstract—We study the problem of verifiable computation (VC) for RAM programs, where a computationally weak verifier outsources the execution of a program to a powerful (but untrusted) prover. Existing efficient implementations of VC protocols require an expensive preprocessing phase that binds the parties to a single circuit. (While there are schemes that avoid preprocessing entirely, their performance remains significantly worse than constructions with preprocessing.) Thus, a prover and verifier are forced to choose between two approaches: (1) Allow verification of arbitrary RAM programs, at the expense of efficiency, by preprocessing a universal circuit which can handle all possible instructions during each CPU cycle; or (2) Sacrifice expressiveness by preprocessing an efficient circuit which is tailored to the verification of a single specific RAM program.

We present vRAM, a VC system for RAM programs that avoids both the above drawbacks by having a preprocessing phase that is entirely circuit-independent (other than an upper bound on the circuit size). During the proving phase, once the program to be verified and its inputs are chosen, the circuit-independence of our construction allows the parties to use a smaller circuit tailored to verifying the specific program on the chosen inputs, i.e., without needing to encode all possible instructions in each cycle. Moreover, our construction is the first with asymptotically optimal prover overhead; i.e., the work of the prover is a constant multiplicative factor of the time to execute the program.

Our experimental evaluation demonstrates that vRAM reduces the prover's memory consumption by 55–110× and its running time by 9–30× compared to existing schemes with universal preprocessing. This allows us to scale to RAM computations with more than 2 million CPU cycles, a 65× improvement compared to the state of the art. Finally, vRAM has performance comparable to (and sometimes better than) the best existing scheme with program-specific preprocessing, despite the fact that the latter can deploy program-specific optimizations (and has to pay a separate preprocessing cost for every new program).

I. INTRODUCTION

Protocols for verifiable computation (VC) allow a computationally weak verifier to outsource the execution of a program to a powerful but untrusted prover (e.g., a cloud provider) while being assured that the result was computed correctly. Somewhat more formally, a verifier V and prover P agree on a function f and an input x. The prover then sends a result y to the verifier, together with a proof that y = f(x). There is a long line of work constructing VC protocols for arbitrary computations, the most prominent of which rely on succinct non-interactive arguments of knowledge (SNARKs) [12], [25]. This has resulted in several implemented systems; see Section I-C for an overview. While VC protocols without preprocessing have been recently implemented [7], efficient VC implementations still rely on a preprocessing phase during which a trusted party (possibly the verifier) generates a set of public parameters corresponding to a specific circuit for the function f. Furthermore, this preprocessing phase is orders of magnitude slower than evaluating f itself.

Verifying RAM Programs. While circuits can model arbitrary programs, most real-world computations are expressed in terms of random-access memory (RAM) machines. This is true both in terms of most programmers' mental model of computation, as well as in terms of the execution of assembly code on general-purpose computers. However, since most constructions of VC protocols work on computations expressed as arithmetic circuits, verification of a RAM program P is usually done by verifying the correct evaluation of an arithmetic circuit CP that corresponds to the next-instruction function of the RAM program while checking consistency of memory, etc. As stated above, most VC implementations require the circuit to be fixed ahead of time, during a trusted preprocessing phase. Due to this, previous works for verifying RAM programs can be roughly divided into two main categories.

1) Program-Specific Preprocessing. If the program P to be verified is known ahead of time, it is possible to tailor the circuit CP so as to verify P as efficiently as possible. While this tailoring is beneficial to the protocol's overall performance, it comes at the expense of usability since CP cannot be used to verify another program P′. Examples of this approach are Pantry [17] and Buffet [50].

2) Universal Preprocessing. In case the RAM program to be verified is not known ahead of time, it is possible to construct a universal circuit CRAM which is capable of verifying any RAM program that runs for at most T steps. Examples of this approach include [9], [11].

Both these approaches have significant drawbacks. In the first case, the verifier cannot change the RAM program P being verified without re-running the (expensive) preprocessing phase. This is a major drawback, as the preprocessing cost can only be amortized by running the same program on different inputs. In the second case, although the preprocessing cost can be amortized over the evaluation of different programs on different inputs, the universal preprocessing used in this approach imposes large concrete overheads during the proving phase. This results from the fact that CRAM must be able to emulate all possible operations at every CPU step in order to handle arbitrary RAM programs. In contrast, the program-specific approach benefits from the fact that P is known when CP is chosen, and so the set of possible instructions at each step is
potentially much smaller.

Two notable exceptions to the above are the works of [7], [10], which do not need a preprocessing phase tied to a specific circuit. However, the concrete cost of these systems remains significantly higher than that of the preprocessing-based solutions mentioned above. (See Section V-C.)

Thus, in this paper we ask the following question:

Is it possible to construct a VC protocol for RAM programs which has similar (or even better) performance than what is achievable with program-specific preprocessing, but without knowing the program during the preprocessing phase?

A. Our Results

We answer the above question in the affirmative by presenting vRAM, a VC protocol for RAM programs that improves on the performance of previous works both concretely and asymptotically. In particular, our system achieves performance similar to (and often better than) state-of-the-art systems with program-specific preprocessing, but without requiring the RAM program to be fixed during the preprocessing phase. This allows a single execution of the preprocessing phase to be used for verifying arbitrary RAM programs (running for some bounded number of steps) afterwards.

Our starting point is vSQL [52], a system for verifying SQL queries on outsourced databases. While not presented as such, vSQL can be viewed as a VC scheme that has a preprocessing phase that does not depend on a specific circuit beyond an upper bound on the circuit's input size. vRAM relies on two novel improvements to vSQL:

1) Extending Expressiveness. The vSQL protocol is only efficient for a specific type of circuits which correspond to SQL queries. We improve expressiveness by extending the class of circuits it can efficiently handle. We then show that the resulting protocol is an argument of knowledge (cf. Definition 1) with circuit-independent preprocessing.

2) New RAM Reduction. Exploiting circuit-independent preprocessing, we devise a new RAM-to-circuit reduction that reduces the concrete size of the circuit to be verified. In more detail: circuit-independent preprocessing allows the prover to construct "on the fly," during the proving phase, a circuit that is optimized for a specific input. Thus, for each step of the RAM program, the produced circuit checks only the instruction that is actually executed for the given input.

A Linear-Time Prover. vRAM is the first verifiable RAM protocol with asymptotically optimal prover overhead. In particular, for a RAM program P of size ℓ running for T steps, the prover time in vRAM is O(T + ℓ) (asymptotically the same as simply executing P), whereas previous approaches required time O((ℓ + T)·polylog(T + ℓ)).

Experimental Evaluation. We provide an experimental evaluation of vRAM's performance and compare vRAM with state-of-the-art implementations in both the program-specific [50] and universal [11] preprocessing settings (cf. Section V). When verifying RAM programs not known during the preprocessing phase, we improve the prover's running time by 9–30× as compared to prior work [11]. On the other hand, compared to systems using program-specific preprocessing [50], vRAM achieves very similar prover performance; in fact, in some cases our prover is faster despite the fact that systems with program-specific preprocessing can deploy program-specific optimizations during the preprocessing phase.

We also show that vRAM is much better in terms of memory consumption, which is currently the main bottleneck for running large instances of verifiable computation. vRAM achieves an improvement of 55–110× in terms of memory consumption compared to [11], which allows us to prove computations involving more than 2 million CPU cycles with 256GB of memory (65× more than [11]). The improvements achieved by vRAM come at the cost of increased verifier running time and proof size; however, these still remain well within the capabilities of modern machines. In Section V-B we discuss the practical limitations of our approach and provide estimates for instances where VC can be applicable.

Architecture Independence. Another advantage of vRAM's circuit-independent preprocessing is that it can use information obtained after executing the computation to optimize the RAM architecture to be used for its verification. Any parameter of the architecture (number of registers, register width, instruction set, etc.) can be tweaked so as to reduce the size of the produced circuit to be verified. Finally, our construction naturally supports both arithmetic circuits and RAM programs with a single preprocessing phase, allowing the parties to selectively choose the optimal representation for a particular program. Thus, if the computation has a "nice" arithmetic circuit representation, one may avoid the RAM architecture entirely. These features can result in further performance improvements, as we demonstrate in Section V-D.

B. Overview of Our Techniques

As mentioned earlier, our starting point is vSQL [52], which can be shown to be a VC scheme that efficiently handles circuits that mostly consist of parallel copies of a single sub-circuit. While circuits that are constructed from SQL queries typically have this structure, this is not the case for circuits constructed via our RAM-to-circuit encoding, since each program step can perform a different instruction, thus resulting in a different sub-circuit.

Before describing our results and addressing this issue, we briefly review vSQL. At a high level, vSQL combines the interactive proof of [19], [45], [46] (these are based on [28], and we refer to all of these variants as the CMT protocol in the paper) with an extractable verifiable polynomial delegation (VPD) protocol [39]. The CMT protocol can be used to verify the correct evaluation of a circuit C on an input x, assuming that in the final step the verifier can evaluate a specific polynomial px that depends only on x (and not on C). The latter is done using a VPD protocol, which is the only part of the construction that requires preprocessing. The VPD protocol is extractable, which guarantees that the prover "knows" an input that makes the circuit evaluate to the specified output; this can be used to support NP computations that use auxiliary input (provided by the prover).
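In vSQL, the polynomial px that the verifier must evaluate is the multilinear extension of the input x, which can be evaluated at any point with a linear number of field operations. The following sketch is ours, not code from vSQL: the modulus P (a toy Mersenne prime standing in for the scheme's pairing-friendly field) and the function name are illustrative choices.

```python
P = 2**61 - 1  # toy prime field, a stand-in for the scheme's actual field


def mle_eval(x, r):
    """Evaluate the multilinear extension of vector x (length 2^s, indexed
    by s-bit strings) at a point r in F^s, using O(len(x)) field operations.

    Folds out one variable per pass:
        table[j] = (1 - r_i) * table[2j] + r_i * table[2j + 1]
    (so r[0] corresponds to the low-order index bit).  At boolean points
    the result agrees with the entries of x, as an extension must."""
    table = [v % P for v in x]
    for r_i in r:
        table = [((1 - r_i) * table[2 * j] + r_i * table[2 * j + 1]) % P
                 for j in range(len(table) // 2)]
    return table[0]
```

For a single variable this reduces to ordinary linear interpolation between the two entries, which makes the folding rule easy to sanity-check by hand.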
Improving the CMT Protocol. Although the original CMT protocol [19] can handle arbitrary arithmetic circuits, it is especially efficient for highly regular circuits and in particular circuits that consist of parallel copies of identical sub-circuits [45]. In Section IV-A, we show how to modify the CMT protocol to efficiently handle circuits that consist of (non-interconnected) parallel copies of different sub-circuits. At a high level, we achieve this by refactoring the recursive equation used for the CMT protocol, adding an additional variable that corresponds to the position of a sub-circuit in the larger circuit. In this way, we can efficiently handle varying wiring patterns across different sub-circuits. This improvement is crucial for the concrete efficiency of our VC system for RAM programs (since each program step may perform a different instruction, resulting in entirely different sub-circuits), and may be of independent interest since it expands the type of computations that are efficiently supported by the CMT protocol.

Improving the VPD Protocol. We also present a more efficient version of the VPD protocol used by vSQL [52]. We do this by augmenting the selectively secure scheme of Papamanthou et al. [39] to make it both adaptively secure and extractable (see Section IV-B), by including additional terms in the proof. As compared to the VPD scheme used in vSQL, this reduces the prover time for multilinear polynomials (i.e., multivariate polynomials of degree 1 in each variable) from quasi-linear to linear in the number of monomials, and improves concrete efficiency by 2–4×.

New RAM Reduction. Previous RAM-to-circuit reductions rely on a circuit CRAM that can handle any possible instruction at any given CPU step. The circuit CRAM is composed of T copies of a smaller circuit Cexe that can verify all possible instructions (where the i-th copy of Cexe verifies the i-th RAM step, and T is a bound on the total number of steps). As mentioned above, this approach "wastes resources," as eventually only one instruction will be executed at each step. In existing constructions that handle arbitrary programs this waste is unavoidable, since CRAM must be fixed during the preprocessing phase, before it is known which instruction will occur at each step. However, since our argument system has circuit-independent preprocessing, we can generate the circuit during the proving phase, after the prover executes program P on input x. This allows us to replace CRAM with a circuit CP which is constructed on-the-fly by the prover and is optimized for the execution of P on x. In particular, we "customize" the i-th copy of Cexe to only contain the gates needed to verify the specific instruction executed during the i-th step of P on input x. While this significantly reduces the size of the produced circuit, it raises a subtle issue: CP no longer has a succinct representation and, in the worst case, can only be described by giving the sequence of T instructions. Applying an argument system with circuit-independent preprocessing to such a circuit results in the verifier's overhead being Ω(T) (since he must, at the very least, hold a description of the circuit), which is as large as evaluating P. In Section III, we show how this can be avoided via a new reduction in which the different copies of Cexe are not arranged by their order of execution, but are instead sorted by instruction type. The result is that CP can be described by simply listing the multiplicity of each instruction. Finally, since our protocol has public-coin verification, we can make it non-interactive in the random oracle model using the Fiat-Shamir heuristic [22].

C. Related Work

Verifiable computation was formalized in [24], [41], but research on constructing interactive protocols for verifying general-purpose computations began much earlier with the works of Kilian [33] and Micali [38]. While those works have good asymptotic performance, and follow-up works further optimized those approaches (e.g., [6], [32]), subsequent implementations revealed that the concrete costs of those approaches are prohibitively high for the prover [44].

SNARKs. The next big breakthrough in general-purpose verifiable computation came with the work of Gennaro et al. [25] (building upon earlier work by Groth [30]), which introduced quadratic arithmetic programs (QAPs) and showed that they can be used to capture the correct evaluation of an arithmetic program. QAPs have since been the de-facto tool for constructing efficient succinct arguments of knowledge (SNARKs) [12], [14] that can be used to verify arbitrary NP computations. This has led to a long line of research providing both highly-optimized systems [40], [20], [47], [42], [9], [43], [35], [18], [23] and significant protocol refinements [36], [10], [21], [29]. Our solution shares intuition with some of these works; e.g., [20] also uses the technique of verifying heterogeneous sub-computations, whereas [47] produces an arithmetic circuit adapted for the given computation. Even though all of these works use different technical approaches, one theme remains common: while the verifier's performance is generally excellent, the concrete overhead for the prover (in terms of running time, memory consumption, etc.) remains prohibitive. We refer to [51] for a detailed survey.

Verifiable RAM Computation. A series of works [9], [10], [11], [17], [50], [7] consider the problem of verifying RAM computations by reducing the verification of a RAM program to the verification of a circuit. In Section V, we compare the performance of our system with the most efficient prior works in this direction [11], [50].

II. PRELIMINARIES

Throughout this paper we use standard notation for arithmetic circuits and multilinear extensions of polynomials (see Appendix A). To simplify notation, we implicitly assume that all field operations take constant time. Thus, whenever we report asymptotic complexities we omit a factor that is polylogarithmic in the field/group size.

Bilinear Pairings and Cryptographic Assumptions. We denote the generation of the bilinear map parameters by bp = (p, G, GT, e, g) ← BilGen(1^λ), where λ is the security parameter, G, GT are two groups of order p (with p a λ-bit prime), g ∈ G is a generator, and e : G × G → GT is a bilinear map. The security of our constructions relies on the q-strong
bilinear Diffie-Hellman assumption [15] and a modified version of the q-power knowledge of exponent assumption [30], [52] (presented formally in Appendix B).

A. Argument Systems and Interactive Proofs

Argument Systems. An argument system for an NP relation R is a protocol that allows a computationally bounded prover P to convince a verifier V holding input x that "∃w such that (x; w) ∈ R." Here, we focus on arguments of knowledge, i.e., if the prover convinces the verifier then it must know w. We adopt the definition of [25], which includes a parameter-generation phase executed by a trusted party.

Definition 1. Let R be an NP relation. A tuple of algorithms (G, P, V) is an argument for R if the following holds.
• Completeness. For every (x; w) ∈ R and (pk, vk) output by G(1^λ) it holds that ⟨P(pk, w), V(vk)⟩(x) = 1.
• Knowledge Soundness. For any PPT prover P∗ there exists a PPT extractor E which runs on the same randomness as P∗ such that for any x we have Pr[(pk, vk) ← G(1^λ); w ← E(pk, x) : ⟨P∗(pk), V(vk)⟩(x) = 1 ∧ (x, w) ∉ R] ≤ neg(λ).

We say that (G, P, V) is a succinct argument system if the running time of V is poly(λ, |x|, log |w|).

Interactive Proofs. An interactive proof [27] is a protocol that allows a prover P to convince a verifier V that f(x) = y, where f, x, y are known to both parties. Here soundness is required even for an unbounded cheating prover.

Definition 2. Let λ be a statistical soundness parameter. A pair of algorithms (P, V) is an interactive proof for a function f with soundness ε(λ) if:
• Completeness. For any f, x, y such that f(x) = y it holds that Pr[⟨P, V⟩(f, x, y) = 1] = 1.
• Soundness. For any f, x, y such that f(x) ≠ y and any prover P∗ it holds that Pr[⟨P∗, V⟩(f, x, y) = 1] ≤ ε(λ).

B. The CMT Protocol

High-Level Overview. Cormode et al. [19] presented an efficient interactive proof (the CMT protocol) for arithmetic circuits. At a high level, the protocol proceeds as follows. Let C be a depth-d layered arithmetic circuit over a field F. The protocol starts by having the CMT prover Pcmt claim that the output wires have value y. Next, the CMT protocol processes C one layer at a time, from layer 0 (the output gates) to layer d (the input gates). During the i-th round, Pcmt reduces a claim about the values of C's wires at layer i to a claim about the values of C's wires in layer i + 1. The protocol terminates after d rounds with a claim about the wire values at the input layer. Since the input x is known to the CMT verifier Vcmt, it can directly check Pcmt's claim. If the check succeeds, Vcmt accepts y as the output of C(x).

Notation. Before presenting a formal description of the CMT protocol, we establish some additional notation. We denote the number of gates in the i-th layer of C by Si and we set si = ⌈log Si⌉ (so si bits suffice to identify each gate in the i-th layer). The evaluation of C on an input x assigns (in the natural way) a value from F to each gate in C based on its output wire. For each layer i of C, define the function Vi : {0,1}^si → F that takes as input a gate g ∈ {0,1}^si and outputs its value. Note that the values returned by Vd correspond to the values of the input layer of C, i.e., x. Finally, for each layer i we define functions addi, multi that we call C's wiring predicates. The function addi : {0,1}^(s_{i−1}+2s_i) → {0,1} takes as input a gate g1 from layer i − 1 and two gates g2, g3 from layer i, and outputs 1 iff g1 is an addition gate whose input wires are connected to g2 and g3. The function multi is defined similarly for multiplication gates. Notice that the value of a gate g at layer i < d can be computed as a function of the values of the gates at layer i + 1, i.e.,

Vi(g) = Σ_{u,v ∈ {0,1}^(s_{i+1})} ( add_{i+1}(g, u, v) · (V_{i+1}(u) + V_{i+1}(v)) + mult_{i+1}(g, u, v) · (V_{i+1}(u) · V_{i+1}(v)) ).

Protocol Details. One way for Vcmt to check correctness of the values at layer i is to check that Vi(g) outputs the correct value of the g-th gate for every gate g in that layer. Since Vi(·) is a summation of other values, this can be done using the sum-check protocol from Appendix C. However, the soundness guarantee of the sum-check protocol depends on the size of the underlying field. If C is defined over a small field (e.g., if C is a boolean circuit), we replace Vi with its multilinear extension Ṽi defined over a larger field F via

Ṽi(z) = Σ_{g ∈ {0,1}^(s_i), u,v ∈ {0,1}^(s_{i+1})} f_{i,z}(g, u, v)                                (1)
      := Σ_{g ∈ {0,1}^(s_i), u,v ∈ {0,1}^(s_{i+1})} β̃i(z, g) · ( ãdd_{i+1}(g, u, v) · (Ṽ_{i+1}(u) + Ṽ_{i+1}(v)) + m̃ult_{i+1}(g, u, v) · (Ṽ_{i+1}(u) · Ṽ_{i+1}(v)) )

where ãdd_i (resp., m̃ult_i) is the multilinear extension of addi (resp., multi) and β̃i is the multilinear extension of the function that takes si-bit inputs z, g and outputs 1 iff z = g.¹

Assume for simplicity that C has a single output wire. The CMT protocol begins with Pcmt claiming y = Ṽ0(0). Then Pcmt and Vcmt execute the sum-check protocol, which results in Vcmt needing to check that Ṽ0(0) = Σ_{g ∈ {0,1}^(s_0), u,v ∈ {0,1}^(s_1)} f_{0,0}(g, u, v). In turn, this requires the verifier to evaluate Ṽ1 on two random points q1, q2 ∈ F^(s_1). Since the verifier does not have the correct gate values for layer 1, it asks Pcmt to provide a1 = Ṽ1(q1) and a2 = Ṽ1(q2). We have thus reduced the claim about the value of the gate in layer 0 to the validity of two claims about the gates in layer 1. In Appendix D, we describe the way to condense these two claims into a single claim about the gates in layer 1. Proceeding in this way layer by layer, the prover and verifier end with a claim about the value of Ṽd, which can be checked directly by the verifier, who has access to the input x. In Appendix D we give a full description of the CMT protocol, and formally state its security and asymptotic performance guarantees.

C. A Canonical RAM Architecture

In this section we establish notation for a random-access machine supporting some instruction-set architecture.

1 Although using β̃ is not strictly necessary [46], we use it since it improves efficiency when C is composed of many parallel copies of a smaller circuit C′.
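To make the sum-check step concrete, the following sketch (ours, not the paper's implementation; a toy prime field replaces the pairing-friendly field F) runs the sum-check protocol for a multilinear polynomial given by its table of evaluations over the boolean hypercube. Note that in the CMT protocol the final evaluation f(r1, ..., rs) is not computed by the verifier directly as below, but is instead reduced to claims about the next layer (or, at the input layer, handled via the VPD protocol).

```python
from random import randrange

P = 2**61 - 1  # toy prime field, a stand-in for the paper's field F


def sumcheck(table, claimed_sum):
    """Sum-check for a multilinear polynomial f on {0,1}^s, given by its
    evaluation table of length 2^s.  Returns True iff the verifier accepts
    claimed_sum as the sum of f over the boolean hypercube.

    In round i the prover sends the restriction
        g_i(X) = sum over remaining boolean variables of f(r_1,...,r_{i-1},X,...),
    which is linear in X since f is multilinear; the verifier checks
    g_i(0) + g_i(1) against the running claim and replies with a challenge."""
    claim = claimed_sum % P
    while len(table) > 1:
        half = len(table) // 2
        g0 = sum(table[:half]) % P          # prover: g_i(0)
        g1 = sum(table[half:]) % P          # prover: g_i(1)
        if (g0 + g1) % P != claim:          # verifier's round check
            return False
        r = randrange(P)                    # verifier's random challenge r_i
        claim = (g0 + r * (g1 - g0)) % P    # new claim: g_i(r_i)
        # both parties fix x_i = r_i, halving the evaluation table
        table = [((1 - r) * table[j] + r * table[half + j]) % P
                 for j in range(half)]
    return claim == table[0]                # final check: f(r_1, ..., r_s)
```

A cheating claim is caught in the first round here, since g_1(0) + g_1(1) always equals the true sum; in general, soundness degrades with s · deg(f) / |F|, which is why a small-field circuit is lifted to a larger field as in Equation (1).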
S and S∗   t, pc, r1, . . . , rK, O, flag, auxiliary²
I and I∗   line number, opcode, i, j (source registers), k (target register)
A          a, t, O, b (denoting memory load or store)

TABLE I
VALUES IN A STATE AND AN INSTRUCTION.

Hardware. We focus on RAM machine computations, where the machine is parametrized by the number of registers K and the register width (word size) W. The CPU state consists of a W-bit program counter (pc) and K general-purpose, W-bit registers r1, . . . , rK. Each instruction operates over two operands (registers) and stores its result in a third register, to which we shall refer as the destination register. The machine's memory is a randomly accessible array of 2^W bytes. We also assume two read-only unidirectional tapes containing W-bit words. The first tape is used for the program input x, and the second tape may potentially be used for auxiliary input aux.

Program Execution. A program is a sequence of instructions, where each instruction has two operands (which are either register numbers or constants) and stores its result in a third register called the destination register. A random-access machine starts executing a program with all registers, its memory, and the program counter initialized to 0. At each step, the instruction pointed to by the pc is executed. By default, every instruction increments the pc by one (i.e., pointing to the next instruction), but an instruction (e.g., jump) can also modify the pc directly to facilitate arbitrary control flow. The machine's inputs are the above-mentioned tapes, accessible via special read instructions, as well as the initial contents of its memory. The machine outputs either accept or reject. We say program P accepts input (x, aux) if the machine running program P with the specified input terminates with output accept.

Machine State and Instruction Encoding. We define the notion of machine state as the values of the machine's registers pc, r1, . . . , rK at any point during the program execution. Let S1, . . . , ST be a list of the machine's states during the execution of some program P. We augment each state Si to also include i in it, referring to i as Si's step number, as well as to include an additional field Oi, referring to it as the instruction's output field. An instruction I contains information about what operation the machine should execute (e.g., addition, multiplication, etc.), the two source registers ri, rj as well as the target register rk. For a specific program (which is a sequence of instructions) P = P1, . . . , Pℓ, we augment every instruction Pi to include its location i (line number) within P. The detailed values in a state and an instruction used in our implementation are shown in Table I. We take our set of available instructions from those used by TinyRAM [9], [11]. This is an ideal starting point for our implementation as the universal circuit for the TinyRAM CPU can be described by a relatively small arithmetic circuit.

Execution Traces. The trace tr = (S1, I1, S2, I2, . . . , IT−1, ST) of a program P on inputs x, aux is a sequence of CPU states and instructions, where S1 is the initial state and each Si is produced by executing instruction Ii−1 on Si−1. A trace tr is valid for a program P on input x if there is an aux such that P(x, aux) has trace tr. Similarly, a trace tr of a program P on input x is accepting if there exists aux such that tr is valid and P(x, aux) accepts; in this case we say that P accepts input (x, aux).

A Universal NP Relation for RAM Programs. The following NP relation RAM_ℓ,n,T captures accepting RAM programs:

Definition 3. For ℓ, n, T ∈ N, relation RAM_ℓ,n,T consists of tuples (P, x; aux) such that: (i) P is a program with ≤ ℓ instructions, (ii) x is an input of ≤ n words, and (iii) P(x, aux) accepts in ≤ T steps.

D. Previous Reductions from RAM to Circuit Satisfiability

Before describing our improvements, in this section we present previous approaches for constructing a circuit that can verify the execution of RAM programs. More specifically, given a time bound T, [11] constructs a circuit C such that for any RAM program P, ∃w : C(P, x; w) = 1 if and only if ∃aux such that P(x; aux) accepts. Throughout this paper, unless otherwise noted, we do not distinguish between the program and the input data, and we let ℓ be a bound on both the program length and the input size.

The circuit C takes as input a program P and a witness w that contains a trace tr = (S1, I1, S2, I2, . . . , IT−1, ST) and aux. C then outputs 1 only if S1 is the initial state, ST is an accepting state, and the following hold at every step i in tr:

1) Correct Instruction Execution. State Si+1 is obtained from Si after executing instruction Ii.
2) Correct Instruction Fetches. Ii is the instruction in P pointed to by the program counter (pc) in Si. If i = 1 we require that pc = 0.
3) Correct Memory Accesses. If Ii is a load instruction accessing address a then the value loaded is v, where v is the last value written to address a by some previous instruction (and v = 0 if Ii is the first load from a).

In order to verify the above three conditions, the circuit C is constructed from three sub-circuits Cexe, Cmem, and Croute (cf. Figure 1(left)), which we explain below.

Ensuring Correct Instruction Execution. To ensure (1), every triple Si, Ii, Si+1 is given as input to a circuit Cexe which performs the following two checks. (a) Check that the value Oi in Si is correctly computed by executing Ii.³ In case Ii is a memory load instruction, Cexe optimistically assumes that the loaded value Oi is correct (this will be tested separately when checking memory accesses). (b) Check that Oi is equal to the destination register of Si+1 (or to pci+1 in case of jump), that all other registers of Si+1 are the same as in Si, that Si's step number is indeed i, and that pci is equal to the line number of Ii in P (as encoded in Ii).

Ensuring Correct Instruction Fetches. To ensure (2), C must check that the instruction Ii is fetched from the location in the program P pointed to by pci in state Si (i.e., that Ii is the pci-th instruction in P). In [11], this is achieved by storing P in memory and then loading instructions before they are executed. Formally, a booting sequence B1, . . . , Bℓ is

2 Auxiliary data includes data from the prover for efficient implementation purposes, i.e., bit-decompositions of the values for computation modulo 2^32 and bits denoting whether an instruction is a jump, memory store, or load.
3 If Ii is a memory instruction, Oi is the loaded or stored value; if Ii is a jump instruction, Oi is the jump destination.
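Condition (3) can be checked in linear time (after sorting) by ordering the memory operations by address, breaking ties by step number, and comparing each load against the preceding entry for the same address; this is exactly the permuted-trace idea of [11]. The following sketch is our own illustration of that check, not the paper's circuit (which must additionally verify that the permutation and timestamps supplied by the prover are themselves correct).

```python
def check_memory_consistency(ops):
    """Check condition (3) on a trace's memory operations.

    ops: list of (t, op, addr, val) with op in {'load', 'store'}, t the
    step number.  Sort by (addr, t), so that all accesses to an address
    appear consecutively in time order, then verify that every load
    returns the last value stored at that address (0 if none)."""
    last = None  # (addr, value) after the previous entry in sorted order
    for t, op, addr, val in sorted(ops, key=lambda e: (e[2], e[0])):
        if last is None or last[0] != addr:
            expected = 0                 # first access to this address
        else:
            expected = last[1]           # value carried over at this address
        if op == 'load' and val != expected:
            return False
        # a store overwrites the running value; a load carries it forward
        last = (addr, val if op == 'store' else expected)
    return True
```

Because the sorted order interleaves no other addresses between accesses to the same address, each load needs only a comparison with its immediate predecessor, which is what makes the check expressible by a small circuit applied to adjacent pairs.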
prepended to the trace tr, with Bi storing the i-th instruction of P in memory at address i. This results in a new trace tr = (B1, · · · , Bℓ, S1, I1, S2, I2, · · · , IT−1, ST) of length 2T + ℓ. Each Ii ∈ tr is then viewed as two operations: one is a load operation fetching an instruction from the memory address pointed to by its line number, and the other is Ii itself. In this way, the correctness of instruction fetches is reduced to checking the consistency of the memory stores and loads performed by the Bs and Is, which we describe next.

Ensuring Correct Memory Accesses. To ensure (3), Ben-Sasson et al. [11] include in w an additional trace tr∗ = (A1, · · · , A2T+ℓ), which is a permuted version of tr where: (a) all the states in which a memory access is performed are sorted by the memory address a being accessed (with ties broken by their step number in tr), and (b) non-memory instructions are pushed to the end of tr∗. Notice that the Bi and Ii are also sorted, using the addresses i and the line numbers, respectively. For two adjacent entries Ai, Ai+1 ∈ tr∗ with outputs Oi, Oi+1, step numbers ti, ti+1 and accessed addresses ai, ai+1, respectively, the circuit Cmem checks the following:4

• If ai = ai+1 then ti < ti+1. If Ai+1 is a load instruction, the loaded value Oi+1 is the same as the value Oi stored or loaded by Ai.
• If ai ≠ ai+1 then ai+1 > ai, and if Ai+1 is a load instruction then Oi+1 = 0.

4 In case Ai corresponds to some Bj or Ij, the value Oi loaded is the encoding of the instruction, i.e., the concatenation of the machine operation code and the source and destination registers.

Checking Consistency Between tr and tr∗. Finally, C must ensure that tr∗ is a copy of tr that contains exactly the same states and instructions, just sorted by their accessed addresses. Note that the fact that tr∗ is sorted correctly has already been checked by Cmem. Hence, it remains to ensure that a state appears in tr∗ if and only if it appears in tr. This can be done by checking that there exists a permutation π such that π(tr∗) = tr. To that end, C contains a sub-circuit Croute which implements an O(T log T) switching network that routes every entry in tr to its matching entry in tr∗. The control bits for the switching network (which specify the permutation π) are provided by the prover and included in w.

Overall Complexity. For a program of size ℓ running for T steps, the above reduction yields a circuit C of size T · |Cexe| + (2T + ℓ) · |Cmem| + |Croute|. Since Cexe and Cmem are fixed for a given architecture (i.e., they are independent of T, ℓ), and Croute can be implemented using O((T + ℓ) · log(T + ℓ)) gates, we have |C| = O((T + ℓ) · log(T + ℓ)).

III. OUR INTERACTIVE ARGUMENT FOR RAM PROGRAMS

In this section, we present our argument for verifying the correct execution of RAM programs. Similar to previous approaches [9], [10], [11], [17], [50], [7], our argument for RAM programs uses as a "back-end" an argument for verifying the correct evaluation of arithmetic circuits. We thus must somehow reduce the task of verifying RAM computation to the task of verifying the correct evaluation of arithmetic circuits. One candidate for such a reduction is the construction of [11] (described in Section II-D), which reduces the verification of a RAM program of T steps to the task of verifying the correct evaluation of an arithmetic circuit of size O(T log T). However, unlike [11], in this work we rely on an efficient argument for arithmetic circuits (explained in detail in Section IV) which is both interactive and has a circuit-independent preprocessing phase. As we show in this section, it is possible to leverage these two properties in order to achieve a "tighter" reduction than that of [11], resulting in a more efficient argument for RAM programs. More specifically, having a circuit-independent preprocessing phase allows us to produce a concretely smaller circuit where at each step the prover only proves the correct execution of the instruction that is actually executed by the RAM program on its specific inputs, as opposed to proving the correctness of a circuit evaluating all possible instructions. Next, the interactivity property allows us to replace the routing network used in [11] for checking trace consistency with an efficient interactive protocol for randomized polynomial identity testing. This reduces the prover's complexity from O((T + ℓ) log(T + ℓ)) to O(T + ℓ) and also improves the prover's concrete efficiency.

Our final circuit construction is shown in Figure 1 (right). As in Section II-D, we must check correctness of (1) instruction execution, (2) instruction fetches, and (3) memory accesses. Next we describe our implementation of these checks.

A. Ensuring Correct Instruction Execution

Let tr = (S1, I1, S2, I2, · · · , IT−1, ST). Recall that in the reduction described in Section II-D, the correct execution of tr's instructions is checked via a universal circuit Cexe which performs two sets of tests on every triple Si, Ii, Si+1 ∈ tr. The first test (a) checks the correctness of Oi (i.e., that performing Ii on Si results in Oi), while the second test (b) checks that the values from Si are consistently propagated to Si+1 (including the correct pci update and ordering of steps). Notice that while the second test is relatively simple and identical for all triples, the majority of Cexe's gates are actually required for performing the first test. This is because this part of Cexe is often implemented as a composition of smaller circuits, each of which can check the execution of a specific instruction, together with a multiplexer that specifies which instruction should be checked at this step. In order to optimize the size of Cexe, while maintaining the succinct representation of the resulting circuit C, we split Cexe into two sub-circuits which perform these two checks independently. For the second check we use the same circuit for all triples, whereas for the first one we use a circuit that can only verify the logic of the particular instruction Ii. Below, we describe in detail how these circuits are implemented.

Ensuring Correct Propagation of Values. We define a circuit Ctime that takes as input a triple Si, Ii, Si+1, and verifies that the value of the destination register in Si+1 is equal to Oi, all other registers in Si+1 remain unchanged, and pci+1 was updated appropriately.
[Figure 1: diagrams of the two circuit reductions, built from the traces tr, tr∗, tr∗∗ and the sub-circuits Cexe, Cmem, Croute (left) and Ctime, Cval,j, Cfetch, Cmem, Cperm (right).]
Fig. 1. Circuits for the reductions from RAM programs to circuits from Section II-D (left) and Section III (right). Circuits Cfetch and Cperm receive additional input from the verifier as described in Sections III-B and III-D, respectively.
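The propagation conditions enforced by Ctime can be summarized as a simple predicate. The sketch below mirrors those checks in Python for intuition only; the state/instruction field names and the encoding are hypothetical, not the paper's actual TinyRAM format (and Ctime itself is an arithmetic circuit, not Python code).

```python
def ctime_check(i, S, I, S_next, O):
    """Toy stand-in for Ctime on a triple (S_i, I_i, S_{i+1}) with output O_i.
    States are dicts with hypothetical fields: step, pc, regs."""
    ok = S["step"] == i                    # S_i's step number is indeed i
    ok &= S["pc"] == I["line"]             # pc_i equals I_i's encoded line number
    ok &= S_next["step"] == i + 1          # step numbers are consecutive
    if I["is_jump"]:
        ok &= S_next["pc"] == O            # O_i is the jump destination
        ok &= S_next["regs"] == S["regs"]  # no register changes on a jump
    else:
        dst = I["dst"]
        ok &= S_next["regs"][dst] == O     # destination register receives O_i
        ok &= all(S_next["regs"][r] == S["regs"][r]  # all others unchanged
                  for r in range(len(S["regs"])) if r != dst)
        ok &= S_next["pc"] == S["pc"] + 1  # pc advances to the next line
    return ok

S      = {"step": 3, "pc": 7, "regs": [1, 2, 3]}
I      = {"line": 7, "dst": 1, "is_jump": False}
S_next = {"step": 4, "pc": 8, "regs": [1, 9, 3]}
assert ctime_check(3, S, I, S_next, O=9)
```

Note that, as stressed below, this predicate deliberately does not check that O is the correct output of executing the instruction; that is the job of the Cval,j circuits.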
Similar to Section II-D, Ctime also checks that Si's step number is indeed i and that pci is equal to the alleged location of Ii in P (as encoded in Ii by the prover). However, unlike Section II-D, we stress that Ctime does not verify that Oi is the correct output of executing Ii.

Verifying Instruction Execution. Let J be the number of instruction types supported by the RAM architecture. We include in the witness w an additional trace tr∗∗ that is the result of sorting the pairs (Si, Ii) ∈ tr by the instruction type of Ii. Define a circuit Cval,j which takes as input a pair (Si∗∗, Ii∗∗) ∈ tr∗∗ and checks that Si∗∗ is a valid state for the instruction Ii∗∗ of type j (i.e., that Oi is correctly computed by executing Ii∗∗ on Si∗∗). In this way, Cval,j is specialized to a specific instruction type. Moreover, since tr∗∗ is sorted by instruction type, the copies of Cval,j also appear in C sorted by j. Thus, C can be succinctly described by (k1, . . . , kJ), where kj (for j = 1, . . . , J) denotes the number of times instruction type j appears in trace tr when program P is executed on input x (where ∑j kj = T).

B. Verifying Instruction Fetches

As described above, [11] ensures program consistency by first storing the program to memory during the machine's booting phase. Next, each instruction is sequentially loaded from memory for execution. These operations are treated the same as regular memory stores and loads, and are checked by T + ℓ copies of Cmem. Here, we explain how the correctness of these operations can be checked more efficiently, assuming the instructions in the program are fixed and known to the verifier (i.e., assuming that P does not contain self-modifying code, similar to [9]).

Unlike the reduction of Section II-D, the trace tr does not include a boot sequence. Instead, we observe that for each triple Si, Ii, Si+1, the circuit Ctime already checks that pci is equal to the line number of Ii in P (as encoded in Ii by the prover). All that remains is to verify that Ii is the instruction in P with that line number. Equivalently, let {P1, · · · , Pℓ} be the set of instructions in P, where each Pi is augmented to also contain its line number within P (as defined in Section II-D). Then we only need to check that the sequence {I1, · · · , IT−1} is a multiset of {P1, · · · , Pℓ} (the multiplicity of some Pi may be 0 to account for non-executed instructions). To that end, we add a circuit Cfetch that validates this multiset relation and leverages the interactive property of our scheme from Section IV. The circuit takes the sequence I1, · · · , IT−1 from tr and a random value r (provided by the verifier) as input. Cfetch outputs the evaluation of its characteristic polynomial at the point r, i.e., ∏_{i=1}^{T−1}(Ii − r). The verifier also receives from the prover the multiplicity kj of each Pj in {I1, · · · , IT−1}. Thus, he can compute the value ∏_{j=1}^{ℓ}(Pj − r)^{kj} = ∏_{i=1}^{T−1}(Ii − r) himself and test whether it matches the value output by the circuit. By the Schwartz-Zippel lemma, the probability that the verifier accepts when the two polynomials are not the same (i.e., {I1, · · · , IT−1} is not a multiset of {P1, · · · , Pℓ}) is negligible. We stress that this is only secure if we ensure that the prover commits to the entire witness (including I1, · · · , IT−1) before seeing r, as is the case in our construction in Section IV. In this way, we have replaced T + ℓ copies of Cmem with a smaller circuit Cfetch evaluating the characteristic polynomial at a random value, which leads to a concrete efficiency improvement.

C. Ensuring Memory Accesses

Similar to Section II-D, in order to verify memory accesses (ensuring (3)) we include in w a trace tr∗ = (A1, · · · , AT) sorted by the memory address being accessed (again with ties broken by step number and non-memory instructions located at the end of tr∗). Since the correctness of instruction fetches is already ensured (as described above), we only sort the states Si in tr, and the length of tr∗ now becomes T. For every two adjacent entries Ai, Ai+1 ∈ tr∗ with outputs Oi, Oi+1, step numbers ti, ti+1 and accessed addresses ai, ai+1, respectively, the circuit Cmem checks the same two conditions as in Section II-D. Finally, note that the number of instructions that actually perform memory operations may be smaller than T, but we still include T copies of Cmem in C to account for the worst case. In Appendix E, we show how this can be further improved to include only αT copies of Cmem, where 0 ≤ α ≤ 1 is the fraction of memory operations in the trace.

D. Checking Consistency Between tr, tr∗ and tr∗∗

Finally, it remains to check that tr∗ and tr∗∗ are indeed permutations of tr. Previous works [8], [9], [11] achieve this task by using routing networks, yielding a circuit of size O((T + ℓ) log(T + ℓ)) for a T-step RAM program of size ℓ, and correspondingly increasing the prover's asymptotic running time from linear to quasilinear. Following the approach of [52], we leverage the interactive nature of our argument in order to avoid the use of routing networks, replacing them with a simple interactive protocol that is similar to the one used above for verifying instruction fetches. The result is that our prover's running time is only O(T + ℓ), i.e., asymptotically the same as simply evaluating the program.

More specifically, assume the prover holds lists x1, . . . , xm and x′1, . . . , x′m and wants to convince the verifier that they are permutations of each other. Consider a circuit Cperm that takes x1, . . . , xm and x′1, . . . , x′m (provided by the prover) and a random point r (provided by the verifier) and outputs the result of ∏_{i=1}^{m}(xi − r) − ∏_{i=1}^{m}(x′i − r). If the two lists are permutations of each other the output is always zero; otherwise, by the Schwartz-Zippel lemma, it is zero with negligible probability.5 Finally, evaluating this polynomial requires O(m) gates. For our argument, we use two executions of this interactive protocol, one for the pair tr, tr∗ and one for tr, tr∗∗, in a way that ensures that C outputs zero only if Cperm outputs zero both times. From the above analysis, each of these circuits consists of O(T + ℓ) gates. We stress that it is crucial to have the prover commit to the two lists ahead of time, in particular before seeing r, for security purposes. This is enforced by our argument, as P commits to the entire witness w in the first step of the protocol (cf. Construction 2, Evaluation Phase, Step 1).

5 As a state (e.g., Ai in tr∗) contains multiple values such as Oi and ti that we want to ensure are permuted together, we pack the values before the check (e.g., for W-bit values (a, b, c), we set x = a × 2^{2W} + b × 2^W + c). If the result of a single pack overflows the field, we pack the values multiple times with respect to the first value. In our implementation, we use a 254-bit prime field, which allows packing of 7 32-bit numbers. We use the same technique to ensure that Si∗∗ and Ii∗∗ in tr∗∗ are permuted together.

We are now ready to state the following result. We defer the proof to the full version due to space limitations.

Theorem 1. Let ℓ be a program length parameter, T a time bound, and n an input bound. Assuming that Construction 1 is an extractable verifiable polynomial delegation protocol, combining the results of Section III with Construction 2 yields an argument system for the relation RAM_{ℓ,n,T} (as per Definition 3). Moreover, as the sizes of Ctime, Cval and Cmem are constants which are independent of n, T, ℓ, the running time of P is O(n + T + ℓ) and that of V is O(n + ℓ + polylog(T)). This yields a succinct argument with polylog(n + ℓ + T) rounds of interaction.

IV. AN IMPROVED ARGUMENT FOR ARITHMETIC CIRCUITS

In this section, we present our modifications to the (implicit) argument of vSQL [52]. First, we introduce a modified version of the CMT protocol that can efficiently handle circuits consisting of parallel copies of different sub-circuits (which is the format of our circuit from Section III above). We then present a VPD scheme with improved efficiency and show that combining the two yields an argument of knowledge with circuit-independent preprocessing.

A. Improving the Expressibility of the CMT Protocol

Following [52], we can verify the execution of a RAM program by applying the CMT protocol to the RAM-verification circuit C described in Section III. Recall that C contains T copies of Cmem and Ctime, and kj copies of Cval,j where ∑j kj = T. Applying the CMT protocol described in Section II-B and Appendix D (Theorem 5) to C would thus result in a prover complexity of O(|C| log |C|). In this section, we show how to modify the CMT protocol to efficiently handle circuits that consist of multiple (different) sub-circuits. When applied to our circuit C, this results in a prover time of O(|C| log max{|Cmem|, |Ctime|, |Cval|}). As the sizes of Cmem, Ctime, and Cval are constants which depend only on the specific RAM architecture, we obtain an asymptotically optimal prover running time of O(|C|), which is O(T + ℓ).

Let C be a depth-d, size-n, layered arithmetic circuit consisting of B independent ("parallel") sub-circuits C1, · · · , CB, each of depth at most d′ and size at most n′, where the outputs of C1, · · · , CB are fed into an aggregation circuit D of depth d′′ and size n′′. In this section, we show how to modify the CMT protocol so as to prove statements about the output of C in time that is linear in the size of C. Our modified protocol proceeds as follows. We start by following the standard CMT protocol for the d′′ layers of sub-circuit D. Next, for the remaining d − d′′ = d′ layers, we modify things in a way similar to [45] and [46]. Let Si now denote the maximum number of gates in layer i across C1, · · · , CB, and let si = ⌈log Si⌉. We let Vi again be a function mapping a gate at level i to its value, but we now specify a gate g by a pair g1, g2, where g2 ∈ [B] indicates the sub-circuit in which g lies and g1 ∈ [Si] is the index of g (at level i) within that sub-circuit. The prover and verifier then run a CMT-like protocol, but using the equation

Vi(g1, g2) = ∑_{u1,v1 ∈ {0,1}^{s_{i+1}}} ( add_{i+1}(g1, u1, v1, g2) · (V_{i+1}(u1, g2) + V_{i+1}(v1, g2)) + mult_{i+1}(g1, u1, v1, g2) · (V_{i+1}(u1, g2) · V_{i+1}(v1, g2)) ).

The equation above still recursively defines Vi in terms of Vi+1, but takes advantage of the fact that there is no interconnection between the different sub-circuits. This has the effect of reducing the number of variables in add_{i+1} and mult_{i+1} from 2s_{i+1} + s_i + 3⌈log B⌉ to 2s_{i+1} + s_i + ⌈log B⌉. Next, we define the multilinear extension of Vi(g1, g2):

Ṽi(z1, z2) = ∑_{u1,v1 ∈ {0,1}^{s_{i+1}}, g2 ∈ {0,1}^{⌈log B⌉}} f_{i,z1,z2}(u1, v1, g2)
          := ∑_{u1,v1 ∈ {0,1}^{s_{i+1}}, g2 ∈ {0,1}^{⌈log B⌉}} β̃i(z2, g2) · ( ãdd_{i+1}(z1, u1, v1, g2) · (Ṽ_{i+1}(u1, g2) + Ṽ_{i+1}(v1, g2)) + m̃ult_{i+1}(z1, u1, v1, g2) · (Ṽ_{i+1}(u1, g2) · Ṽ_{i+1}(v1, g2)) ).   (2)

The only difference between equation (2) and the equation used for data-parallel circuits with identical sub-circuits in [45], [46] is that ãdd_{i+1} and m̃ult_{i+1} take an extra variable g2, which reflects that the gates and wiring patterns can be different in each sub-circuit. We further observe that running the same algorithm for the sumcheck protocol as in [45], [46] on equation (2) results in the same prover complexity, which is O(B · Si log S_{i+1}).
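The layer-by-layer claims arising from the recursion above are reduced with the standard sumcheck protocol. As a self-contained illustration, the sketch below runs plain sumcheck for the sum of a multilinear polynomial over the boolean cube; the small prime and locally drawn challenges are for illustration only (in the actual protocol the verifier supplies the challenges and the implementation uses a 254-bit field).

```python
import random

P = 2**61 - 1  # toy prime field for illustration

def sumcheck(truth_table):
    """Prover + verifier for the claim `claim = sum of f over {0,1}^n`,
    where truth_table lists the values of a multilinear f on {0,1}^n."""
    n = len(truth_table).bit_length() - 1
    claim = sum(truth_table) % P
    vals = [v % P for v in truth_table]   # f restricted to the remaining variables
    for _ in range(n):
        half = len(vals) // 2
        g0 = sum(vals[:half]) % P         # round polynomial evaluated at 0
        g1 = sum(vals[half:]) % P         # round polynomial evaluated at 1
        assert (g0 + g1) % P == claim     # verifier's per-round consistency check
        r = random.randrange(P)           # verifier's random challenge
        claim = (g0 * (1 - r) + g1 * r) % P   # g(r) becomes the next claim
        vals = [(vals[i] * (1 - r) + vals[half + i] * r) % P
                for i in range(half)]     # fold the bound variable at r
    assert claim == vals[0]               # final check: one evaluation of f~
    return True

assert sumcheck([3, 1, 4, 1, 5, 9, 2, 6])
```

In the modified protocol, the polynomial being summed is f_{i,z1,z2} from equation (2), but the round structure is exactly this.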
Definition 4. Let F be a finite field, F a family of ℓ-variate polynomials over F, and d a variable-degree parameter. (KeyGen, Commit, Evaluate, Ver) constitute an extractable VPD scheme for F if:
• Perfect Completeness. For any polynomial f ∈ F it holds that
Pr[(pp, vp) ← KeyGen(1^λ, ℓ, d); com ← Commit(f, pp); (y, π) ← Evaluate(f, t, pp) : Ver(com, t, y, π, vp) = acc ∧ y = f(t)] = 1.
• Soundness. For any PPT adversary A the following probability is negligible:
Pr[(pp, vp) ← KeyGen(1^λ, ℓ, d); (f∗, t∗, y∗, π∗) ← A(1^λ, pp); com ← Commit(f∗, pp) : Ver(com, t∗, y∗, π∗, vp) = acc ∧ y∗ ≠ f∗(t∗)].
• Extractability. For any PPT adversary A there exists a polynomial-time algorithm E with access to A's random tape such that for all benign auxiliary inputs z ∈ {0,1}^{poly(λ)} the following probability is negligible:
Pr[(pp, vp) ← KeyGen(1^λ, ℓ, d); com∗ ← A(1^λ, pp, z); f′ ← E(1^λ, pp, z) : CheckCom(com∗, vp) = acc ∧ com∗ ≠ Commit(f′, pp)].

Construction 1 (Verifiable Polynomial Delegation). Let F be a prime-order field, and ℓ, d variable and degree parameters such that O(binom(ℓ(d+1), ℓd)) is poly(λ). Consider the following protocol for the family F of ℓ-variate polynomials of variable-degree d over F.
1) KeyGen(1^λ, ℓ, d): Select uniform α, s1, . . . , sℓ ∈ F, run bp ← BilGen(1^λ) and compute P = {g^{∏_{i∈W} s_i}, g^{α·∏_{i∈W} s_i}}_{W ∈ W_{ℓ,d}}. The public parameters are pp = (bp, P, g^α), and the verifier parameters are vp = (bp, g^{s1}, · · · , g^{sℓ}, g^α). For every f ∈ F we denote by pp_f ⊆ pp the minimal subset of the public parameters pp required to invoke Commit and Evaluate on f.
2) Commit(f, pp_f): If f ∉ F output null. Else, compute c1 = g^{f(s1,...,sℓ)} and c2 = g^{α·f(s1,...,sℓ)}, and output the commitment com = (c1, c2).
3) CheckCom(com, vp): Check whether com is well-formed, i.e., output accept if e(c1, g^α) = e(c2, g) and reject otherwise.
4) Evaluate(f, t, pp_f): On input t = (t1, . . . , tℓ), compute y = f(t). Next, using Lemma 1 compute the polynomials qi(xi, . . . , xℓ) for i = 1, . . . , ℓ, such that f(x1, . . . , xℓ) − f(t1, . . . , tℓ) = ∑_{i=1}^{ℓ}(xi − ti) · qi(xi, . . . , xℓ). Output y and the proof π := {g^{qi(s1,...,sℓ)}, g^{α·qi(s1,...,sℓ)}}_{i=1}^{ℓ}.
5) Ver(com, y, t, π, vp): Parse the proof π as (π1, π′1, . . . , πℓ, π′ℓ). If e(c1/g^y, g) = ∏_{i=1}^{ℓ} e(g^{si−ti}, πi) and e(c1, g^α) = e(c2, g) and e(πi, g^α) = e(π′i, g) for 1 ≤ i ≤ ℓ, output accept; otherwise output reject.

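The identity underlying Evaluate and Ver above is the decomposition f(x1, . . . , xℓ) − f(t1, . . . , tℓ) = ∑_i (xi − ti) · qi(xi, . . . , xℓ) (Lemma 1). The sketch below checks this identity numerically for a made-up multilinear polynomial (i.e., variable-degree d = 1), in which case each qi is simply the xi-slope of f with the first i − 1 variables fixed to t.

```python
import random

P = 2**61 - 1  # toy prime field

def f(x):
    # A made-up 3-variate multilinear polynomial, for illustration only.
    x1, x2, x3 = x
    return (5 + 3 * x1 + 2 * x2 * x3 + 7 * x1 * x2 * x3) % P

t = [random.randrange(P) for _ in range(3)]  # the evaluation point t

def q(i, x):
    # For multilinear f, q_i(x_i,...,x_l) is the quotient of
    # f(t_1..t_{i-1}, x_i, x_{i+1}..x_l) - f(t_1..t_i, x_{i+1}..x_l)
    # by (x_i - t_i): the x_i-slope, which is independent of x_i.
    hi = f(t[:i] + [1] + x[i + 1:])
    lo = f(t[:i] + [0] + x[i + 1:])
    return (hi - lo) % P

# The telescoping sum reproduces f(x) - f(t) at a random point x:
x = [random.randrange(P) for _ in range(3)]
lhs = (f(x) - f(t)) % P
rhs = sum((x[i] - t[i]) * q(i, x) for i in range(3)) % P
assert lhs == rhs
```

In the actual construction the prover evaluates each qi "in the exponent" at the secret point (s1, . . . , sℓ), and the pairing check in Ver tests exactly this identity.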
In this way, we extend the class of circuits efficiently supported by the CMT protocol of [45], [46] without any overhead on the prover time.6 We analyze the complexity of the sum-check protocol resulting from equation (2) in Appendix F. We present the following result.

6 The complexity of the CMT protocol for circuits composed of identical sub-circuits has recently been improved to O(B·Si + Si log Si) in [49]. Generalizing this technique to different sub-circuits is left as future work.

Theorem 2. Let C : F^n → F be a depth-d layered arithmetic circuit consisting of B parallel sub-circuits C1, . . . , CB connected to an "aggregation" circuit D such that |D| = O(|C|/ log |C|), and let S = max_j{width(Cj)}. Executing the CMT protocol from Construction 3 using Equation (2) and the above-described modifications to the sum-check protocol yields an interactive proof for C with soundness O(d · width(C)/|F|). Moreover, P's running time is O(|C| log S) and the protocol uses O(d log(width(C))) rounds of interaction. If ãdd_i and m̃ult_i are computable in time O(polylog(width(C))) for all the layers of C, then the running time of the verifier V is O(n + d · polylog(width(C))).

B. A VPD Scheme with Linear Prover Time

In the last step of the CMT protocol, the verifier Vcmt evaluates a polynomial Ṽd at a random point rd. Since the number of terms in Ṽd is equal to the number of input gates of C, this makes the verifier's work linear not only in the size of the input x but also in the length of the witness w. In vSQL [52], this is avoided by using a VPD scheme that allows P to provide Ṽd(rd) to V together with a succinct proof of its validity. (See Definition 4 for the definition of a VPD scheme. Our definition adapts that of [52] by introducing an additional algorithm CheckCom that checks whether a commitment is well-formed.) Here, we improve the VPD scheme of [52] and present a new scheme with the same verifier complexity, but with prover time linear in the number of terms of Ṽd (as opposed to quasi-linear).

As our starting point we use the selectively secure VPD scheme of Papamanthou et al. [39]. Unfortunately, selective security means that the parameters used for the VPD protocol are computed as a function of the specific point rd on which the VPD will be executed. This is insufficient for our application, since the VPD's parameters are generated once during the preprocessing phase, which happens before the CMT protocol. To overcome this limitation, we modify this scheme to require the prover to provide additional "extractability" terms as part of the evaluation proof. Our modified VPD scheme is given as Construction 1. We define the variable-degree of a multivariate polynomial f to be the maximum degree of f in any of its variables, and use W_{ℓ,d} to denote the collection of all multisets of {1, . . . , ℓ} for which the multiplicity of any element is at most d. We formally state the security and asymptotic performance guarantees of the scheme in Appendix G.

C. Putting it All Together

Finally, we present our argument system with circuit-independent preprocessing. Our construction combines the modified CMT protocol from Section IV-A with the VPD scheme presented in Section IV-B. We refer to the prover and verifier of the CMT protocol as (Pcmt, Vcmt), respectively, and to the algorithms of the VPD scheme as (KeyGen, Commit, Evaluate, Ver). We construct an argument system (G, P, V) for the satisfiability of arithmetic circuits over finite fields, where the preprocessing done by G depends on a bound on the size of the circuit, the size of its input, and the field over which it is defined, but not on the circuit itself.

Let V^{1+2}_{cmt} be the restriction of the CMT verifier from Construction 3 which performs Steps 1 and 2 of Vcmt and outputs (rd, ad) without performing Step 3. Construction 2 is a formal description of our argument system. Consider the following theorem.

Theorem 3. If Construction 1 is an extractable VPD scheme,
Construction 2. Let F be a prime-order field with |F| exponential in λ, and let n, t be input-size and circuit-size parameters. For simplicity of exposition we assume that n is a power of 2. Consider the algorithms G, P, V described below.
Preprocessing Phase. G(1^λ, n, t) runs (pp, vp) ← KeyGen(1^λ, n, 1). The proving key pk is set to pp and the verification key vk is set to vp.
Evaluation Phase. Let C : F^{nx+nw} → F be a depth-d arithmetic circuit with at most t gates such that nx + nw ≤ n. Moreover, let x ∈ F^{nx} and w ∈ F^{nw} be such that C(x; w) = 1. Assume that nw/nx = 2^m − 1 for some m ∈ N. Consider the following protocol between P and V.
1) P first commits to the multilinear extension Ṽd of the input layer of C(x; w). That is, P runs c ← Commit(Ṽd, pp) and sends c to V. Upon receiving c, V runs CheckCom(c, vp). If the output is reject, V rejects.
2) V computes the multilinear extension x̃ of the input x, generates a random point r ∈ F^{log(nx)} × {0}^{log(nw)} and sends r to P. P executes (a, π) ← Evaluate(Ṽd, r, pp) and sends (a, π) to V. V executes Ver(c, a, r, π, vp). If Ver outputs reject or a ≠ x̃(r), V rejects.
3) V runs V^{1+2}_{cmt} and P runs Pcmt to verify C(x; w) = 1. If V^{1+2}_{cmt} rejects at any point, V rejects. Otherwise, let rd, ad be the final values returned by V^{1+2}_{cmt}. At this point, V must verify that Ṽd(rd) = ad.
4) V sends rd to P. Upon receiving rd, P executes Evaluate(Ṽd, rd, pp) and obtains (a′d, π′), which he sends to V.
5) Upon receiving (a′d, π′), V executes Ver(c, a′d, rd, π′, vp). If Ver outputs reject or a′d ≠ ad, V rejects. Otherwise, V accepts.
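In Step 2, V evaluates the multilinear extension x̃ of his own input at the random point r and compares it against the value the prover opens. A minimal sketch of that evaluation follows (toy prime field; the bit-ordering convention is an assumption made for illustration).

```python
import random

P = 2**61 - 1  # toy prime field

def mle_eval(table, point):
    # Evaluate the multilinear extension of `table` (the values on {0,1}^m,
    # with index bit j = variable j, most significant bit first) at `point`,
    # folding one variable at a time.
    vals = [v % P for v in table]
    for z in point:
        half = len(vals) // 2
        vals = [(vals[i] * (1 - z) + vals[half + i] * z) % P
                for i in range(half)]
    return vals[0]

x = [7, 8, 9, 10]  # a toy input, nx = 4
# On boolean points the extension agrees with the table itself:
assert mle_eval(x, [0, 0]) == 7
assert mle_eval(x, [1, 1]) == 10
r = [random.randrange(P) for _ in range(2)]
a = mle_eval(x, r)  # the value x~(r) compared against the opening in Step 2
```

Because r is zero on the witness coordinates, this check binds the committed Ṽd to the verifier's input x without V ever touching w.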
then Construction 2 is an argument system for arithmetic circuits. When used for a depth-d, layered circuit C consisting of B parallel sub-circuits C1, . . . , CB whose outputs feed into a circuit D with |D| ≤ |C|/ log |C|, the running time of P is O(|C| · log max_j{width(Cj)}) and the protocol has O(d log(width(C))) rounds. If C has input length n and is log-space uniform, then the running time of V is O(n + d · polylog(|C|)). Finally, if d is polylog(|C|), the above construction is a succinct argument.

TABLE II
BENCHMARKS IN OUR EXPERIMENTS. WE REPORT THE INPUT SIZE, THE NUMBER OF CPU CYCLES AND THE NATIVE RUNNING TIME ON THE VERIFIER FOR THE INSTANCES WE USED IN TABLE III.

Benchmark               | Input Size         | # of Cycles | Native
1: Matrix Mult.         | n = 2^15           | 96M         | 42ms
2: Pointer Chasing      | n = 16634          | 50K         | 22µs
3: Merge Sort           | n = 512            | 65K         | 28µs
4: KMP Search           | n = 2900, k = 256  | 30K         | 13µs
5: Sparse Mat-Vec Mult. | n = 1150, k = 2300 | 27K         | 12µs

V. EXPERIMENTAL EVALUATION

Software and Hardware. We implemented our constructions (including the RAM reduction, circuit generator, CMT protocol, and VPD protocol) in C++. We use the GMP library [3] for field arithmetic and OpenSSL's [5] SHA-256 implementation for hashing. For the bilinear pairing we use the ate-pairing library [1] on a 254-bit elliptic curve.

We run our experiments on an Amazon EC2 m4.2xlarge machine with 32 GB of RAM and an Intel Xeon E5-2686v4 CPU with eight 2.3 GHz virtual cores. Our implementations are not parallelized and use only a single CPU core.

A. Comparison with vnTinyRAM and Buffet

In this section, we compare the performance of our system to existing systems for verifiable RAM. We compare to Buffet [50], a verifiable RAM system with program-specific preprocessing (where the parameters generated by the trusted preprocessing can only be used to verify one specific program on different inputs), and vnTinyRAM [11], a universal verifiable RAM system (where the parameters generated by the trusted preprocessing can be used to verify any program up to some bound on the number of steps). We also measure the performance of our system against naive unverified execution of the RAM program. Finally, in Section V-C we also discuss comparisons to other verifiable RAM systems.

Benchmarks. As benchmarks, we evaluate the RAM programs from [50] (see Table II). Following that work, we benchmark our system using programs of three types.
1) Circuit Friendly. The function computed by these programs has a very efficient circuit representation. We use matrix multiplication as an example.
2) Fixed Memory Access and Instruction Patterns. These programs do not exploit the full generality of RAM machines, i.e., their memory-access patterns and control flow do not depend on the program's inputs. This allows for a tighter RAM-to-circuit reduction, since it can be determined ahead of time which instruction will be executed at each time step. Thus, the produced circuit only needs to handle a specific instruction per cycle. We use pointer chasing and merge sort as examples of such programs.
3) Input-Dependent Memory Access and Instruction Patterns. Such RAM programs use the full generality of RAM machines since they have input-dependent control flow and memory-access patterns. In particular, the circuit generated by the RAM reduction must be able to handle multiple possible instructions at every step. We use KMP string matching [34] and CSR sparse matrix-vector multiplication [26] as examples of such programs.

Buffet Evaluation Methodology. Buffet's front-end takes a RAM program and outputs a circuit that verifies its execution, and its back-end uses a circuit-based VC system based on Pinocchio [40]. We evaluate Buffet using the released code [2].

vnTinyRAM Evaluation Methodology. We evaluate vnTinyRAM [11] using the code at [4]. As the code that takes a TinyRAM program and outputs the traces for vnTinyRAM is not available, we are unable to produce vnTinyRAM traces corresponding to the execution of any benchmark RAM program. Instead, we estimate the cost of vnTinyRAM by running the prover on traces of appropriate length resulting from executing random machine instructions. Since the performance of vnTinyRAM only depends on the total number of CPU steps and not on the instruction executed at each step, this estimate is accurate.7

7 A version of vnTinyRAM that removes unnecessary instructions in each step after running the particular program to be verified was released by Wahby et al. [50]. However, since the prover in this program-specific version is unable to handle arbitrary RAM programs, it is not appropriate for our comparison.
TABLE III
COMPARISON OF THE PERFORMANCE OF VRAM VERSUS BUFFET AND VNTINYRAM (∗ DENOTES SIMULATION DUE TO MEMORY EXHAUSTION).

Column groups: Setup Time (min), Prover Time (min), |C| (millions of gates), and Verification Time (ms); each group reports vnTinyRAM, Buffet, and vRAM (for |C|, vRAM is reported as mult/total).

#1: 460000∗, 16.6, 14.4, 0.65, 290000∗, 240000∗, 9.9, 9.9, 19.8, 422∗, 401, 26
#2: 20.0, 11.2, 17.3, 150∗, 125∗, 8.6, 38.5, 150.8, 56∗, 69, 93
#3: 16.1, 9.6, 21.2, 200∗, 164∗, 7.9, 36.2, 148.3, 9∗, 8, 91, 38.7
#4: 310∗, 22.9, 12.6, 9.2, 90∗, 75∗, 10.5, 18.2, 72.4, 15∗, 20, 84
#5: 20.8, 11.8, 10.2, 82∗, 68∗, 9.4, 18.1, 74.3, 20∗, 15, 85
Fig. 2. Prover time (left) and memory consumption (right) of our construction vs. vnTinyRAM and Buffet for various numbers of CPU steps.

Using a Different Back-End for vnTinyRAM and Buffet. Both Buffet and vnTinyRAM can be re-factored to use the more recent construction of [31] as their back-end. This would result in an approximate improvement of 30% in their setup, prover time and public key size, as well as a 50% improvement in their proof and verification key sizes. This would also improve verification time by 3×, as per the benchmarks of [4].

vRAM Evaluation Methodology. For vRAM, we implemented our own TinyRAM simulator to output the program traces used by our prover and verifier backend. We then adapted the assembly code for the programs in the Buffet benchmark, and ran them in our TinyRAM simulator to obtain execution traces, which we provided to the prover-verifier backend. In order to measure the cost of our system vs. naive unverified execution, we estimate the execution time of random instructions on a single-threaded 2.3 GHz CPU core.

Experimental Results. The results of the comparison are summarized in Tables II and IV as well as in Figure 2. We executed each program on the largest input size reported in [50]. Table II summarizes their input size, number of CPU cycles and the native running time if executed on the verifier locally. As vnTinyRAM cannot handle such large parameters, we estimate its cost by extrapolation, assuming linear growth. This yields a conservative estimate since the overhead of vnTinyRAM's prover grows quasilinearly (rather than linearly) with the number of RAM instructions. We report setup time, prover and verifier time, proof size and the size of the circuit verifying the RAM program. In Figure 2, we show the prover time and memory consumption of the three systems versus the number of CPU steps. In vRAM, these are mainly determined by the number of CPU steps executed by the benchmark, rather than the specific choice of instructions executed in these steps. Consequently, we show the performance of pointer chasing as a representative example, with other programs behaving similarly. Since Buffet optimizes the circuit generated based on a particular benchmark program, we report two cases: one is pointer chasing, which is a fixed-RAM program, and the other is string search, which is a data-dependent RAM program.

Comparison with vnTinyRAM. Both our system and vnTinyRAM can verify the execution of arbitrary programs with a single setup. As shown in Table III and Figure 2 (left), for all benchmarks except matrix multiplication, our system achieves an approximate 8× improvement in setup time and 9× improvement in prover time compared to vnTinyRAM. Note that vnTinyRAM is unable to exploit the fact that matrix multiplication is circuit-friendly, leading to large circuit size, setup, prover and verifier times. Since our system uses a preprocessing phase that only depends on the input size and is otherwise agnostic to the program representation, for circuit-friendly benchmarks we are able to directly use the program's circuit representation and thereby obtain an improvement of more than 4 orders of magnitude for setup time and 5 orders of magnitude for proving time compared to vnTinyRAM.8

8 Note that in order to support all the benchmarks in Table III, vnTinyRAM only needs to execute a single preprocessing phase which is as large as the largest instance, i.e., matrix multiplication. However, for fair comparison, we report a separate setup time for the 4 RAM-friendly programs and compare the performance of our system to this number.

The speedup obtained by vRAM is due to (1) the better RAM-to-circuit reduction from Section III; and (2) the faster argument system from Section IV. To isolate the effect of (1), in Table III we report the number of gates in the circuits produced by our reduction. Note that unlike vnTinyRAM and Buffet, in vRAM all types of gates (numbers reported in the last column) contribute to the prover time, instead of multiplication gates only.9 Thus, to facilitate the comparison between vnTinyRAM's circuit reduction and our circuit reduction, we also report the number of multiplication gates in the table. As shown in Table III, the number of multiplication gates in our system is 3.3–4.5× less than in vnTinyRAM. Regarding (2), the performance of our argument system is demonstrated in more detail in Appendix H, where we show that the per-gate cost of our system is lower than that of QAP-based systems.

9 Both vnTinyRAM and Buffet use the notion of quadratic constraints, with each constraint verifying that the product of the outputs of two unbounded fan-in gates equals the output of a third unbounded fan-in addition gate.

Comparison with Buffet. The main advantage of our system compared to Buffet is that it can support arbitrary programs with a single setup. As shown in Table III, the setup time for our system is 38.7 minutes for any program that runs for up to 65K CPU steps. Although the setup time of Buffet for the indicated programs is lower, an independent setup would have to be run for each different program to be verified (and the set of programs being verified must be known at the time setup is run). Moreover, we note that Buffet's setup time would likely be larger than ours if used for a program running for 65K CPU steps (which none of the benchmarks do).

Overall, the prover time of our system is comparable to that of Buffet. On one hand, for programs with fixed memory access and instruction patterns (such as pointer chasing and merge sort) Buffet can perform numerous optimizations, since the instruction to be executed in each CPU step is
pre-determined. This allows Buffet to highly customize the resulting circuit. Nonetheless, our system is still only around 2× slower than Buffet while avoiding program-dependent preprocessing. On the other hand, for programs with input-dependent memory and instruction patterns (such as KMP string search and sparse matrix-vector multiplication), our system actually outperforms Buffet, despite the fact that the latter can optimize the circuit during preprocessing. Moreover, as mentioned in [50, Section 4.3], if a program has deep nesting of data-dependent loops or complex conditions (e.g., a state machine), the compiler of Buffet may have to incur a significantly higher overhead, since the amount of applicable optimizations will be limited. However, the performance of our construction is not adversely affected by such programs; therefore, our speedup compared to Buffet can be higher.

Finally, we note that when the program is circuit-friendly, e.g., matrix multiplication, Buffet can also represent the computation using a circuit. In this case, the circuit is exactly the same in both systems, and the prover time of our system is 22× faster than Buffet's, since our argument system outperforms Buffet's Pinocchio-based argument [40].

Memory Consumption. Another advantage of our system is that it uses much less memory in order to prove the same statement. As shown in Figure 2 (right), the memory consumption of our system is 55–110× less than vnTinyRAM's, yielding a two-orders-of-magnitude improvement. The memory consumption is also 4–8× less than Buffet's. In particular, on a desktop machine with 32GB of RAM, we can execute 2^18 CPU steps, while vnTinyRAM can only reach 2^12 steps, and Buffet can reach 2^15–2^16 steps. We also report the memory consumption for the benchmarks we run in Table IV. The improvement is largely due to our reliance on the CMT protocol, which imposes a minimal memory overhead for the non-input part of the circuit. In fact, although the circuit size is much larger than the input size, the memory usage of our VPD protocol and the CMT protocol are on the same order. In addition, in the VPD protocol, the memory is mainly used for storing the public key, thus the usage is roughly the same in the setup and the evaluate phases of VPD.

                     #1     #2     #3     #4     #5
Proof Size (KB)       4    256    255    236    235
Memory Usage (GB)   3.6    7.6    7.7    3.8    3.8

TABLE IV: Proof size and memory usage of vRAM.

Verification Time and Proof Size. We next compare the verification time and communication cost of our system with vnTinyRAM and Buffet, both of which outperform our system. In particular, the verification time is 9–56ms for vnTinyRAM and 8–35ms for Buffet (except matrix multiplication). Also, vnTinyRAM and Buffet inherit a proof size of 288 Bytes from QAP-based SNARKs. For comparison, the verification time and the overall communication cost for our construction vary with the circuit size. As shown in Tables III and IV, the verification time is 84–93ms and the communication is 235–256KB for the different programs. However, we believe that these are very modest quantities for any modern machine.

Proving 2 Million Instructions. To demonstrate the ability of our construction to handle the task of verifying programs that run for large amounts of CPU steps, we also ran our system on an Amazon EC2 m4.16xlarge machine featuring 256GB of RAM and an Intel Xeon E5-2676v3 CPU with 64 virtual cores running at 2.4GHz. Using this machine, we executed our system for programs consisting of 2^21 instructions. The reported prover time is 51000s, the memory consumption grows to 252GB, and the total number of gates in the circuit is 4.8 billion. While these numbers are concretely large, we stress that, to the best of our knowledge, this is by far the largest reported successfully performed instance of verifiable RAM computation. In particular, this instance is about 65× larger than the largest instance reported in [11] (which was achieved by using a 256GB solid-state drive as additional memory space). Finally, the reported verification time was less than 105ms and the total communication cost was 336.5KB.

B. Practical Limitations of vRAM

The obvious reason for the verifier to delegate computations to a prover is to save on resources such as time or memory consumption. For this to make sense, it must be the case that the resources required to verify a program are fewer than those needed to naively execute it. Assuming the verifier runs on a 1GHz machine computing 10^9 instructions per second, for QAP-based systems such as vnTinyRAM and Buffet, which offer extremely efficient verification, the verifier's break-even point for saving computational power is reached when delegating programs larger than 10 million TinyRAM instructions. As the verifier's performance in our construction is 4–10× worse, the break-even point for vRAM is about 135 million instructions.

However, we argue that a naive computation of the break-even point is an oversimplified performance metric which hides important practical considerations. First, it assumes that the verifier's computational resources are of the same cost as the prover's resources. An example where this is not the case is where the verifier is manufactured using old-but-trusted hardware, compared to a newer but untrusted prover (e.g., see the setting of [48], [49]). Second, in the case of zero-knowledge SNARKs, the verifier is unable to perform the computation by itself, as it involves the prover's private data. Thus, the verifier is forced to use the (slower) SNARK in order to validate the computation's correctness. While vRAM does not support zero-knowledge, recent follow-up work [53] shows a zero-knowledge variant of the verifiable computation protocol presented in Section IV. Moreover, vRAM can also be used for delegation of data to the prover (e.g., for cloud storage, while keeping only a hash of the data locally). In this case, local execution is again impossible for the verifier, unless he is willing to download all of the data for each computation.

Finally, focusing on the break-even-point metric completely hides the prover's overhead. In particular, under this metric, a VC protocol with a break-even point of a single instruction whose prover takes decades to produce a proof appears to be much more performant than current VC protocols, where the proof is produced within hours but the break-even point is millions of instructions. This is especially problematic since
the largest instances supported by the current generation of VC protocols are still below the protocols' break-even points. As almost all computations performed by modern machines easily last hundreds of billions of cycles, we argue that it is also important to consider the ratio between the break-even point and the largest instance supported by a VC protocol (given fixed prover resources). While having a slower verifier and a larger break-even point, vRAM offers a much more efficient prover (both in time and, more importantly, in memory consumption) compared to other VC protocols. In particular, for a prover with 256GB of memory, vRAM can support computations which are about 63× away from its break-even point, compared to vnTinyRAM's 312× and Buffet's 20–40× (depending on the program to be verified).

C. Comparison to Other RAM-based VC Systems

In this section, we briefly discuss the performance of our system compared to other RAM-based VC systems.

Pantry and SNARKs for C. Pantry [17] and SNARKs for C [9] are two VC schemes that predate Buffet and vnTinyRAM, with their performance subsumed by those systems (see [50, Figure 10] and [11, Figure 3]).

Exploiting Data-Parallel Structure via Bootstrapping. Geppetto [20] is a VC system that takes a large circuit, splits it into sub-circuits, and preprocesses each sub-circuit with a SNARK separately. An additional SNARK is then applied to aggregate and verify the outputs of all sub-circuits in a "bootstrapping" step. Though verifiable RAM is not explicitly considered in [20], the system can potentially be applied to circuits checking the correctness of a RAM program, such as the ones in Sections II-D and III. Due to the data-parallel structure of these circuits, Geppetto can reduce the setup time asymptotically (e.g., only one setup for the sub-circuit Cmem, Ctime, etc.). However, it introduces a large concrete overhead for both setup and prover time because of the bootstrapping phase. For example, it requires ∼30,000–100,000 gates to bootstrap one small sub-circuit of just 500 gates [20, Section 7.3.1].

Constant or No Preprocessing. Two alternative approaches for RAM-based VC are suggested in [10], [7] by Ben-Sasson et al. The first uses composition of elliptic curves to recursively apply a SNARK in a sequence of T fixed-size circuits, each of which validates the state of a single previous CPU step, executes the next CPU step, and outputs the new state. In this way, the resulting setup time is constant. The second constructs a RAM-based VC without any preprocessing by using PCPs. Both these systems incur a very large concrete overhead on the prover. It takes 35.5 seconds/cycle for the first system [10, Figure 1], which is about 3000× slower than ours. For the second one, it takes 0.33 seconds/cycle using 64 threads in parallel [7, Figure 1], which roughly corresponds to 21.1 seconds/cycle using a single thread [7, Section 2]. This is compared to our single-threaded implementation, which achieves 0.015 seconds/cycle. We leave the task of achieving a speedup for our system via parallelization as future work.

Fig. 3. Prover time for evaluating memcpy (left) and RC4 (right) using vRAM with (green) and without (blue) the optimizations of Section V-D. For memcpy we vary the size of the copied memory block and for RC4 we vary the number of pseudorandom bytes generated.

D. Just-in-Time Architecture

Next, we use the architecture-independent preprocessing property of our scheme to improve performance for specific tasks. Common just-in-time compilation methods are used to optimize the executed code for a specific architecture. The circuit-independent preprocessing feature of our construction allows us to take this approach further and modify the machine's architecture in order to better fit a specific program after executing it, when the program's exact behavior on its inputs is known. We illustrate this using two benchmarks from [9], [11]. We stress that since our protocol has architecture-independent preprocessing, we are able to change the architecture without rerunning the preprocessing phase. In particular, the following results were achieved with a single preprocessing execution. In all cases, the verifier's runtime remained below 150ms.

Improving Performance by Adding Instructions. Figure 3 (left) shows the prover's time for evaluating a program which copies consecutive blocks of memory from one location to another (e.g., memcpy). We achieve a 3.6× improvement by introducing a memory instruction which (1) copies a byte from memory address A to memory address B and (2) increments A and B by 1 for the next loop iteration. This reduces the number of gates in the obtained circuit, thus yielding lower prover time. In this case, we did not modify any of the machine's other parameters (e.g., number of registers and register size).

Improving Performance by Changing Register Sizes. Next, Figure 3 (right) shows the prover's time for evaluating an RC4 pseudorandom generator on a highly specialized architecture. More specifically, we modified the machine to contain 3 8-bit registers, a 32-bit address register for memory accesses, and a 32-bit program counter. Each RC4 round was implemented using 16 instructions operating over the 8-bit registers. Notice the 2.4× speedup compared to the non-optimized version, which again results from the overall reduction of the gates necessary in order to generate one pseudorandom byte.

ACKNOWLEDGMENTS

We thank the anonymous reviewers for their comments, and Stefano Tessaro for shepherding the paper. This work was supported in part by NSF awards #1514261 and #1652259, financial assistance award 70NANB15H328 from the U.S. Department of Commerce, National Institute of Standards and Technology, the 2017-2018 Rothschild Postdoctoral Fellowship, and the Defense Advanced Research Projects Agency (DARPA) under Contract #FA8650-16-C-7622.
REFERENCES

[1] Ate pairing. https://github.com/herumi/ate-pairing.
[2] Buffet. https://github.com/pepper-project/releases.
[3] The GNU multiple precision arithmetic library. https://gmplib.org/.
[4] libsnark. https://github.com/scipr-lab/libsnark.
[5] OpenSSL toolkit. https://www.openssl.org/.
[6] W. Aiello, S. N. Bhatt, R. Ostrovsky, and S. Rajagopalan. Fast verification of any remote procedure call: Short witness-indistinguishable one-round proofs for NP. In ICALP 2000.
[7] E. Ben-Sasson, I. Bentov, A. Chiesa, A. Gabizon, D. Genkin, M. Hamilis, E. Pergament, M. Riabzev, M. Silberstein, E. Tromer, and M. Virza. Computational integrity with a public random string from quasi-linear PCPs. In Eurocrypt 2017.
[8] E. Ben-Sasson, A. Chiesa, D. Genkin, and E. Tromer. Fast reductions from RAMs to delegatable succinct constraint satisfaction problems. In ITCS 2013.
[9] E. Ben-Sasson, A. Chiesa, D. Genkin, E. Tromer, and M. Virza. SNARKs for C: Verifying program executions succinctly and in zero knowledge. In Crypto 2013.
[10] E. Ben-Sasson, A. Chiesa, E. Tromer, and M. Virza. Scalable zero knowledge via cycles of elliptic curves. In Crypto 2014.
[11] E. Ben-Sasson, A. Chiesa, E. Tromer, and M. Virza. Succinct non-interactive zero knowledge for a von Neumann architecture. In USENIX Security 2014.
[12] N. Bitansky, R. Canetti, A. Chiesa, and E. Tromer. From extractable collision resistance to succinct non-interactive arguments of knowledge, and back again. In ITCS 2012.
[13] N. Bitansky, R. Canetti, O. Paneth, and A. Rosen. On the existence of extractable one-way functions. In STOC 2014.
[14] N. Bitansky, A. Chiesa, Y. Ishai, R. Ostrovsky, and O. Paneth. Succinct non-interactive arguments via linear interactive proofs. In TCC 2013.
[15] D. Boneh and X. Boyen. Short signatures without random oracles. In Eurocrypt 2004.
[16] E. Boyle and R. Pass. Limits of extractability assumptions with distributional auxiliary input. In Asiacrypt 2015.
[17] B. Braun, A. J. Feldman, Z. Ren, S. T. V. Setty, A. J. Blumberg, and M. Walfish. Verifying computations with state. In SIGOPS 2013.
[18] A. Chiesa, E. Tromer, and M. Virza. Cluster computing in zero knowledge. In Eurocrypt 2015.
[19] G. Cormode, M. Mitzenmacher, and J. Thaler. Practical verified computation with streaming interactive proofs. In ITCS 2012.
[20] C. Costello, C. Fournet, J. Howell, M. Kohlweiss, B. Kreuter, M. Naehrig, B. Parno, and S. Zahur. Geppetto: Versatile verifiable computation. In IEEE S&P 2015.
[21] G. Danezis, C. Fournet, J. Groth, and M. Kohlweiss. Square span programs with applications to succinct NIZK arguments. In Asiacrypt 2014.
[22] A. Fiat and A. Shamir. How to prove yourself: Practical solutions to identification and signature problems. In Crypto 1986.
[23] D. Fiore, C. Fournet, E. Ghosh, M. Kohlweiss, O. Ohrimenko, and B. Parno. Hash first, argue later: Adaptive verifiable computations on outsourced data. In ACM CCS 2016.
[24] R. Gennaro, C. Gentry, and B. Parno. Non-interactive verifiable computing: Outsourcing computation to untrusted workers. In Crypto 2010.
[25] R. Gennaro, C. Gentry, B. Parno, and M. Raykova. Quadratic span programs and succinct NIZKs without PCPs. In Eurocrypt 2013.
[26] A. M. Ghuloum and A. L. Fisher. Flattening and parallelizing irregular, recurrent loop nests. In PPOPP 1995.
[27] S. Goldwasser, S. Micali, and C. Rackoff. The knowledge complexity of interactive proof-systems. In STOC 1985.
[28] S. Goldwasser, Y. T. Kalai, and G. Rothblum. Delegating computation: Interactive proofs for muggles. In STOC 2008.
[29] J. Groth. On the size of pairing-based non-interactive arguments. In Eurocrypt 2016.
[30] J. Groth. Short pairing-based non-interactive zero-knowledge arguments. In Asiacrypt 2010.
[31] J. Groth. On the size of pairing-based non-interactive arguments. In Eurocrypt 2016, pages 305–326, 2016.
[32] Y. Ishai, E. Kushilevitz, and R. Ostrovsky. Efficient arguments without short PCPs. In CCC 2007.
[33] J. Kilian. A note on efficient zero-knowledge proofs and arguments. In STOC 1992.
[34] D. E. Knuth, J. H. Morris, Jr., and V. R. Pratt. Fast pattern matching in strings. SIAM J. Computing, 6(2):323–350, 1977.
[35] A. E. Kosba, D. Papadopoulos, C. Papamanthou, M. F. Sayed, E. Shi, and N. Triandopoulos. TRUESET: Faster verifiable set computations. In USENIX Security 2014.
[36] H. Lipmaa. Succinct non-interactive zero knowledge arguments from span programs and linear error-correcting codes. In Asiacrypt 2013.
[37] C. Lund, L. Fortnow, H. Karloff, and N. Nisan. Algebraic methods for interactive proof systems. J. ACM, 39(4):859–868, 1992.
[38] S. Micali. Computationally sound proofs. SIAM J. Computing, 30(4):1253–1298, 2000.
[39] C. Papamanthou, E. Shi, and R. Tamassia. Signatures of correct computation. In TCC 2013.
[40] B. Parno, J. Howell, C. Gentry, and M. Raykova. Pinocchio: Nearly practical verifiable computation. In IEEE S&P 2013.
[41] B. Parno, M. Raykova, and V. Vaikuntanathan. How to delegate and verify in public: Verifiable computation from attribute-based encryption. In TCC 2012.
[42] S. Setty, B. Braun, V. Vu, A. J. Blumberg, B. Parno, and M. Walfish. Resolving the conflict between generality and plausibility in verified computation. In EuroSys 2013.
[43] S. T. V. Setty, V. Vu, N. Panpalia, B. Braun, A. J. Blumberg, and M. Walfish. Taking proof-based verified computation a few steps closer to practicality. In USENIX Security 2012.
[44] S. T. Setty, R. McPherson, A. J. Blumberg, and M. Walfish. Making argument systems for outsourced computation practical (sometimes). In NDSS 2012.
[45] J. Thaler. Time-optimal interactive proofs for circuit evaluation. In Crypto 2013.
[46] J. Thaler. A note on the GKR protocol, 2015. Available at http://people.cs.georgetown.edu/jthaler/GKRNote.pdf.
[47] V. Vu, S. Setty, A. J. Blumberg, and M. Walfish. A hybrid architecture for interactive verifiable computation. In IEEE S&P 2013.
[48] R. S. Wahby, M. Howald, S. Garg, A. Shelat, and M. Walfish. Verifiable ASICs. In IEEE S&P 2016.
[49] R. S. Wahby, Y. Ji, A. J. Blumberg, A. Shelat, J. Thaler, M. Walfish, and T. Wies. Full accounting for verifiable outsourcing. In ACM CCS 2017.
[50] R. S. Wahby, S. T. Setty, Z. Ren, A. J. Blumberg, and M. Walfish. Efficient RAM and control flow in verifiable outsourced computation. In NDSS 2015.
[51] M. Walfish and A. J. Blumberg. Verifying computations without reexecuting them. Comm. ACM, 58(2):74–84, 2015.
[52] Y. Zhang, D. Genkin, J. Katz, D. Papadopoulos, and C. Papamanthou. vSQL: Verifying arbitrary SQL queries over dynamic outsourced databases. In IEEE S&P 2017.
[53] Y. Zhang, D. Genkin, J. Katz, D. Papadopoulos, and C. Papamanthou. A zero-knowledge version of vSQL. Cryptology ePrint Archive, Report 2017/1146, 2017.

APPENDIX A
ARITHMETIC CIRCUITS AND MULTILINEAR EXTENSIONS

An arithmetic circuit C is a directed acyclic graph whose vertices are called gates and whose edges are called wires. Every in-degree-0 gate in C is labeled by a variable from a set of variables X = {x1,···,xn} and is referred to as an input gate. All other gates in C have in-degree 2, are labeled by elements from {+,×}, and are referred to as addition and multiplication gates, respectively. Every gate of out-degree 0 is called an output gate. In the following, we focus only on layered circuits and we assume that the output gates are ordered. We say that a circuit is layered if it can be divided into disjoint sets L1,···,Lk such that every gate g belongs to some set Li and all the wires of C connect gates in two consecutive layers (i.e., between Lj and Lj+1 for some j). We write C : F^n → F^k to indicate that C is an arithmetic circuit with n inputs and k outputs evaluated (as defined in a natural way) over a field F. We denote by |C| the number of gates in
the circuit C, by width_i(C) the number of gates in the i-th layer of C, and by width(C) the maximum width of C, i.e., width(C) = max_i {width_i(C)}.

Polynomial Decomposition. We use the following lemma when proving properties of our VPD protocol.

Lemma 1 ([39]). Let f : F^ℓ → F be a polynomial of variable degree d. For all t ∈ F^ℓ there exist efficiently computable polynomials q1, . . . , qℓ such that: f(x) − f(t) = ∑_{i=1}^{ℓ} (xi − ti)·qi(x), where ti is the i-th element of t.

Multilinear Extensions. For any function V : {0,1}^ℓ → F we define the multilinear extension Ṽ : F^ℓ → F of V as follows: Ṽ(x1,···,xℓ) = ∑_{b∈{0,1}^ℓ} ∏_{i=1}^{ℓ} X_{bi}(xi)·V(b), where bi is the i-th bit of b, X1(xi) = xi and X0(xi) = 1 − xi. Note that Ṽ is the unique polynomial that has degree at most 1 in each of its variables and satisfies Ṽ(x) = V(x) for all x ∈ {0,1}^ℓ.

Multilinear Extensions of Arrays. An array A = (a0,···,a_{n−1}) where ai ∈ F can be viewed as a function A : {0,1}^{log n} → F such that A(i) = ai for all 0 ≤ i ≤ n−1. In the sequel, we abuse the terminology of multilinear extensions by defining (in the natural way) a multilinear extension Ã of an array A. A useful property of multilinear extensions of arrays is the ability to efficiently combine them. That is, given 2^m arrays A1,···,A_{2^m} of equal length n, the multilinear extension of the array corresponding to their concatenation A = A1||···||A_{2^m} can be evaluated on a point x = (x1,···,x_{m+log n}) as

  Ã(x1,···,x_{m+log n}) = ∑_{i=0}^{2^m−1} ∏_{j=1}^{m} X_{ij}(xj)·Ãi(x_{m+1},···,x_{m+log n}),

where ij is the j-th bit of i and X_{ij}(xj) is defined as above.

APPENDIX B
CRYPTOGRAPHIC ASSUMPTIONS

Our constructions make use of the following assumptions.

Assumption 1 ([15], q-Strong Diffie-Hellman). For any PPT adversary A, the following probability is negligible:

  Pr[ bp ← BilGen(1^λ); s ←R Z*_p; σ = (bp, g^s, . . . , g^(s^q)) : (x, e(g,g)^(1/(s+x))) ← A(1^λ, σ) ].

The following is a direct generalization of Groth's q-PKE assumption [30] to multivariate polynomials.

Assumption 2 ((d,ℓ)-Power Knowledge of Exponent [52]). For any PPT adversary A there is a polynomial-time algorithm E (running on the same random tape) such that for all benign auxiliary inputs z ∈ {0,1}^{poly(λ)} the following probability is negligible:

  Pr[ bp ← BilGen(1^λ); s1,...,sℓ,α ←R Z*_p, s0 = 1;
      σ1 = {g^(∏_{i∈W} si)}_{W∈Wℓ,d}; σ2 = {g^(α·∏_{i∈W} si)}_{W∈Wℓ,d};
      σ = (bp, σ1, σ2, g^α);
      G × G ∋ (h, h̃) ← A(1^λ, σ, z); (a0, . . . , a_{|Wℓ,d|}) ← E(1^λ, σ, z)
      : e(h, g^α) = e(h̃, g) ∧ ∏_{W∈Wℓ,d} g^(aW·∏_{i∈W} si) ≠ h ].

In the above, we assume that z comes from a benign distribution (similar to [20], [29], [23]), in order to avoid the negative results of [16], [13]. Concretely, our proofs hold assuming the auxiliary input necessary for extraction comes from a benign distribution.

To simplify the exposition, we assume symmetric (Type I) pairings. However, since asymmetric pairings are more efficient in practice, our implementations use a version of our constructions based on asymmetric pairings; our assumptions can be re-stated for that setting in a straightforward manner.

APPENDIX C
THE SUM-CHECK PROTOCOL

Introduced in [37], the sum-check protocol allows a prover P to convince a verifier V that

  H = ∑_{b1∈{0,1}} ∑_{b2∈{0,1}} ··· ∑_{bℓ∈{0,1}} g(b1, b2,···,bℓ),

where g(x1,···,xℓ) is an ℓ-variate polynomial over some finite field F. While the direct computation of H would require V to evaluate g at least 2^ℓ times, V's work can be made polynomial in ℓ using the sum-check protocol, which we now describe. The protocol proceeds in ℓ rounds. During the first round, P sends V the following univariate polynomial: g1(x) = ∑_{b2,···,bℓ∈{0,1}} g(x, b2,···,bℓ). Next, V checks that the degree of x in g1 is at most the degree of x1 in g and that H = g1(0) + g1(1), rejecting if any of these checks fails. In case both checks pass, V sends P a uniform challenge r1. During the i-th round of the protocol, P sends the polynomial gi(x) = ∑_{bi+1,···,bℓ∈{0,1}} g(r1,···,ri−1, x, bi+1,···,bℓ). V then checks that gi−1(ri−1) = gi(0) + gi(1), rejecting otherwise. In case the check passes, V sends a uniform ri to P, to be used in the next round. At the final round, V accepts only if g(r1,···,rℓ) = gℓ(rℓ). Define the degree of each monomial in g as the sum of the powers of its variables. The total degree of g is defined as the maximal degree of any of its monomials.

Theorem 4 ([37]). For any ℓ-variate, total-degree-d polynomial g over a finite field F, the above-described sum-check protocol is an interactive proof for the (no-input) function ∑_{b1,···,bℓ∈{0,1}} g(b1,···,bℓ) with soundness d·ℓ/|F|. Moreover, V performs poly(ℓ) arithmetic operations over F and one evaluation of g on a random point r.

Remark 1. When g is a multilinear polynomial (the degree of each variable is at most 1, and the total degree is ℓ), the running time of P in round i of the sum-check protocol is min{O(m), O(2^(ℓ−i))}, where m is the total number of distinct monomials in g [19], [45], [47].

APPENDIX D
FORMAL DESCRIPTION OF THE CMT PROTOCOL

In this section, we describe the final part of the CMT protocol, which condenses claims to a single evaluation per circuit layer; we present the formal description of the CMT protocol (in Construction 3); and we state the corresponding theorem.

Condensing to a Single Claim Per Layer. Let γ : F → F^{s1} be the unique line defined by γ(0) = q1 and γ(1) = q2. The
Construction 3 (CMT protocol). Let C : F^n → F be a depth-d layered arithmetic circuit over a finite field F. Let x ∈ F^n be the input of C such that C(x) = 1. In order for the prover Pcmt to convince the verifier Vcmt that C(x) = 1, the protocol proceeds as follows.
1) Both parties set a0 = 1 and r0 = 0.
2) For i = 1, ..., d, the protocol proceeds as follows.
   a) Pcmt and Vcmt run the sum-check protocol for value a_{i−1} and polynomial f_{i−1,r_{i−1}} as per Equation (1). In the last step of the sum-check protocol, Vsc is supposed to evaluate f_{i−1,r_{i−1}} at a random point ρi. Psc then provides values (a1, a2) for which it claims that a1 = Ṽi(q1) and a2 = Ṽi(q2), where q1, q2 are the last 2si elements of ρi.
   b) Let γ : F → F^{si} be the line defined by γ(0) = q1 and γ(1) = q2. Pcmt sends the degree-si polynomial hi(x) = Ṽi(γ(x)). Next, Vcmt verifies that hi(0) = a1 and hi(1) = a2. In case both checks pass, Vcmt chooses ri′ ∈ F uniformly at random, sets ai = hi(ri′), ri = γ(ri′), and sends (ri, ai) to Pcmt.
3) Vcmt accepts if ad = Ṽd(rd), where Ṽd is the multilinear extension of the polynomial representing the input x.
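To make the sum-check sub-protocol invoked in step 2a concrete, the following self-contained Python sketch runs sum-check for a multilinear polynomial given by its evaluation table over {0,1}^ℓ: in each round the honest prover sends the degree-1 restriction in the current variable (two evaluations suffice), and the verifier checks round-by-round consistency before issuing a random challenge. The modulus and the evaluation-table representation are illustrative choices, not values from the paper; the per-round folding matches the O(2^{ℓ−i}) cost in Remark 1.

```python
import random

P = 2**61 - 1  # arbitrary prime modulus for the toy field

def sumcheck(table, p=P, rng=random.Random(0)):
    """Run sum-check for the multilinear polynomial whose values on
    {0,1}**l are given in `table` (len 2**l). Returns True iff the
    (honest) prover convinces the verifier that sum(table) is correct."""
    claim = sum(table) % p
    cur = [v % p for v in table]
    while len(cur) > 1:
        half = len(cur) // 2
        # Prover: restriction to the current variable, evaluated at 0 and 1
        # (degree 1, so two points determine the round polynomial).
        c0 = sum(cur[2*j] for j in range(half)) % p       # x_i = 0
        c1 = sum(cur[2*j + 1] for j in range(half)) % p   # x_i = 1
        # Verifier: round consistency check.
        if (c0 + c1) % p != claim:
            return False
        r = rng.randrange(p)                    # random challenge
        claim = (c0 + (c1 - c0) * r) % p        # new claim at x_i = r
        # Both parties fix x_i = r; the table halves each round,
        # cf. the O(2**(l-i)) bound of Remark 1.
        cur = [(cur[2*j] * (1 - r) + cur[2*j + 1] * r) % p
               for j in range(half)]
    # Final check: the verifier evaluates the polynomial at the random
    # point (here obtained by folding the table down to a single value).
    return cur[0] == claim
```

In Construction 3 this final evaluation is not done by the verifier directly; instead the prover supplies the claimed values a1, a2 of Ṽi, which are then condensed and checked recursively.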
Condensing to a Single Claim Per Layer. Let γ : F → F^{s1} be the unique line defined by γ(0) = q1 and γ(1) = q2. The CMT prover Pcmt sends a degree-s1 polynomial h claimed to be Ṽ1(γ(·)) (i.e., the restriction of Ṽ1 to the line γ). The CMT verifier Vcmt then checks that h(0) = a1 and that h(1) = a2. In case both checks pass, Vcmt picks a random point r1 and initiates a single execution of the sum-check protocol in order to verify that h(r1) = Ṽ1(γ(r1)). Thus, this condensing procedure reduces the total number of invocations of the sum-check protocol from O(2^d) to O(d).

So far, we have assumed that the circuit C has only a single output value y ∈ {0,1}. Larger outputs can be handled [47] by having the initial claim made by the prover be stated directly about the multilinear extension of the claimed circuit output. Formally, consider the following theorem regarding the CMT protocol presented in Construction 3.

Theorem 5 ([28], [19], [47], [45]). Let C : F^n → F^k be a depth-d layered arithmetic circuit over a finite field F. The protocol presented in Construction 3 is an interactive proof for the function computed by C with soundness O(d · log S/|F|), where S is the maximal number of gates per circuit layer. Moreover, P's running time is O(|C| log S) and the protocol uses O(d log S) rounds of interaction. Finally, if ãdd_i and m̃ult_i are computable in time O(polylog S) for all the layers of C, then the running time of the verifier V is O(n + k + d · polylog S).

The following remark is particularly useful in case the circuit C being evaluated has a highly regular repetitive structure.

Remark 2 ([45]). If C can be expressed as a composition of (i) parallel copies of a layered circuit C′ whose maximum number of gates at any layer is S′, and (ii) a subsequent "aggregation" layered circuit C″ of size O(|C|/log |C|), the running time of P is reduced to O(|C| log S′).

APPENDIX E
FURTHER REDUCING THE COST OF MEMORY CHECKING

Even after our RAM optimizations from Section III, the circuit C contains T copies of Cmem. In practice, however, it is almost certain that not every cycle will perform a memory access. E.g., even for a program that consists of a single for loop that simply loads a memory location per repetition, the total percentage of memory accesses is 25% (one instruction for the memory load, plus three for counter increase, loop bound check, and jump). Motivated by this, we exploit the circuit-independent pre-processing of our argument to modify C so that it only contains αT copies of Cmem, where α is the percentage of memory accesses over the total steps.

In order to achieve this, we split the witness into two separate parts. The first contains tr and tr∗∗ sorted by time and instruction type, and the second contains tr∗. Recall that after the optimization in Section III, tr∗ only contains A_{T+ℓ}, ..., A_{2T+ℓ}, which is a permutation of S1, ..., ST sorted by accessed memory addresses. Then, by our design, if Ii is not a memory load/store instruction, we set the accessed memory address of Si to 0 and all the values in Si to 0 before sorting. In this way, the first (1−α)T states in A_{T+ℓ}, ..., A_{2T+ℓ} are all zeros (assuming the real memory addresses start from 1) and there is no need to check anything for these states, as they are not memory operations. Because of this layout, the prover now only includes A_{T+ℓ+αT}, ..., A_{2T+ℓ} in tr∗ and tells the verifier the number of non-memory operations. With this information, it suffices to validate that the new tr∗ is a permutation of the non-zero states in tr, using CMT on circuit C′ together with the technique described in [52] for handling circuits that receive inputs at different levels. With this optimization, we manage to further reduce the number of copies of Cmem from T to αT, which is a significant improvement in practice. However, the verifier now needs to run two VPD instances (one for each part of the witness). See [52] for a more detailed explanation.

APPENDIX F
COMPLEXITY OF THE MODIFIED CMT

We now analyze the complexity of the sum-check protocol of Equation (2) in Section IV-A. For the first 2s_{i+1} rounds, there are at most BSi monomials per round, as there are at most BSi gates in the i-th layer of the circuit and the number of non-zero monomials in ãdd_{i+1} and m̃ult_{i+1} is bounded by the number of gates. By Remark 1, this takes O(BSi) arithmetic operations per round, so the complexity for these rounds is O(BSi log S_{i+1}). For the remaining rounds, by Remark 1, P's running time is O(2^{⌈log B⌉−j}) in round 2s_{i+1} + j (j = 1, ..., ⌈log B⌉), so the complexity for these rounds is O(B). Thus, the overall complexity is dominated by the first part, i.e., it is O(BSi log S_{i+1}).

APPENDIX G
ANALYSIS OF OUR NEW VPD SCHEME

Theorem 6. Under Assumptions 1 and 2, Construction 1 is an extractable VPD scheme. For a variable-degree-d ℓ-variate polynomial f ∈ F containing m monomials, algorithm KeyGen runs in time O((ℓ(d+1) choose ℓd)), Commit in time O(m), Evaluate in time O(ℓdm), Ver in time O(ℓ), and CheckCom in time O(1). If d = 1, Evaluate runs in time O(2^ℓ). The commitment produced by Commit consists of O(1) group elements, and the proof produced by Evaluate consists of O(ℓ) elements of G.
Proof. The completeness requirement follows immediately from the construction of (KeyGen, Commit, Evaluate, Ver). We now prove the extractability property. Let A be a PPT adversary that, on input (1^λ, pp), where (pp, vp) is the output of KeyGen(1^λ, ℓ, d), outputs a commitment com∗ such that CheckCom(com∗, vp) accepts. This implies that e(c1, g^α) = e(c2, g), where com∗ = (c1, c2). By Assumption 2, there exists a PPT extractor E′ for A such that, upon the same input as A and with access to the same random tape, E′ outputs a0, ..., a_{|Wℓ,d|} ∈ F such that ∏_{W∈Wℓ,d} g^{aW·∏i∈W si} = c1, except with negligible probability. Note that the coefficients (a0, ..., a_{|Wℓ,d|}) can be encoded as a variable-degree-d, ℓ-variate polynomial that has the ai as its monomial coefficients. We now build the extractor E:
1) Upon input (1^λ, pp), E runs E′ on the same input.
2) E tries to parse the output of E′ as a0, ..., a_{|Wℓ,d|} ∈ F and aborts if this fails.
3) E outputs f′, where f′ ∈ F is the polynomial with coefficients a0, ..., a_{|Wℓ,d|}.

Note that E is PPT as E′ is PPT and it only performs polynomially many operations in F. It remains to argue that f′ is a valid pre-image of Commit except with negligible probability. Observe that, if E does not abort, it follows from the construction of Commit that Commit(f′, pp) = com∗, where com∗ is the output commitment of A. By Assumption 2, the probability that the output of E′ is not a valid set of coefficients is negligible, which concludes the proof of extractability.

Next, we prove the soundness property. Let A be a PPT adversary that wins the soundness game with non-negligible probability ε. For i = 1, ..., ℓ we define an adversary Ai that receives the same input as A and executes the same code, but outputs only (πi, πi′) ∈ π∗ (where π∗ is the proof output by A). Moreover, since A is PPT, all these adversaries are also PPT. Thus, for i = 1, ..., ℓ, from Assumption 2 there exists a PPT Ei (running on the same random tape as Ai) which on input (1^λ, pp) outputs a_{0,i}, ..., a_{|Wℓ,d|,i} ∈ F such that the following holds except with negligible probability: if e(πi, g^α) = e(πi′, g), then ∏_{W∈Wℓ,d} g^{a_{W,i}·∏j∈W sj} = πi. Note that the coefficients (a_{0,i}, ..., a_{|Wℓ,d|,i}) for i = 1, ..., ℓ can always be encoded as a variable-degree-d, ℓ-variate polynomial, which we denote by qi′(x) for the vector of indeterminates x = (x1, ..., xℓ).

We construct an adversary B that breaks Assumption 1. On input (1^λ, p, G, GT, e, g, g^s, g^{s^2}, ..., g^{s^{ℓ·d}}), B does the following:

Parameter Generation. B implicitly sets s1 = s and, for i = 2, ..., ℓ, he chooses ri ∈ F uniformly at random and sets (also implicitly) si = s·ri. Then he chooses uniformly at random a value α ∈ F. Next, B needs to generate the terms in P = {g^{∏i∈W si}, g^{α·∏i∈W si}}_{W∈Wℓ,d}. Since the exponent of each term is a product of at most ℓ·d factors, where each factor is one of the values si = s·ri, it can be written as a polynomial in s of degree at most ℓ·d. Therefore, B can compute these terms from the values g, g^s, g^{s^2}, ..., g^{s^{ℓ·d}} and α. Finally, B runs A on input (1^λ, pp), where pp = (p, G, GT, e, g, g^α, P).

Query Evaluation. Upon receiving (f∗, t∗, y∗, π∗) from A, B first runs Commit(f∗, pp) to receive com := (c1, c2) and then runs Ver(com, t∗, y∗, π∗, vp), where vp = (1^λ, p, G, GT, e, g, g^s, g^{s^2}, ..., g^{s^{ℓ·d}}, g^α). If Ver rejects, B aborts; else he runs the extractors E1, ..., Eℓ (defined above) on the same input as A and receives polynomials q1′, ..., qℓ′. If for the output of any of the Ei it holds that ∏_{W∈Wℓ,d} g^{a_{W,i}·∏j∈W sj} ≠ πi, B aborts. Otherwise, let δ = y∗ − f∗(t∗) and let Q(x) be the polynomial over F defined as Q(x) := f∗(x) − f∗(t∗) − ∑_{i=1}^{ℓ} (xi − ti)·qi′(x), where t∗ = (t1, ..., tℓ). B picks τ ∈ F uniformly at random; if g^τ = g^{−s}, he sets τ ← τ + 1. He then computes the polynomial Q′(x) := Q(x)/(τ + x1) and finally outputs (τ, e(g, g)^{δ^{−1}·Q′(s1,...,sℓ)}) as a challenge tuple for Assumption 1.

Since s1 = s, s2 = r2·s, ..., sℓ = rℓ·s, we have Q′(s1, ..., sℓ) = Q″(s), where Q″ is an efficiently computable univariate polynomial of degree ℓ·d; hence e(g, g)^{δ^{−1}·Q′(s1,...,sℓ)} is computable from (1^λ, p, G, GT, e, g, g^s, g^{s^2}, ..., g^{s^{ℓ·d}}). B is clearly PPT since all of the Ei are PPT and he performs polynomially many operations in F, G, GT. Next, we analyze the success probability of B. Recall that, by assumption, A succeeds in violating soundness with probability ε. We observe that, conditioned on not aborting, B's output is always a valid tuple for breaking Assumption 1. Let us argue why this is true. Since verification succeeds, it holds that e(c1/g^{y∗}, g) = ∏_{i=1}^{ℓ} e(g^{si−ti}, πi); since extraction succeeds, this can be rewritten as

e(g, g)^{f∗(s1,...,sℓ) − δ − f∗(t∗)} = ∏_{i=1}^{ℓ} e(g^{si−ti}, g^{qi′(s1,...,sℓ)})
e(g, g)^δ = e(g, g)^{f∗(s1,...,sℓ) − f∗(t∗)} · ∏_{i=1}^{ℓ} e(g^{si−ti}, g^{−qi′(s1,...,sℓ)})
e(g, g)^δ = e(g, g)^{f∗(s1,...,sℓ) − f∗(t∗) − ∑_{i=1}^{ℓ} (si−ti)·qi′(s1,...,sℓ)}.

By the definition of Q and Q′ it follows that

e(g, g)^δ = e(g, g)^{Q(s1,...,sℓ)}
e(g, g)^{δ/(τ+s1)} = e(g, g)^{Q(s1,...,sℓ)/(τ+s1)} = e(g, g)^{Q′(s1,...,sℓ)}
e(g, g)^{1/(τ+s1)} = e(g, g)^{δ^{−1}·Q′(s1,...,sℓ)}.

Thus, the final piece needed to conclude the proof is to bound the probability that B aborts. Note that, conditioned on A winning, B will only abort if extraction fails, which can only happen with negligible probability neg(λ). This holds since, if verification succeeds, it must be that e(πi′, g) = e(πi, g^α) for i = 1, ..., ℓ, and in this case, by Assumption 2, extraction for any of E1, ..., Eℓ fails with negligible probability. Since ℓ is polynomial in λ, it follows that the probability that any of them fails (which, by a union bound, is at most the sum of the individual failure probabilities) is also negligible. Finally, let us argue that the polynomial division Q(x)/(τ + x1) is always possible. Recall that for polynomials defined over finite fields, division is always possible assuming that the dividend's degree is at least as large as the divisor's; moreover, the degree of the quotient is at most that of the dividend, and the degree of the remainder is strictly smaller than that of the divisor.
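Since B reduces the division to the univariate case (Q′(s1, ..., sℓ) = Q″(s)), the division by the linear factor can be illustrated with univariate synthetic division over F_p. The sketch below (with an arbitrary toy modulus, not a parameter from the scheme) returns a quotient of degree deg(Q) − 1 and a constant remainder equal to Q(−τ), matching the degree bounds just discussed.

```python
P = 2**61 - 1  # arbitrary prime modulus for the toy field

def divide_by_linear(coeffs, tau, p=P):
    """Divide Q(x) = coeffs[0] + coeffs[1]*x + ... by (x + tau) over F_p
    via synthetic division. Returns (quotient coefficients, remainder):
    Q(x) = (x + tau) * quotient(x) + remainder, with deg(quotient) =
    deg(Q) - 1 and the remainder a constant (namely Q(-tau))."""
    q = [0] * (len(coeffs) - 1)
    acc = 0
    # Horner-style pass from the leading coefficient down; the root of
    # the divisor (x + tau) is -tau.
    for i in range(len(coeffs) - 1, 0, -1):
        acc = (acc * (-tau) + coeffs[i]) % p
        q[i - 1] = acc
    rem = (acc * (-tau) + coeffs[0]) % p  # equals Q(-tau)
    return q, rem
```

For example, dividing x^2 + 3x + 5 by (x + 2) yields quotient x + 1 and remainder 3, and indeed (x + 2)(x + 1) + 3 = x^2 + 3x + 5.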
Let us assume for contradiction that Q(x) is a constant polynomial. Since e(g, g)^δ = e(g, g)^{Q(s1,...,sℓ)} and e(g, g) is a generator of GT, it must be that Q(x) := δ, therefore we can write

−δ = ∑_{i=1}^{ℓ} (xi − ti)·qi′(x) − f∗(x) + f∗(t∗)
f∗(x) − δ − f∗(t∗) = ∑_{i=1}^{ℓ} (xi − ti)·qi′(x)
f∗(x) − y∗ = ∑_{i=1}^{ℓ} (xi − ti)·qi′(x).

From the above relation it follows that t∗ is a root of the polynomial f′ := f∗(x) − y∗, i.e., f′(t∗) = 0, which implies that f∗(t1, ..., tℓ) = y∗. Thus, in this case, y∗ is the correct evaluation of f∗ on t∗, i.e., δ = 0 and A did not cheat. In all other cases, the polynomial division is possible.

From the above analysis it follows that the probability that B succeeds is at least (1 − neg(λ))·ε. By assumption, ε is the non-negligible probability that A wins the soundness game; therefore B's success probability is also non-negligible. This contradicts Assumption 1 and our proof is complete.

Asymptotic Analysis. The claims for the general polynomial case follow directly from the analysis of [39]. For d = 1, i.e., for multilinear polynomials, we prove the tighter bound for the runtime of Evaluate below.

Recall that during Evaluate the prover computes polynomials qi(xi, ..., xℓ) for i = 1, ..., ℓ such that f(x1, ..., xℓ) = ∑_{i=1}^{ℓ} (xi − ti)·qi(xi, ..., xℓ) + f(t1, ..., tℓ), as well as the proof π = {g^{qi(si,...,sℓ)}, g^{α·qi(si,...,sℓ)}}_{i=1}^{ℓ}. We start by computing q1(x1, ..., xℓ). Since the degree of every variable is at most 1, the multilinear polynomial f can be written as f(x1, ..., xℓ) = g(x2, ..., xℓ) + x1·h(x2, ..., xℓ), where g(x2, ..., xℓ) and h(x2, ..., xℓ) are multilinear polynomials in the variables x2, ..., xℓ. In this way, f can be decomposed as

f(x1, ..., xℓ) = g(x2, ..., xℓ) + x1·h(x2, ..., xℓ)
= (g(x2, ..., xℓ) + t1·h(x2, ..., xℓ)) + (x1 − t1)·h(x2, ..., xℓ)
= R1(x2, ..., xℓ) + (x1 − t1)·h(x2, ..., xℓ).

We set q1(x1, ..., xℓ) = h(x2, ..., xℓ) (which means q1 contains no monomial with x1), and proceed to decompose the multilinear polynomial R1(x2, ..., xℓ) with ℓ−1 variables in the same way as f to compute q2(x2, ..., xℓ). Regarding the complexity of this, note that both g(x2, ..., xℓ) and h(x2, ..., xℓ) contain at most 2^{ℓ−1} monomials. Therefore, it takes 2^{ℓ−1} additions and multiplications to compute q1(x1, ..., xℓ) and R1(x2, ..., xℓ), and 2^{ℓ−1} exponentiations to generate g^{q1(s1,...,sℓ)} and g^{α·q1(s1,...,sℓ)} in the proof, respectively. The exact same reasoning applies for all of q3, ..., qℓ. At the last step, after computing qℓ(xℓ), the remaining constant term is equal to the answer f(t1, ..., tℓ). In general, in the i-th step, we decompose R_{i−1}(xi, ..., xℓ) with ℓ−i+1 variables in the same way as above to compute qi(xi, ..., xℓ) and Ri(x_{i+1}, ..., xℓ), and the complexity is O(2^{ℓ−i}). Thus, the total complexity of computing q1, ..., qℓ is O(2^{ℓ−1}) + O(2^{ℓ−2}) + ... = O(2^ℓ). The polynomial evaluation in order to get the answer takes the same time. Each pair πi, πi′ is computed with two exponentiations; thus the overall running time is O(2^ℓ).

APPENDIX H
MICROBENCHMARKS

TABLE V: Per-gate (per-input) prover time for our VPD and CMT, Buffet, and vnTinyRAM for the last 4 RAM programs in the benchmark (same order and size as in Table III). Time reported in µs.

Benchmark | VPD (Time/Input) | CMT (Time/Gate) | Buffet (Time/Gate) | vnTinyRAM (Time/Gate)
#2        | 11.32            | 5.42            | 77.50              | 72.20
#3        | 13.17            | 6.36            | 68.34              | 72.15
#4        | 13.23            | 6.18            | 74.97              | 72.33
#5        | 12.90            | 5.77            | 72.10              | 72.37

Verifiable Polynomial Delegation. Table V shows the prover time of our implementation of the VPD from Section IV-B. The prover time is about 12µs per input gate, which is about 8× faster than that of [52]. This is due to (i) our improved VPD construction (accounting for around 2-4×) and (ii) the fact that 85% of the inputs used in our RAM reduction are field elements that encode single-bit values (due to bit decomposition, register indices, and flags), which leads to faster exponentiation times for the VPD prover.

CMT Protocol. Next, we evaluate the performance of our CMT protocol. As can be seen in Table V, the average time required per gate for the CMT prover is about 6µs, which is about 4× slower than the 1.7µs reported in [52]. This is because we implemented our new CMT protocol from Section IV-A, which supports circuits with different copies of sub-circuits, while [52] uses the CMT protocol for regular circuits. Both the per-input time for the VPD protocol and the per-gate time for the CMT protocol are much faster than the per-gate times of Buffet and vnTinyRAM.

Circuit Generator. Finally, we report the number of gates required by our reduction to verify a TinyRAM cycle. We measure this by dividing the total number of gates of the circuits produced in the experimental evaluation of Section V-A by the number of TinyRAM steps. For our tested programs this circuit contained about 2500 gates, 600 of which are multiplications (for comparison, vnTinyRAM requires roughly 2000 multiplication gates, as reported in [11]). Notice that the work of [11] only needs to report the number of multiplication gates, while we must report the total number of gates.
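The recursive decomposition used in the Asymptotic Analysis of Appendix G (f = R_i + (x_i − t_i)·h at each step) can be sketched in Python for a multilinear f given by its evaluation table over {0,1}^ℓ; each qi is returned as an evaluation table over the remaining variables, and the final constant is f(t). The table representation and modulus are illustrative choices, not those of the actual implementation.

```python
P = 2**61 - 1  # arbitrary prime modulus for the toy field

def decompose(table, t, p=P):
    """Given the evaluation table of a multilinear f on {0,1}**l and a
    point t = (t_1, ..., t_l), compute tables for q_1, ..., q_l and the
    value f(t), so that f(x) = sum_i (x_i - t_i) * q_i + f(t)."""
    qs = []
    cur = [v % p for v in table]
    for ti in t:
        half = len(cur) // 2
        # Split f = g + x_i * h: g is f with x_i = 0, h is the x_i coefficient.
        g = [cur[2*j] for j in range(half)]
        h = [(cur[2*j + 1] - cur[2*j]) % p for j in range(half)]
        qs.append(h)  # q_i = h, which contains no monomial with x_i
        # R_i = g + t_i * h, a multilinear poly in the remaining variables.
        cur = [(g[j] + ti * h[j]) % p for j in range(half)]
    return qs, cur[0]  # cur[0] == f(t)
```

Each fold halves the table, so the total work is O(2^{ℓ−1}) + O(2^{ℓ−2}) + ... = O(2^ℓ), matching the bound derived above.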