Abstract
We explore time-memory and other tradeoffs for memory-hard functions, which are supposed to impose significant computational and time penalties if less memory is used than intended. We analyze three finalists of the Password Hashing Competition: Catena, which was presented at Asiacrypt 2014, yescrypt and Lyra2.
We demonstrate that Catena’s proof of tradeoff resilience is flawed, and attack it with a novel precomputation tradeoff. We show that using \(M^{4/5}\) memory instead of M incurs no time penalty and reduces the AT cost by a factor of 25. We further generalize our method to a wide class of schemes with predictable memory access. For a wide class of data-dependent schemes, which address memory unpredictably, we develop a novel ranking tradeoff and show how to decrease the time-memory and the time-area products by significant factors. We then apply our method to yescrypt and Lyra2, also exploiting the iterative structure of their internal compression functions.
The designers confirmed our attacks and responded by adding a new mode for Catena and tweaking Lyra2.
1 Introduction
Memory-hard functions are a fast emerging trend which has become a popular remedy to hardware-equipped adversaries in various applications: cryptocurrencies, password hashing, key derivation, and more generic Proof-of-Work constructions. It was motivated by the rise of various attack techniques, which can be commonly described as optimized exhaustive search. In cryptocurrencies, the hardware arms race made Bitcoin mining [29] on regular desktops tremendously inefficient, as the best mining rigs spend 30,000 times less energy per hash than x86 desktops/laptops (Footnote 1). This causes major centralization of the mining efforts, which goes against the democratic philosophy behind the Bitcoin design. This in turn prevents wide adoption and use of such cryptocurrencies in the economy, limiting the current activities in this area to mining and hoarding, with negative effects on the price. Restoring the ability of CPU or GPU mining by the use of memory-hard proof-of-work functions may have a dramatic effect on cryptocurrency adoption and use in the economy, for example as a form of decentralized micropayments [15]. In password hashing, numerous leaks of hash databases triggered the wide use of GPUs [3, 34] and FPGAs [27] for password cracking with a dictionary. In this context, constructions that intensively use a lot of memory seem to be a countermeasure. The reasons are that memory operations have very high latency on GPU and that memory chips are quite large and thus expensive in FPGA and ASIC environments compared to a logic core, which computes, e.g., a regular hash function.
Memory-intensive schemes, which bound the memory bandwidth only, were suggested earlier by Burrows et al. [8] and Dwork et al. [17] in the context of spam countermeasures. It was quickly realized that to be a real countermeasure, the amount of memory must also be bounded [18], so that memory cannot be easily traded for computations, time, or other resources that are cheaper on certain architectures. Schemes that are resilient to such tradeoffs are called memory-hard [21, 30]. In fact, the constructions in [18] are so strong that even a tiny memory reduction results in a huge computational penalty.
Disadvantage of Classical Constructions and New Schemes. The provably tradeoff-resilient superconcentrators [32] and their applications in [18, 19] have serious performance problems. They are terribly slow for modern memory sizes. A superconcentrator requiring N blocks of memory makes \(O(N\log N)\) calls to F. As a result, filling, e.g., 1 GB of RAM with 256-bit blocks would require dozens of calls to F per block (\(C\log N\) calls for some constant C). This would take several minutes even with lightweight F and is thus intolerable for most applications like web authentication or cryptocurrencies. Using less memory, e.g., several megabytes, does not effectively prohibit hardware adversaries.
It has been an open challenge to construct a reasonably fast and tradeoff-resilient scheme. After the seminal paper by Dwork et al. [18], the first important step was made by Percival, who suggested scrypt [30]. The idea of scrypt was quite simple: fill the memory by an iterative hash function and then make a pseudo-random walk on the blocks, using the block value as the address for the next step. However, the entire design is somewhat sophisticated, as it employs a stack of subfunctions and a number of different crypto primitives. Under certain assumptions, Percival proved that the time-memory product is lower bounded by some constant. The scrypt function is used inside the cryptocurrency Litecoin [4] with a 128 KB memory parameter and has been adopted in an IETF draft for key derivation [5]. scrypt is a notable example of data-dependent schemes, where the memory access pattern depends on the input; this property enabled Percival to prove a lower bound on the adversary’s costs. However, the performance and/or the tradeoff resilience of scrypt are apparently not sufficient to discourage hardware mining: Litecoin ASIC miners are more efficient than CPU miners by a factor of 100 [1].
The need for even faster, simpler, and possibly more tradeoff-resilient constructions was further emphasized by the ongoing Password Hashing Competition [2], which has recently selected 9 finalists out of the 24 original submissions. Notable entries are Catena [20], just presented at Asiacrypt 2014 with a security proof based on [26], and yescrypt and Lyra2 [25], which both claim performance of up to 1 GB/sec and which were quickly adopted within a cryptocurrency proof-of-work [7]. The tradeoff resilience of these constructions has not been challenged so far. It is also unclear how possible tradeoffs would translate to the costs of an actual attack.
Our Contributions. We present a rigorous approach and a reference model to estimate the amortized costs of password brute-force on special hardware using full-memory algorithms or time-space tradeoffs. We show how to evaluate the adversary’s gains in terms of area-time and time-memory products via computational complexity and latency of the algorithm.
Then we present our tradeoff attacks on the latest versions of Catena and yescrypt, and the original version of Lyra2. We then generalize them to wide classes of data-dependent and data-independent schemes. For Catena we analyze the faster Dragonfly mode and show that the original security proof for it is flawed and that the computation-memory product can be kept constant while reducing the memory. For ASIC-equipped adversaries we show how to reduce the area-time product (abbreviated further as AT) by a factor of 25 under reasonable assumptions on the architecture. The attack algorithm is then generalized as a precomputation method for a wide class of data-independent schemes.
Then we consider data-dependent schemes and present the first generic tradeoff strategy for them, which we call the ranking method. Our method easily applies to yescrypt and then to the second phase of Lyra2, both taken with minimally secure time parameters. We further exploit the incomplete diffusion in the core primitives of these designs, which reduces the time-memory and time-area products for both designs.
Altogether, we show how to decrease the time-memory product by a factor of 2 for yescrypt and by a factor of 8 for Lyra2. Our results are summarized in Table 1. To the best of our knowledge, our methods are the first generic attacks so far on data-dependent or data-independent schemes (Footnote 2).
Related Work. So far there have been only a few attempts to develop tradeoff attacks on memory-hard functions. A simple tradeoff for scrypt has been known in folklore and was recently formalized in [20]. Alwen and Serbinenko analyzed a simplified version of Catena in [9]. The designers of Lyra2 and Catena attempted to attack their own designs in the original submissions [20, 25]. A simple analysis of Catena was made in [16].
Paper Outline. We introduce necessary definitions and metrics in Sect. 2. We attack Catena-Dragonfly in Sect. 3 and generalize this method in Sect. 4. Then we present a generic ranking algorithm for data-dependent schemes in Sect. 5 and attack yescrypt with this method in Sect. 6. The attack on Lyra2 is quite sophisticated and we leave it for Appendix A.
2 Preliminaries
2.1 Syntax
Let \(\mathcal {G}\) be a hash function that takes a fixed-length string I as input and outputs a tag H. We consider functions that iteratively fill and overwrite memory blocks \(X[1],X[2],\ldots , X[M]\) using a compression function F:
$$\begin{aligned} X[j]&= f_j(I),&\quad 1\le j\le s;\\ X[i_j]&= F(X[\phi _1(j)], X[\phi _2(j)],\ldots , X[\phi _k(j)]),&\quad s< j\le T, \end{aligned} \qquad (1)$$
where \(i_j\) is the index of the memory block written at step j, \(\phi _i\) are some indexing functions referring to some already filled blocks, and \(f_j\) are auxiliary hash functions (similar to F) filling the initial s blocks for some positive s.
We say that the function makes p passes over the memory, if \(T = pM\). Usually p and M are tunable parameters which are responsible for the total running time and the memory requirements, respectively.
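As a concrete (and purely illustrative) instance of this framework, the following Python sketch fills and overwrites a memory of M blocks over p passes with \(s=1\) and \(k=2\); the compression function F and the indexing rule \(\phi \) are placeholders invented for the example and do not correspond to any scheme analyzed in this paper.

```python
import hashlib

def F(*blocks: bytes) -> bytes:
    """Placeholder compression function standing in for the scheme's F."""
    h = hashlib.blake2b()
    for b in blocks:
        h.update(b)
    return h.digest()

def phi(j: int, M: int) -> int:
    """Placeholder data-independent indexing rule: refers to an already filled block."""
    return (7 * j + 1) % min(j, M)

def G(I: bytes, M: int = 256, p: int = 2) -> bytes:
    """Toy instance of scheme (1): T = p*M steps over a memory of M blocks (s = 1, k = 2)."""
    X = [b""] * M
    X[0] = F(I)                                     # initial block filled by an auxiliary function
    for j in range(1, p * M):
        X[j % M] = F(X[(j - 1) % M], X[phi(j, M)])  # overwrite position j mod M
    return F(X[-1])                                 # tag derived from the last block

print(G(b"password" + b"salt").hex())
```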
2.2 Time-Space Tradeoff
Let \(\mathcal {A}\) be an algorithm that computes \(\mathcal {G}\). The computational complexity \(C(\mathcal {A})\) is the total number of calls to F and \(f_i\) made by \(\mathcal {A}\), averaged over all inputs to \(\mathcal {G}\). We do not consider possible complexity amortization over successive calls to \(\mathcal {A}\). The space complexity \(S(\mathcal {A})\) is the peak number of blocks (or their equivalents) stored by \(\mathcal {A}\), again averaged over all inputs to \(\mathcal {G}\). Suppose that \(\mathcal {A}\) can be represented as a directed acyclic graph with vertices being calls to F. Then the latency \(L(\mathcal {A})\) is the length of the longest chain in the graph from the input to the output. Therefore, \(L(\mathcal {A})\) is the minimum time needed to run \(\mathcal {A}\) assuming unlimited parallelism and instant memory access.
A straightforward implementation of the scheme (1) results in an algorithm with computational complexity T, latency \(L=T\), and space complexity M. However, it might be possible to compute \(\mathcal {G}\) using less memory. According to [24], any function that is described by Eq. (1) and whose reference block indices \(\phi _j(i)\) are known in advance can be computed using \(c_k\frac{T}{\log T}\) memory blocks for some constant \(c_k\) depending on the number k of input blocks to F. Therefore, any p-pass function can be computed using less than \(M=T/p\) memory for sufficiently large M.
Let us fix some default algorithm \(\mathcal {A}\) of \(\mathcal {G}\) with \((C_1,M_1,L_1)\) being the computational complexity, space complexity, and latency of \(\mathcal {A}\), respectively. Suppose that there is a time-space tradeoff given by a family of algorithms (Footnote 3) \(\mathcal {B}= \{B_q\}\) that compute \(\mathcal {G}\) using \(\frac{M_1}{q}\) space for different q. The idea is to store only one out of q memory blocks on average and recompute the missing blocks whenever they are needed. We then define the computational penalty as \(C_{\mathcal {B}}(q) = \frac{C(B_q)}{C_1}\) and the latency penalty as \(L_{\mathcal {B}}(q) = \frac{L(B_q)}{L_1}\).
2.3 Attackers and Cost Estimates
We consider the following attack. Suppose that \(\mathcal {G}\) with time and memory parameters (T, M) is used as a password hashing function with \(I=(P,S)\), where P is a secret password and S is a public salt. An attacker gets H and S (e.g., from a database leak) and tries to recover P. He attempts a dictionary attack: given a list L of most probable passwords, he runs \(\mathcal {G}\) on every \(P\in L\) and checks the output.
Definition 1
Let \(\varPhi \) be a cost function defined over a space of algorithms. Let also \(\mathcal {G}_{T,M}\) be a hash function with a fixed default algorithm \(\mathcal {A}_0\). Then \(\mathcal {G}_{T,M}\) is called \((\alpha ,\varPhi )\)-secure if for every algorithm \(\mathcal {B}\) for \(\mathcal {G}_{T,M}\)
$$\varPhi (\mathcal {B}) \ge \alpha \cdot \varPhi (\mathcal {A}_0).$$
In other words, \(\mathcal {G}_{T,M}\) cannot be computed more than \(\frac{1}{\alpha }\) times cheaper than with the default algorithm.
The cost function is more difficult to determine. We suggest evaluating the amortized computing costs for a single password trial. Depending on the architecture, the costs vary significantly for the same algorithm \(\mathcal {A}\). For ASIC-equipped attackers, who can use parallel computing cores, it is widely suggested that the costs can be approximated by the time-area product \(\mathrm {AT}\) [9, 11, 28, 35]. Here T is the time complexity of the algorithm used, and A is the sum of the area needed to implement the memory cells and the area needed to implement the cores. Let the area needed to implement one block of memory be the unit of area measurement. Then, in order to know the total area, we need the core-memory ratio \(R_c\): the number of memory blocks that can be placed on the area taken by one core.
Suppose that the adversary runs algorithm \(B_q\) using M / q memory and l computing cores, thus having computational complexity \(C_q = C(B_q)\). The running time is lower bounded by the latency \(L_q = L(B_q)\) of the algorithm. If \(L_q<C_q/l\), i.e. the computing cores cannot finish the work in the minimum time, then the time T can be approximated by \(C_q/l\), and the costs are estimated as follows:
$$\mathrm {AT} \approx \Bigl(\frac{M}{q} + l\cdot R_c\Bigr)\cdot \frac{C_q}{l} = \frac{M\, C_q}{q\, l} + R_c\cdot C_q. \qquad (2)$$
We see that the costs drop as l increases. Therefore, the adversary is motivated to push l to its maximum value \(C_q/L_q\). Thus we obtain the final approximation of the costs:
$$\mathrm {AT}_q = \frac{M}{q}\cdot L_q + R_c\cdot C_q. \qquad (3)$$
Here we assume unlimited memory bandwidth. Taking the bandwidth restrictions into account is even more difficult, as they depend on the relative frequency of the computing core and the memory as well as on the architecture of the memory bus. Moreover, the memory bandwidth of the algorithm depends on the implementation and is not easy to evaluate. We leave rigorous memory bandwidth evaluation and restrictions for future work.
We recall that the value \(R_c\) depends on the architecture, the function F, and the block size. To give a concrete example, suppose that the block is 64 bytes and F is the Blake-512 hash function. We use the following reference implementations (Footnote 4):
-
The 50-nm DRAM [22], which takes 550 mm\({}^2\) per GByte;
-
The 65-nm Blake-512 [23], which takes about 0.1 mm\({}^2\).
Then the core-memory ratio is \(R_c = \frac{2^{24} \cdot 0.1}{550} \approx 3000\), as 1 GByte of memory contains \(2^{24}\) 64-byte blocks. For more lightweight hash functions this ratio will be smaller.
The actual functions F in the designs that we attack are often ad hoc and have not been implemented in hardware yet. Moreover, the numbers may change when going to a smaller feature size. To make our estimates of the attack costs architecture-independent, we introduce a simpler metric, the time-memory product \(\mathrm {TM}\):
$$\mathrm {TM}_q = \frac{M}{q}\cdot L_q, \qquad (4)$$
which for not too high computational penalties gives a good approximation of \(\mathrm {AT}\).
In our tradeoff attacks, we are mainly interested in comparing the AT and TM costs of \(B_q\) with those of the default algorithm \(\mathcal {A}\). Thus we define the AT ratio of \(B_q\) as \(\frac{\mathrm {AT}_{B_q}}{\mathrm {AT}_{\mathcal {A}}}\), and the TM ratio of \(B_q\) as \(\frac{\mathrm {TM}_{B_q}}{\mathrm {TM}_{\mathcal {A}}}\).
We note that for the same \(\mathrm {TM}\) value the implementation with less memory is preferable, as its design and production will be cheaper. Thus we explore how much the memory can be reduced keeping the AT or TM costs below those of the default algorithm.
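To get a feeling for these metrics, the short sketch below plugs hypothetical complexity numbers into the AT and TM estimates above; the memory size matches the reference setting of Sect. 2.3 (\(2^{24}\) 64-byte blocks, \(R_c\approx 3000\)), while the penalty values are made up for the example.

```python
def at_cost(memory_blocks: float, calls: float, latency: float, r_c: float = 3000.0) -> float:
    """AT cost with the number of cores pushed to calls/latency:
    total area (memory + cores) times running time."""
    return memory_blocks * latency + calls * r_c

def tm_cost(memory_blocks: float, latency: float) -> float:
    """Time-memory product: memory times latency."""
    return memory_blocks * latency

M = 2**24                  # 1 GB of 64-byte blocks
C1 = L1 = M                # default algorithm: one call per block, fully sequential

# A hypothetical tradeoff point B_q: memory reduced by q, with invented penalties.
q, c_pen, l_pen = 8, 50.0, 10.0
at_ratio = at_cost(M / q, c_pen * C1, l_pen * L1) / at_cost(M, C1, L1)
tm_ratio = tm_cost(M / q, l_pen * L1) / tm_cost(M, L1)
print(f"AT ratio: {at_ratio:.2f}, TM ratio: {tm_ratio:.2f}")
```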
Definition 2
Tradeoff algorithms \(\mathcal {B}\) have AT compactness q if q is the maximal value such that
$$\mathrm {AT}_{B_q} \le \mathrm {AT}_{\mathcal {A}}.$$
Tradeoff algorithms \(\mathcal {B}\) have TM compactness q if q is the maximal value such that
$$\mathrm {TM}_{B_q} \le \mathrm {TM}_{\mathcal {A}}.$$
For the concrete schemes we take “minimally secure” values of T, i.e. those that are supposed to provide \((\alpha ,\varPhi )\)-security for reasonably high \(\alpha \). Unfortunately, no explicit security claim of this kind is present in the design documents of the functions we consider.
Data-Dependent and Data-Independent Schemes. The existing schemes can be categorized according to the way they access memory. The data-independent schemes Catena [20], Pomelo [36], and Argon2i [13] compute \(\phi (j)\) independently of the actual password in order to avoid timing attacks like the one in [33]. Then the algorithm \(\mathcal {B}\) that uses less memory can recompute the missing blocks ahead of time, so that they are ready by the moment they are requested. Therefore, it has the same latency as the full-memory algorithm, i.e. \(L(\mathcal {B}) = L_0\). For these algorithms the time-memory product can be arbitrarily small, and the minimum \(\mathrm {AT}\) value is determined by the core-memory ratio.
The data-dependent schemes scrypt [30], yescrypt [31], and Argon2d [13] compute \(\phi (j)\) using the just computed block: \( \phi (j) = \phi (j,X_{i_{j-1}})\). Then precomputation is impossible, and for each recomputed block the latency is increased by the latency of the recomputation algorithm, so \(L_q>L_0\). There also exist hybrid schemes [25], which first run a data-independent phase and then a data-dependent one.
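The difference between the two classes can be illustrated with two toy indexing rules (both invented for the example): the data-independent one is a function of the step counter only and can be evaluated offline, while the data-dependent one reads the index from the block that has just been computed.

```python
def phi_independent(j: int) -> int:
    """Toy data-independent addressing: known in advance, so missing blocks
    can be recomputed ahead of the request (no latency penalty)."""
    return j // 2

def phi_dependent(j: int, prev_block: bytes) -> int:
    """Toy data-dependent addressing (scrypt-style): the index is extracted from the
    just-computed block, so recomputation latency adds to the running time."""
    return int.from_bytes(prev_block[:4], "little") % j

print(phi_independent(10), phi_dependent(10, b"\x2a" + b"\x00" * 63))
```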
3 Cryptanalysis of Catena-Dragonfly
3.1 Description
Short History. Catena was first published on ePrint [20] and then submitted to the Password Hashing Competition. Eventually the paper was accepted to Asiacrypt 2014 [21]. In the middle of the reviewing process, we discovered and communicated the first attack on Catena to the authors. The authors have introduced a new mode for Catena in the camera-ready version of the Asiacrypt paper, which is resistant to the first attack. The final version of Catena, which is the finalist of the Password Hashing Competition, contains two modes: Catena-Dragonfly (which we abbreviate to Catena-D), which is an extension to the original Catena, and Catena-Butterfly, which is a new mode advertised as tradeoff-resistant. In this paper we present the attack on Catena-Dragonfly, which is very similar to the first attack on Catena.
Specification. Catena-D is essentially a mode of operation over the hash function F, which is instantiated by Blake2b [10] in the full or reduced-round version. The functional graph of Catena-D is determined by the time parameter \(\lambda \) (values \(\lambda =1,2\) are recommended) and the memory parameter n, and can be viewed as a \((\lambda +1)\)-layer graph with \(2^n\) vertices in each layer (denoted by Catena-D-\(\lambda \)). We denote the X-th vertex in layer l (both counted from 0) by \([X]^l\). With each vertex we associate the corresponding output of the hash function F and denote it by \([X]^l\) as well. The outputs are stored in the memory, and due to the memory access pattern it is sufficient to store only \(2^n\) blocks at each moment. The hash function F has a 512-bit output, so the total memory requirements are \(2^{n+6}\) bytes.
The first layer is filled as follows:
-
\([0]^0 = G_1(P,S)\), where \(G_1\) invokes 3 calls to F;
-
\([1]^0 = G_2(P,S)\), where \(G_2\) invokes 3 calls to F
-
\([i]^0 \leftarrow F([{i-1}]^0,[{i-2}]^0),\; 2\le i \le 2^n-1\).
Then \(2^{3n/4}\) nodes of the first layer are modified by function \(\varGamma \). The details of \(\varGamma \) are irrelevant to our attack.
The memory access pattern at the next layers is determined by the bit-reversal permutation \(\nu \). Each index is viewed as an n-bit string and is transformed as follows:
$$\nu (x_1 x_2 \ldots x_n) = x_n x_{n-1}\ldots x_1.$$
The layers are then computed as
-
\([0]^j = F([0]^{j-1}\,||\,[{2^n-1}]^{j-1})\);
-
\([i]^j = F([{i-1}]^j\,||\,[{\nu ({i})}]^{j-1})\).
Thus to compute \([X]^l\) we need \([\nu (X)]^{l-1}\). The latter can then be overwritten (Footnote 5). An example of Catena-D with \(\lambda =2\) and \(n=3\) is shown in Fig. 1.
The bit-reversal permutation is supposed to provide memory-hardness. The intuition is that it maps any segment to a set of blocks that are evenly distributed at the upper layer.
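Both the permutation \(\nu \) and the segment behaviour exploited in the next subsection are easy to check directly; the sketch below uses small illustrative parameters n and k.

```python
def nu(x: int, n: int) -> int:
    """Bit-reversal permutation on n-bit indices."""
    return int(format(x, f"0{n}b")[::-1], 2)

n, k = 9, 3                                                  # small illustrative parameters
assert all(nu(nu(x, n), n) == x for x in range(2**n))        # nu is an involution

# A segment [A B *^k] (A: k bits, B: n-2k bits) maps into the union [*^k nu(B) *^k].
A, B = 0b101, 0b011
segment = [(A << (n - k)) | (B << k) | low for low in range(2**k)]
images = {nu(x, n) for x in segment}
middle = {(x >> k) & (2**(n - 2 * k) - 1) for x in images}   # middle n-2k bits of each image
assert middle == {nu(B, n - 2 * k)}
print("segment property of nu verified for n=9, k=3")
```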
Original Tradeoff Analysis. The authors of Catena-D originally provided two types of security bounds against tradeoff attacks. Recall that Catena-D-\(\lambda \) can be computed with \(\lambda 2^n\) calls to F using \(2^n\) memory blocks. The Catena-D designers demonstrated that Catena-D-\(\lambda \) can be computed using \(\lambda S\) memory blocks with time complexity (Footnote 6)
Therefore, if we reduce the memory by the factor of q, i.e. use only \(\frac{2^n}{q}\) blocks, we get the following penalty:
The second result is the lower bound for tradeoff attacks with memory reduction by q:
However the constant in \(\varOmega ()\) is too small (\(2^{-18}\) for \(\lambda =3\)) to be helpful in bounding tradeoff attacks for small q. More importantly, the proof is flawed: the result for \(\lambda =1\) is incorrectly generalized for larger \(\lambda \). The reason seems to be that the authors assumed some independence between the layers, which is apparently not the case (and is somewhat exploited in our attack).
In the further text we demonstrate a tradeoff attack yielding much smaller penalties than Eq. (5) and thus asymptotically violating Eq. (6).
3.2 Our Tradeoff Attack on Catena-D
The idea of our method is based on the simple fact that
$$\nu (\nu (X)) = X,$$
where X can be a single index or a set of indices. We exploit it as follows. We partition layers into segments of length \(2^k\) for some integer k, and store the first block of every segment (first two blocks at layer 0). As the index of such a block ends with k zeros, we denote the set of these blocks as \([*^{n-k}0^k]\). We also store all \(2^{3n/4}\) blocks modified by \(\varGamma \), which we denote by \([\varGamma ]\).
Consider a single segment \([AB*^k]\), where A is a k-bit constant and B is an \((n-2k)\)-bit constant. Then
$$\nu ([AB*^k]) = [*^k\,\nu (B)\,\nu (A)].$$
Blocks \([*^k\nu (B)\nu (A)]\) belong to \(2^k\) segments that have \(\nu (B)\) in the middle of the index. Denote the union of these segments by \([*^k\nu (B)*^k]\). Now note that
$$\nu ([*^k\,\nu (B)\,*^k]) = [*^k\,B\,*^k]$$
and
$$\nu ([*^k\,B\,*^k]) = [*^k\,\nu (B)\,*^k].$$
Therefore, when we iterate the permutation \(\nu \), we are always within some \(2^k\) segments. We suggest the computing strategy in Algorithm 1. At layer t we recompute \(2^k\) full segments from layers 0 to \(t-2\) and \(2^k\) subsegments of length \(\nu (A)\) (interpreted as a number in the binary form) at layer \(t-1\). Therefore, the total cost of computing layer t is
The total cost of computing Catena-D-\(\lambda \) is
We store \((\lambda +1) 2^{n-k}\) blocks as segment starting points, \(2^{3n/4}\) blocks \([\varGamma ]\), and \(2^{2k}\) blocks for intermediate computations. For \(k = \log q +\log (\lambda +1) \) and \(q<2^{n/4}\) we store about \(2^n/q\) blocks, so the memory is reduced by a factor of q. This value of k yields the total computational complexity of
Since the computational complexity of the memory-full algorithm is \((\lambda +1)2^n\), our tradeoff method gives the computational penalty
Since Catena is a data-independent scheme, the latency of our method does not increase. Therefore, the time-memory product (Eq. (4)) can be reduced by a factor of \(2^{n/4}\). We can estimate how the AT costs evolve assuming the reference implementation in Sect. 2.3:
For \(q = 2^{n/5}\) and \(\lambda =2\) we get
For \(n=24\) (1 GB of RAM) we get
whereas
Therefore, we expect the time-area product to drop by a factor of about 25 if the memory is reduced by a factor of 30. In terms of Definition 1, Catena-D-2 is not \((1/25,\mathrm {AT})\)-secure. Our tradeoff method also has AT and TM compactness of at least \(2^{n/5} = 64\).
On other architectures the \(\mathrm {AT}\) may drop even further, and we expect that an adversary would choose the one that maximizes the tradeoff effect, so the actual impact of our attack can be even higher.
Violation of Catena-D Lower Bound. Our method shows that the Catena-D lower bound is wrong. If we sum the computational costs over the \(\lambda \) layers, we obtain the following computational penalty for the memory reduction by a factor of q:
which is asymptotically smaller than the lower bound \(\varOmega (q^{\lambda })\) (Eq. (6)) from the original Catena submission [20].
3.3 Other Results for Catena
Our attack on Catena can be further scrutinized and generalized to non-even segments. More details are provided in [14] with the summary given in Table 2.
4 Generic Precomputation Tradeoff Attack
Now we generalize the tradeoff method used in the attack on Catena to a class of data-independent schemes. We consider schemes \(\mathcal {G}\) where each memory block is a function of the previous block and some earlier block:
$$X[i] = F(X[i-1],\, X[\phi (i)]),$$
where \(\phi \) is a deterministic function such that \(\phi (i) <i\). A group of existing password hashing schemes falls into this category: Catena [20], Pomelo [36], and Lyra2 [25] (first phase). Multiple iterations of such a scheme are equivalent to a single iteration with a larger T and the additional restriction
$$i - M \le \phi (i) < i,$$
so that the memory requirements are M blocks.
The crucial property of attacks on data-independent schemes is that they can be tested and tuned offline, without hashing any real password. An attacker may spend significant time searching for an optimal tradeoff strategy, since it would then apply to the whole set of passwords hashed with this scheme.
Precomputation Method. Our tradeoff method generalizes as follows. We divide the memory into segments and store only the first block of each segment. For every segment I we calculate its image \(\phi (I)\). Let \(\overline{\phi ({I})}\) be the union of segments that contain \(\phi ({I})\). We repeat this process, obtaining \(U_1 = \overline{\phi (I)}\), \(U_2 = \overline{\phi (U_1)}\), and so on, until we reach an invariant set \(U_k = U(I)\).
The scheme \(\mathcal {G}\) is then computed according to Algorithm 2.
The total number of calls to F is \(\sum _{i\ge 0}|U_i|\), and the penalty to compute I is
How efficient the tradeoff is depends on the properties of \(\phi \) and the segment partition, i.e. how fast \(U_i\) expands. As we have seen, Catena uses a bit permutation for \(\phi \), whereas Lyra2 uses a simple arithmetic function or a bit permutation [20, 25]. In both cases \(U_i\) stabilizes in size after two iterations. If \(\phi \) is a more sophisticated function, the following heuristics (borrowed from our attacks on data-dependent schemes) might be helpful:
-
Store the first \(T_1\) computed blocks and the last \(T_2\) computed blocks for some \(T_1,T_2\) (usually about N / q).
-
Keep the list \(\mathcal {L}\) of the most expensive blocks to recompute and store M[i] if \(\phi (i)\in \mathcal {L}\) (Fig. 2).
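The following sketch makes the precomputation method concrete under the assumptions of this section: given an indexing function \(\phi \) and a segment length, it repeatedly takes the union of segments hit by \(\phi \) until the set stabilizes, and reports how many segments must be recomputed to obtain a single target segment. The bit-reversal \(\phi \) used in the example is only one possible choice.

```python
def recomputation_levels(target_segment: int, phi, seg_len: int):
    """U_0 = {target}; U_{i+1} = segments containing phi(block) for blocks of U_i,
    iterated until no new segments appear (the invariant set is reached)."""
    def segments_hit(seg: int):
        return {phi(seg * seg_len + off) // seg_len for off in range(seg_len)}

    levels, seen = [{target_segment}], {target_segment}
    while True:
        nxt = set().union(*(segments_hit(s) for s in levels[-1]))
        if nxt <= seen:
            break
        levels.append(nxt - seen)
        seen |= nxt
    return levels

# Example: bit-reversal indexing on n-bit block indices, segments of length 2**k.
n, k = 12, 4
phi = lambda i: int(format(i, f"0{n}b")[::-1], 2)
levels = recomputation_levels(target_segment=5, phi=phi, seg_len=2**k)
print("segments per level:", [len(level) for level in levels],
      "-> about", sum(map(len, levels)), "segment computations for one target segment")
```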
5 Generic Ranking Tradeoff Attack
Now we present a generic attack on a wide class of schemes with data-dependent memory addressing. Such schemes include scrypt [30] and the PHC finalists yescrypt [31], Argon2d [13], and Lyra2 [25]. We consider the schemes described by Eq. (1) with \(k=2\) and the following addressing (cf. also Fig. 3):
$$X[i] = F(X[i-1],\, X[r_i]), \qquad r_i = g(i, X[i-1]) \in [0, i-1]. \qquad (9)$$
Here g is some indexing function. This construction and our tradeoff method can be easily generalized to multiple functions F, to stateful functions (like in Lyra2), to multiple inputs, outputs, and passes, etc. However, for the sake of simplicity we restrict to the construction above.
Our tradeoff strategy is the following: we compute the blocks sequentially and for each block X[i] decide whether to store it or not. If we do not store it, we calculate its access complexity A(i), the number of calls to F needed to recompute it, as the sum of the access complexities of \(X[i-1]\) and \(X[r_i]\) plus one. If we store X[i], its access complexity is 0.
The storing heuristic rule is the crucial element of our strategy. The idea is to store the block if \(A(r_i)\) is too high.
Our ranking tradeoff method works according to Algorithm 3 (Fig. 4).
Here w, s and l are parameters, and we usually set \(l=3s\). The computational complexity is computed as
We also compute the latency L(i) of each block as \(L(i) = \max (L(r_i),L(i-1))+1\) if we do not store X[i] and \(L(i) = 0\) if we store it. Then the total latency is
We implemented our attack and tested it on the class of functions described by Eq. (9). For fixed w and s the total number of calls to F and the number of stored blocks are entirely determined by the indices \(\{r_i\}\). Thus we do not have to implement a real hash function; it is sufficient to generate \(r_i\) according to some distribution, model the computation as a directed acyclic graph, and compute C and L for this graph. We made a number of tests with uniformly random \(r_i\) (within the segment [0; i] and \(T=2^{12}\)) and different values of w and s. We then grouped the C and L values by memory complexity and recorded the lowest complexities for each memory reduction factor. These values are given in Table 3.
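The experiment can be reproduced with the sketch below. The storing rule (keep X[i] whenever \(A(r_i)\) exceeds a fixed threshold) is a simplified stand-in for Algorithm 3, whose exact parameters w, s, l are not reproduced here, so the numbers it prints are only indicative.

```python
import random

def ranking_simulation(T: int = 2**12, threshold: int = 6, seed: int = 1):
    """Simulate the ranking tradeoff on a 1-pass scheme with random data-dependent addressing."""
    rng = random.Random(seed)
    A = [0] * T                              # access complexity A(i): 0 if X[i] is stored
    L = [0] * T                              # latency L(i): 0 if X[i] is stored
    stored, calls, latency = 1, 0, 0         # X[0] is always stored
    for i in range(1, T):
        r = rng.randrange(i)                 # uniformly random reference block r_i
        calls += A[r] + 1                    # recompute X[r_i] if missing, then one call for X[i]
        latency += L[r] + 1
        if A[r] >= threshold:                # simplified storing rule (stand-in for Algorithm 3)
            stored += 1                      # A[i] and L[i] stay 0
        else:
            A[i] = A[i - 1] + A[r] + 1
            L[i] = max(L[i - 1], L[r]) + 1
    return T / stored, calls / T, latency / T

q, c_pen, l_pen = ranking_simulation()
print(f"memory reduction q = {q:.1f}, computational penalty = {c_pen:.2f}, latency penalty = {l_pen:.2f}")
```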
We conclude that generic 1-pass data-dependent schemes with random addressing are (0.75, AT)- and (0.75, TM)-secure with respect to our ranking method. Both the AT and TM ratios exceed 1 when \(q\ge 4\), so both the AT- and the TM-compactness are about 4.
6 Cryptanalysis of yescrypt
6.1 Description
yescrypt [31] is another PHC finalist, which is built upon scrypt and is notable for its high memory filling rate (up to 2 GB/sec) and a number of features, including custom S-boxes to thwart exhaustive search on GPU, multiplicative chains to increase the ASIC latency, and some others. yescrypt is essentially a family of functions, each member activated by a combination of flags. Due to the page limits, we consider only one function of the family.
Here we consider the yescrypt setting where flag yescrypt_RW is set, there is no parallelism, and no ROM (in the further text – just yescrypt). It operates on 1024-byte memory blocks \(X[1],X[2],\ldots , X[M]\). The scheme works as follows:
Here F and \(F'\) are compression functions (the details of \(F'\) are irrelevant for the attack). Therefore, the memory is filled in the first M steps and then \((T-M)\) blocks are updated using the state variable Y. Here \(\phi (i)\) is the data-dependent indexing function: it takes 32 bits of \(X[i-1]\) and interprets it as a random block index among the last \(2^k\) blocks, where \(2^k\) is the largest power of 2 that is smaller than i.
Transformation F operates on 1024-byte blocks as follows:
-
Blocks are partitioned into 16 64-byte subblocks \(B_0, B_1,\ldots ,B_{15}\).
-
New blocks are produced sequentially:
$$\begin{aligned} B_{0}^{new}&\leftarrow f(B_{0}^{old}\oplus B_{15}^{old});\\ B_{i}^{new}&\leftarrow f(B_{i-1}^{new}\oplus B_{i}^{old}),\; 0 <i<16. \end{aligned}$$The details of f are irrelevant to our attack.
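The chaining above is easy to express in code; f is replaced by a placeholder hash, and only the 16-subblock structure matters. The property exploited in the attack is visible in the loop: \(B_i^{new}\) depends only on \(B_i^{old}\) and the previously produced subblock of the same block, so once \(B_0^{new}\) of every block is stored, consecutive blocks can be recomputed subblock by subblock in a pipelined fashion.

```python
import hashlib

def f(x: bytes) -> bytes:
    """Placeholder for the 64-byte subblock transformation f."""
    return hashlib.blake2b(x, digest_size=64).digest()

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def F(block: bytes) -> bytes:
    """yescrypt-style 1024-byte block transformation built from 16 chained subblocks."""
    B = [block[64 * i:64 * (i + 1)] for i in range(16)]
    new = [f(xor(B[0], B[15]))]                 # B_0^new = f(B_0^old xor B_15^old)
    for i in range(1, 16):
        new.append(f(xor(new[i - 1], B[i])))    # B_i^new = f(B_{i-1}^new xor B_i^old)
    return b"".join(new)

print(len(F(bytes(1024))))                      # -> 1024
```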
6.2 Tradeoff Attack on yescrypt
Our crucial observation is that there is no diffusion from the last subblocks to the first ones. Thus if we store all \(B_0\), we break the dependencies between consecutive blocks and the subblocks can be recomputed from \(B_1\) to \(B_{15}\) with pipelining (Fig. 5). Suppose that the block X[i] is computed with latency L(i), i.e. its computation tree has L(i) levels if measured in F. However, if we consider the tree of f, then the actual latency of X[i] is \(L(i)+15\) instead of expected 16L(i) if measured in calls to f.
The tradeoff strategy is given in Algorithm 4.
If the missing block is recomputed by a tree of depth D, then the latency of the new block is \(D+16\) measured in calls to f, or \(\frac{D}{16}+1\) if measured in calls to F. This number should be compared to the latency \(D+1\) if we had not exploited the iterative structure of F. Thus if the ranking method gives the total latency L (measured in F), the actual latency should be \(\frac{L+15}{16}\).
For the smallest secure parameter (\(T=4M/3\)), the final computational and latency penalties, as well as the AT and TM penalties, are given in Table 4 (1 / 16-th of each block is added to the attacker’s memory). We conclude that yescrypt is only (0.45, AT)- and (0.45, TM)-secure, whereas the AT compactness is 4 and the TM compactness is 6. Since these numbers are worse than for generic 1-pass schemes, our attack clearly signals a vulnerability in the design of BlockMix. We expect that our attack becomes inefficient for \(T=2M\) and higher.
7 Future Work
Our tradeoff methods apply to a wide class of memory-hard functions, so our research can be continued in the following directions:
-
Application of our methods to other PHC candidates and finalists: Pomelo [36] and the modified Lyra2.
-
A set of design criteria for the indexing functions that would withstand our attacks.
-
New methods that directly target schemes that make multiple passes over memory or use parallel cores.
-
A set of tools that helps to choose a proof-of-work instance in various applications: cryptocurrencies, proofs of space, etc.
8 Conclusion
Tradeoff cryptanalysis of memory-hard functions is a young, relatively unexplored and complex area of research combining cryptanalytic techniques with an understanding of implementation aspects and hardware constraints. It has direct real-world impact, since its results can be immediately used in the ongoing arms race of mining hardware for cryptocurrencies.
In this paper we have analyzed the memory-hard functions Catena-Dragonfly and yescrypt. We show that Catena-Dragonfly is not memory-hard despite the original claims and the designers’ security proof, since a hardware-equipped adversary can reduce the attack costs significantly using our tradeoffs. We also show that yescrypt is more tradeoff-resilient than Catena, though we can still exploit several design decisions to reduce the time-memory and the time-area products by a factor of 2.
We generalize our ideas to the generic precomputation method for data-independent schemes and the generic ranking method for data-dependent schemes. Our techniques may be used to estimate attack costs in various applications, from the fast emerging area of memory-hard cryptocurrencies to password-based key derivation.
Notes
- 1.
The estimate comes from the numbers given in [6]: the best ASICs make \(2^{32}\) hashes per joule, whereas the most efficient laptops can do \(2^{17}\) hashes per joule.
- 2.
The full version of this paper is available at [14].
- 3.
As well as \(\mathcal {A}\), the family \(\mathcal {B}\) admits parallel implementations.
- 4.
We take low-area implementations, as possible parallelism is already taken into account.
- 5.
In terms of Eq. (1) we could enumerate all blocks as \([i]^j = j||\underbrace{i}_{n\text { bits }}\) so that \(\phi (j||i) = (j-1)||\nu (i) \).
- 6.
This result is a part of Theorem 6.3 in [20].
References
Litecoin: Mining hardware comparison. https://litecoin.info/Mining_hardware_comparison
Password Hashing Competition. https://password-hashing.net/
Software tool: John the Ripper password cracker. http://www.openwall.com/john/
Litecoin - Open source P2P digital currency (2011). https://litecoin.org/
IETF Draft: The scrypt Password-Based Key Derivation Function (2012). https://tools.ietf.org/html/draft-josefsson-scrypt-kdf-02
Bitcoin: Mining hardware comparison (2014). https://en.bitcoin.it/wiki/Mining_hardware_comparison
Vertcoin: Lyra2RE reference guide (2014). https://vertcoin.org/downloads/Vertcoin_Lyra2RE_Paper_11292014.pdf
Abadi, M., Burrows, M., Manasse, M.S., Wobber, T.: Moderately hard, memory-bound functions. ACM Trans. Internet Techn. 5(2), 299–327 (2005)
Alwen, J., Serbinenko, V.: High parallel complexity graphs and memory-hard functions. IACR Cryptology ePrint Archive 2014/238 (2014)
Aumasson, J.-P., Neves, S., Wilcox-O’Hearn, Z., Winnerlein, C.: BLAKE2: simpler, smaller, fast as MD5. In: Jacobson, M., Locasto, M., Mohassel, P., Safavi-Naini, R. (eds.) ACNS 2013. LNCS, vol. 7954, pp. 119–135. Springer, Heidelberg (2013)
Bernstein, D.J., Lange, T.: Non-uniform cracks in the concrete: the power of free precomputation. In: Sako, K., Sarkar, P. (eds.) ASIACRYPT 2013, Part II. LNCS, vol. 8270, pp. 321–340. Springer, Heidelberg (2013)
Bertoni, G., Daemen, J., Peeters, M., Van Assche, G.: Duplexing the sponge: single-pass authenticated encryption and other applications. In: Miri, A., Vaudenay, S. (eds.) SAC 2011. LNCS, vol. 7118, pp. 320–337. Springer, Heidelberg (2012)
Biryukov, A., Dinu, D., Khovratovich, D.: Argon and argon2: password hashing scheme. Technical report (2015). https://password-hashing.net/submissions/specs/Argon-v2.pdf
Biryukov, A., Khovratovich, D.: Tradeoff cryptanalysis of memory-hard functions. Cryptology ePrint Archive, Report 2015/227 (2015). http://eprint.iacr.org/
Biryukov, A., Pustogarov, I.: Proof-of-work as anonymous micropayment: rewarding a Tor relay. IACR Cryptology ePrint Archive 2014/1011 (2014). To appear at Financial Cryptography 2015
Chang, D., Jati, A., Mishra, S., Sanadhya, S.K.: Time memory tradeoff analysis of graphs in password hashing constructions. In: Preproceedings of PASSWORDS 2014, pp. 256–266 (2014). http://passwords14.item.ntnu.no/Preproceedings_Passwords14.pdf
Dwork, C., Goldberg, A.V., Naor, M.: On memory-bound functions for fighting spam. In: Boneh, D. (ed.) CRYPTO 2003. LNCS, vol. 2729, pp. 426–444. Springer, Heidelberg (2003)
Dwork, C., Naor, M., Wee, H.M.: Pebbling and proofs of work. In: Shoup, V. (ed.) CRYPTO 2005. LNCS, vol. 3621, pp. 37–54. Springer, Heidelberg (2005)
Dziembowski, S., Faust, S., Kolmogorov, V., Pietrzak, K.: Proofs of space. IACR Cryptology ePrint Archive 2013/796 (2013). To appear at Crypto 2015
Forler, C., Lucks, S., Wenzel, J.: Catena: a memory-consuming password scrambler. IACR Cryptology ePrint Archive, Report 2013/525 (2013). Version of 5 January 2014. http://eprint.iacr.org/eprint-bin/getfile.pl?entry=2013/525&version=20140105:194859&file=525.pdf
Forler, C., Lucks, S., Wenzel, J.: Memory-demanding password scrambling. In: Sarkar, P., Iwata, T. (eds.) ASIACRYPT 2014, Part II. LNCS, vol. 8874, pp. 289–305. Springer, Heidelberg (2014)
Giridhar, B., Cieslak, M., Duggal, D., Dreslinski, R.G., Chen, H.M., Patti, R., Hold, B., Chakrabarti, C., Mudge, T.N., Blaauw, D.: Exploring DRAM organizations for energy-efficient and resilient exascale memories. In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2013), pp. 23–35. ACM (2013)
Gürkaynak, F., Gaj, K., Muheim, B., Homsirikamol, E., Keller, C., Rogawski, M., Kaeslin, H., Kaps, J.-P.: Lessons learned from designing a 65nm ASIC for evaluating third round SHA-3 candidates. In: Third SHA-3 Candidate Conference, March 2012
Hopcroft, J.E., Paul, W.J., Valiant, L.G.: On time versus space. J. ACM 24(2), 332–337 (1977)
Simplicio Jr., M.A., Almeida, L.C., Andrade, E.R., dos Santos, P.C.F., Barreto, P.S.L.M.: The Lyra2 reference guide, version 2.3.2. Technical report, April 2014
Lengauer, T., Tarjan, R.E.: Asymptotically tight bounds on time-space trade-offs in a pebble game. J. ACM 29(4), 1087–1130 (1982)
Malvoni, K.: Energy-efficient bcrypt cracking. In: Passwords 2014 Conference (2014). http://www.openwall.com/presentations/Passwords14_Energ_Efficient_Cracking/
Mukhopadhyay, S., Sarkar, P.: On the effectiveness of TMTO and exhaustive search attacks. In: Yoshiura, H., Sakurai, K., Rannenberg, K., Murayama, Y., Kawamura, S. (eds.) IWSEC 2006. LNCS, vol. 4266, pp. 337–352. Springer, Heidelberg (2006)
Nakamoto, S.: Bitcoin: a peer-to-peer electronic cash system (2009). http://www.bitcoin.org/bitcoin.pdf
Percival, C.: Stronger key derivation via sequential memory-hard functions (2009). http://www.tarsnap.com/scrypt/scrypt.pdf
Peslyak, A.: Yescrypt - a password hashing competition submission. Technical report (2014). http://password-hashing.net/submissions/specs/yescrypt-v0.pdf
Pippenger, N.: Superconcentrators. SIAM J. Comput. 6(2), 298–304 (1977)
Ristenpart, T., Tromer, E., Shacham, H., Savage, S.: Hey, you, get off of my cloud: exploring information leakage in third-party compute clouds. In: Proceedings of the 2009 ACM Conference on Computer and Communications Security, CCS 2009, Chicago, Illinois, USA, 9–13 November 2009, pp. 199–212 (2009)
Sprengers, M., Batina, L.: Speeding up GPU-based password cracking. In: SHARCS 2012 (2012). http://2012.sharcs.org/record.pdf
Thompson, C.D.: Area-time complexity for VLSI. In: STOC 1979, pp. 81–88. ACM (1979)
Wu, H.: POMELO: a password hashing algorithm. Technical report (2014). https://password-hashing.net/submissions/specs/POMELO-v1.pdf
Acknowledgement
We would like to thank the authors of Catena for verifying and confirming our attack.
A Cryptanalysis of Lyra2 V1
A.1 Description of Lyra2 V1
Lyra2 [25] is a PHC finalist, notable for its high memory filling rate (up to 1 GB/sec). Very recently, Lyra2 has been significantly changed for the second round of the competition. This section describes the original submission to PHC [25], Lyra2 v1 (just Lyra2 in the further text).
Lyra2 is a hybrid hashing scheme, which uses data-independent addressing in the first phase and data-dependent addressing in the second phase. Lyra2 operates on blocks of 768 bits (96 bytes) each, and fills the memory with \(2^n\cdot C\) such blocks, where n and C are parameters, and C is by default set to 128 [25, p. 39]. In this paper we use \(C=128\). The entire memory is represented as a \((2^n\times C)\)-matrix M, and we refer to its components as rows and columns. Rows are denoted by M[i].
Lyra2 has two main phases: the single-iteration Setup phase, where the memory is addressed data-independently, and the multiple-iteration Wandering phase, where the memory is addressed data-dependently. The number T of iterations in the Wandering phase can be as low as 1, and we take this value in our analysis.
Setup Phase. The first phase fills rows sequentially from M[0] to \(M[2^n-1]\) as follows:
Here \( \phi (i) = 2^k -i\), where \(2^k\) is the smallest power of 2 that is not smaller than i, \(\overleftarrow{M[]}\) stands for the left rotation of each 768-bit word by 32 bits, and G is a cryptographic hash function.
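A small sketch of this indexing function (assuming it is applied for \(i\ge 2\)) shows the property used in the attack of Sect. A.2: \(\phi \) maps a contiguous segment of rows onto a contiguous segment of the same length within the preceding power-of-two window.

```python
def phi(i: int) -> int:
    """Lyra2 v1 Setup indexing: phi(i) = 2^k - i, with 2^k the smallest power of 2 >= i.
    Assumed to be applied for i >= 2."""
    k = (i - 1).bit_length()        # smallest k with 2**k >= i, for i >= 2
    return 2**k - i

a, q = 70, 8                        # rows 70..77 lie in the window (64, 128], so 2^k = 128
print([phi(i) for i in range(a, a + q)])   # -> [58, 57, 56, 55, 54, 53, 52, 51]
```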
The following details of F are relevant to our attack:
-
Function F is stateful: it operates on the 1024-bit state S, which is preserved between rows.
-
Function F(X, Y) processes columns \(X_i, Y_i\) of X and Y sequentially. The internal state undergoes C rounds (similarly to the duplex-sponge construction [12]), where in round i column \(Z_i\) of the output Z is produced as follows:
$$\begin{aligned} S&\leftarrow S\oplus X_i\oplus Y_i;\\ S&\leftarrow P(S);\\ Z_i&\leftarrow \text {768 least sign. bits of }(S). \end{aligned}$$Here P is a single round of the Blake2b internal permutation [10]. We do not exploit any specific property of P. Thus F can be seen as a duplex-sponge instantiated with a Blake2b round function.
We remind the reader that Z is used not only to produce a new row M[i] but also to overwrite the row \(M[2^k-i]\).
Wandering Phase. The Wandering phase transforms the blocks produced in the Setup phase. First, it reverses the ordering. Then it operates similarly to the Setup phase, but the second input block to F is taken pseudo-randomly:
Here g truncates the first input to its least significant 32 bits and XORs them with the second input. All indices are computed modulo \(2^n\).
A.2 Tradeoff Attack on the Setup and Wandering Phases of Lyra2
Our strategy for the Setup phase is similar to the one for Catena. Again, we exploit the properties of the indexing function \(\phi \).
Let us denote a segment of rows \(\{M[i],M[i+1],\ldots ,M[j]\}\) by M[i : j]. Consider a, b such that \(2^{k-1}<a<b<2^{k}\). Then
$$M[\phi (a:b)] = M[2^k-b : 2^k-a].$$
Thus to construct a single segment we need another segment of the same length. This suggests the following strategy for computing \(2^n\) rows in the Setup phase.
-
1.
Store the first \(2^{n-l}\) rows \(M[0],\ldots ,M[2^{n-l}-1]\), for some \(l>0\) (a parameter of the attack).
-
2.
We split rows from \(M[2^{n-l}]\) to \(M[2^n-1]\) into segments of length q for some \(q<2^{n-l}\). Store the entire state S at the start of each segment.
Then to compute segment \(M[a:a+q-1], 2^{k-1}<a<2^k\) we have to compute \(M[\phi (a:a+q-1)]\), which has been updated when computing segments between \(2^{k-2}\) and \(2^{k-1}\). Eventually we reach the stored \(2^{n-l}\) rows. To compute \(M[a:a+q-1], 2^{k-1}<a<2^k\) we need to compute a segment in the interval \([2^i:2^{i+1}]\) for each \(n-l <i < k\) (Fig. 6).
Let us figure out the memory reduction and the computational overhead of this procedure. We store \(2^{n-l}\) first rows and \(\frac{2^n}{96q}\) rows for starting state in each segment, then a segment of length q during recomputation. For segments between rows \(M[2^{n-l}]\) and \(M[2^{n-l+1}]\) we need 1 call to F per row, as there is no recomputation. For segments between rows \(M[2^{n-l+1}]\) and \(M[2^{n-l+2}]\) we need 2 calls to F per row, and so on. In general, we make
calls to F to compute a segment of length q between row indices \(2^{k}\) and \(2^{k+1}\). For the entire Setup phase we spend
calls to F. The memory requirements are \(2^{n-l} + q + \frac{2^n}{96q}\), which reaches the minimum of \(2^{n-l} + 2^{n/2-4.5}\) for \(q = 2^{n/2-5.5}\).
To summarize, our tradeoff algorithm B has computational penalty \((l-0.5)\) if the memory is reduced by the factor of \(2^l\) (Table 5).
Access Complexity of a Single Row. In the next phase we will need to calculate the cost of recomputing a single row rather than a segment. To compute a single row, we need to recompute \((l-0.5)\) segments on average, so the average recomputation complexity is:
Tradeoff Attack on the Wandering Phase of Lyra2 with \(T=1\). We apply the ranking method to the Wandering phase of Lyra2. Since Lyra2 updates two rows at once, its penalties are higher than in generic data-dependent schemes and are given in Table 6.
A.3 Tradeoff for the Full Lyra2 with \(T=1\)
Memory Partition. To run the attack on the full Lyra2 with fraction 1 / l of memory, we have to split the available memory between the Setup and Wandering phases. Suppose that we allocate a fraction \(\alpha \) of the memory to the Setup phase and a fraction \(\beta \) to the Wandering phase. Let \(P_S(\alpha )\) be the penalty of running the Setup phase with fraction \(\alpha \), \(P_R(\alpha )\) be the average access complexity of a single row from the Setup phase run with fraction \(\alpha \) (Eq. (11)), and \(P_W (\beta )\) be the penalty of running the Wandering phase with fraction \(\beta \) (Table 6). Then the total memory used is a fraction \( \alpha + \beta \) of the original. To estimate the time penalty, we note that in our tradeoff for the Wandering phase, each recomputation requests as many rows from the Setup phase as hash calls are made in the Wandering phase. Therefore, the total time penalty is estimated as
as we construct \(2\cdot 2^n\) blocks in two phases.
Exploiting Iterative Compression Function. Similarly to the attack on yescrypt we can exploit the fact that Lyra2 produces blocks of a row columnwise. Therefore, we have to make D calls to P to compute the first column of the block, whereas computation of other columns can be pipelined: the second column of the deepest tree level can be computed simultaneously with the first column of one level higher. To compute all 128 columns, we spend time needed to compute \(D+128\) columns only, so the actual latency penalty is \(1+D/128\). Therefore, the total latency penalty can be calculated as follows:
where \(D_S(\alpha ) = 1-(\log \alpha )/256\) is the average latency penalty in the Setup phase, \(D_R(\alpha ) = -\log \alpha -0.5\) is the average latency penalty for accessing the row from the Setup phase, and \(D_W (\beta )\) is the depth penalty for the Wandering phase given in Table 6.
The results are given in Table 7. We conclude that Lyra2 is only (0.33, AT)-secure and (0.1, TM)-secure. The AT compactness is 4 and the TM compactness is 16. Thus Lyra2 v1 is more susceptible to tradeoff attacks than yescrypt.
Copyright information
© 2015 International Association for Cryptologic Research