Parallelizing the Weil and Tate Pairings
Diego F. Aranha1,⋆ Edward Knapp2 , Alfred Menezes2 ,
and Francisco Rodrı́guez-Henrı́quez3
1
2
Institute of Computing, University of Campinas
dfaranha@gmail.com
Department of Combinatorics & Optimization, University of Waterloo
edward.m.knapp@gmail.com, ajmeneze@uwaterloo.ca
3
CINVESTAV-IPN, Computer Science Department
francisco@cs.cinvestav.mx
Abstract. In the past year, the speed record for pairing implementations on desktop-class machines has been broken several times. The speed
records for asymmetric pairings were set on a single processor. In this
paper, we describe our parallel implementation of the optimal ate pairing
over Barreto-Naehrig (BN) curves that is about 1.23 times faster using
two cores of an Intel Core i5 or Core i7 machine, and 1.45 times faster
using 4 cores of the Core i7 than the state-of-the-art implementation on
a single core. We instantiate Hess’s general Weil pairing construction and
introduce a new optimal Weil pairing tailored for parallel execution. Our
experimental results suggest that the new Weil pairing is 1.25 times faster
than the optimal ate pairing on 8-core extensions of the aforementioned
machines. Finally, we combine previous techniques for parallelizing the
eta pairing on a supersingular elliptic curve with embedding degree 4,
and achieve an estimated 1.24-fold speedup on an 8-core extension of an
Intel Core i7 over the previous best technique.
1
Introduction
Since the publication in 2001 of Boneh and Franklin’s seminal paper on identitybased encryption [13], pairings have been used extensively to design ingenious
protocols for meeting security objectives that are seemingly unachievable using conventional cryptographic techniques. Researchers have made remarkable
progress in designing and implementing both symmetric and asymmetric pairings. For asymmetric pairings based on Barreto-Naehrig (BN) elliptic curves at
the 128-bit security level, there have recently been several notable improvements
on the 10-million cycle Core2 implementation reported in [26]. Naehrig et al. [38]
exploited high-speed vector floating-point operations and were able to compute
a pairing in 4.47 million cycles on a Core2. Shortly thereafter, Beuchat et al.
[10] and Aranha et al. [3] reported timings of 2.33 million cycles and 1.70 million
cycles, respectively, when employing the fastest integer multiplier available on
the Intel Core i7 64-bit platform.
⋆
A portion of this work was performed while the author was visiting the University
of Waterloo.
L. Chen (Ed.): Cryptography and Coding 2011, LNCS 7089, pp. 275–295, 2011.
c Springer-Verlag Berlin Heidelberg 2011
276
D.F. Aranha et al.
The aforementioned timings were all for a single-core implementation. In this
paper, we continue the work initiated by Grabher, Großschädl and Page [21] on
implementing pairings on multi-core platforms. This work is especially challenging for asymmetric pairings because of the apparent paucity of opportunities
available for parallelizing Miller’s basic algorithm. In particular, it seems hopeless to expect the optimal 2-fold speedup for known algorithms when going from
1 core to 2 cores on existing computing platforms. Furthermore, effective usage of parallel computation resources depends on expensive operating system
calls for thread creation and synchronization and on the relative immaturity of
development tools such as compilers, profilers, and debuggers. Concurrent programming is difficult in general, due to fundamentally nondeterministic nature
of the multi-threading programming model [33], but it becomes even harder
when the computational cost of what is being computed is not several orders of
magnitude higher than the parallelization overhead itself.
Our Results. We focus our attention on two of the fastest known symmetric
and asymmetric pairings at the 128-bit security level, namely the eta pairing [6]
over a supersingular elliptic curve with embedding degree 4, and Vercauteren’s
optimal ate pairing [45] over BN elliptic curves. It is worthwhile studying both
symmetric and asymmetric pairings because they provide different functionalities (see [19]). Our target platforms are popular Intel architectures. For the
eta pairing, we combine techniques from [11] and [4] and achieve an estimated
1.24-fold speedup on an 8-core extension of an Intel Core i7 over the previous
best technique. For the optimal ate pairing, we exploit a method introduced in
[4] for parallelizing a Miller function evaluation and a new ‘delayed squaring’
technique. Our implementation is about 1.23 times faster using two cores of an
Intel Core i5 or Core i7 machine, and 1.45 times faster using 4 cores of the
Core i7 than the state-of-the-art implementation on a single core [3]. We observe
that the straightforward methods for parallelizing extension field multiplication
that were deemed effective in [21] fail on the platforms under consideration because our field multiplication is much faster, whereby the cost of managing the
resulting threads dominates the cost of useful computation.
The limited success in parallelizing the optimal ate pairing on BN curves is
due in part to the apparent difficulty in parallelizing the final exponentiation.
This motivated us to consider the Weil pairing, whose potential speed advantages
over the Tate pairing due to the absence of a final exponentiation in the former
were first considered in [32]. We study two optimal Weil pairings, both of which
can be computed using the equivalent of four independent Miller functions each
having optimal length, and without an expensive final exponentiation. The first
pairing is an instantiation of Hess’s general Weil pairing construction [27], while
the second pairing is an elegant new construction tailored for parallel execution.
These pairings are faster than previous variants of the Weil pairing proposed
in [47] and [27]. Our experimental results suggest that the new Weil pairing is
1.25 times faster than the optimal ate pairing on 8-core extensions of the Intel
Core i5 and Core i7 machines.
Parallelizing the Weil and Tate Pairings
277
We emphasize that our implementations are for a single pairing evaluation
on multiple cores. If a protocol requires multiple pairing evaluations, the best
strategy may be to simply execute each pairing on a single core of a multi-core
platform — the optimal strategy depends on several factors including the number
of available cores. Thus, our work is primarily directed at protocols that require
a single pairing evaluation in applications that have stringent response time requirements or where the processing power of individual cores in a multi-core platform is low. Some examples of protocols that require a single pairing evaluation
are the encryption and decryption procedures in the Boneh-Franklin identitybased encryption scheme [13], signature verification in the Boneh-Boyen short
signature scheme [12], the Sakai-Ohgishi-Kasahara non-interactive key agreement scheme [42], and Scott’s identity-based key agreement scheme [43].
Other Platforms. In order to exploit the fine-grained parallelism inherent in
hardware platforms, a designer must carefully craft the circuit’s control unit
and schedule its hundreds of thousands of micro instructions [17], an extremely
challenging and complex task. FPGA devices can achieve substantial performance improvements and power efficiency for cryptographic applications. However, their reconfigurable design feature results in unavoidable highly-redundant
architectures that cause an overhead area factor between 20 and 40 when compared to static ASIC designs [25]. Mainly because of this, contemporary FPGA
devices can run at maximum clock frequencies of less than 600MHz (although
this value is rarely or never achieved in actual cryptographic designs). Thus,
it is not entirely surprising that timings for the FPGA implementations of the
optimal ate pairing over BN curves reported by Duquesne and Guillermin [16]
and Yao et al. [46] are slightly slower than the ones for software implementation
by Aranha et al. [3].
In contrast, ASIC designs and multi-core processors can easily operate at
higher frequencies. Fan et al. [18] and Kammler et al. [30] presented ASIC and
ASIP designs of pairings over BN curves at the 128-bit security level. The timings achieved by these designs are both slower than the ones reported in the
aforementioned FPGA designs [16,46] and software implementation [3]. On the
other hand, modern multi-core processors are supported by standard C/C++
compilers such as GCC and Intel’s ICC that can be combined with OpenMP to
add parallelism. Even though achieving optimal performance using these tools is
a challenging task, the software implementor’s work is significantly easier than
that of the hardware designer. In practice, the achievable parallelism on the
multi-core processors tends to be coarse-grained, but this should be compared
with the high frequencies of operation that these platforms enjoy, the periodic
addition by major manufacturers of faster and more powerful sets of instructions, and the constant reduction of the retail price due to the large market for
these processors.
We believe that deciding whether multi-core processors or FPGA/ASIC
hardware devices are the best choice for parallel realizations of pairings is far
from clear. Perhaps in the near future, the preferred design option will be a
combination of both platforms as some hybrid computer manufacturers whose
278
D.F. Aranha et al.
architectures combine both technologies seem to suggest. This combination of
technologies can also be of interest for emerging resource-constrained and embedded multi-core architectures such as the dual-core Cortex ARM. It is conceivable
that such constrained processors can be supported by fast hardware accelerators
attached to them.
We expect that our study of parallel pairing implementation on desktop
processors presented will be useful in predicting the performance that parallel
pairing implementations can achieve in resource-constrained embedded systems.
Even though the parallelization overhead is likely to be much more expensive in
embedded systems than on desktop processors, the pairing computation time will
be slower due to the usage of smaller processor word sizes and less sophisticated
multipliers. Hence, we expect that the ratio between the pairing computation
time and the parallelization overhead reported in our work will remain roughly
the same as the ones that we can expect in resource-constrained platforms. Because of that, we anticipate that many of the observations and conclusions for
pairing parallelization on desktop processors that we arrive at can be extrapolated to the embedded-system scenario.
Outline. The remainder of this paper is organized as follows. The optimal ate
and eta pairings are reviewed in §2. Our parallelization of the optimal ate pairing
is described in §3, and the optimal Weil pairings are presented in §4. Our parallel
implementation of the optimal ate and optimal Weil pairings is described in §5.
Our improvements to the parallel implementation of the eta pairing are presented
in §6. We draw our conclusions in §7.
2
Background on Pairings
Let E be an elliptic curve defined over the finite field Fq , and let r be a prime
with r | #E(Fq ) and gcd(r, q) = 1. The embedding degree k is the smallest
positive integer with r | (q k − 1). We will assume that k is even, whence k > 1
and E[r] ⊆ E(Fqk ).
Miller Functions. Let R ∈ E(Fqk ) and let s be a non-negative integer. A
Miller function fs,R [36] of length s is a function in Fqk (E) with divisor (fs,R ) =
s(R) − (sR) − (s − 1)(∞). Note that fs,R is uniquely defined up to multiplication
by nonzero constants in Fqk . The length s of a Miller function determines the
number ⌊log2 s⌋ of doubling steps, and the Hamming weight of s determines
the number of addition steps in Miller’s algorithm for computing fs,R [36]. We
will always assume that Miller functions are minimally defined; that is, if R ∈
E(Fqe ), then fs,R is selected from the function field Fqe (E). Let u∞ be an Fq rational uniformizing parameter for ∞. A function f ∈ Fqk (E) is said to be
normalized if lc∞ (f ) = 1, where lc∞ (f ) = (u−t
∞ f )(∞) and t is the order of f at
∞. Furthermore, f is said to be semi-normalized if lc∞ (f ) belongs to a proper
subfield of Fqk .
Parallelizing the Weil and Tate Pairings
279
The Tate Pairing. Let GT denote the order-r subgroup of F∗qk . The (reduced)
Tate pairing er : E[r] × E[r] → GT can be defined by
er : (P, Q) →
fr,P (Q + R)
fr,P (R)
(qk −1)/r
(1)
where R ∈ E(Fqk ) satisfies R ∈ {∞, P, −Q, P − Q}. Several variants of the Tate
pairing have been defined in the literature, e.g., [6,28,34,45,39,27]. All these
pairings have the property that they are fixed powers of the Tate pairing with
domain restricted to the product of two order-r subgroups of E[r]. Next we
describe two of the fastest asymmetric and symmetric pairings — the optimal
ate pairing on BN curves, and the eta pairing on k = 4 supersingular curves.
The Optimal Ate Pairing. A BN elliptic curve E : Y 2 = X 3 + b [8] of order
r is defined over a prime field Fp , where p(z) = 36z 4 + 36z 3 + 24z 2 + 6z + 1 and
where r = #E(Fp ) = 36z 4 + 36z 3 + 18z 2 + 6z + 1 is prime. These curves have
embedding degree k = 12. The integer z is called the BN parameter.
Let π : (x, y) → (xp , y p ) be the Frobenius endomorphism, and let G1 =
{P ∈ E[r] : π(P ) = P } = E(Fp )[r]; G1 is the 1-eigenspace of π acting on
E[r]. There is a unique sextic twist Ẽ of E over Fp2 with r | #Ẽ(Fp2 ) [28];
let Ψ : Ẽ → E be the associated twisting isomorphism. Let Q̃ ∈ Ẽ(Fp2 ) be
a point of order r; then Q = Ψ (Q̃) ∈ E(Fp ). The group G2 = Q is the
p-eigenspace of π acting on E[r]. Points in G2 have x-coordinates in Fp6 , a
property that is exploited in the important denominator elimination speedup
[7]. For future reference, we note that for a suitable third root of unity δ ∈ Fp ,
the automorphism φ : (x, y) → (δx, −y) has order 6 and satisfies φ(P ) = p2 P
for all P ∈ G1 .
The optimal ate pairing [45] is aopt : G1 × G2 → GT defined by
(p12 −1)/r
(P, Q) → f6z+2,Q (P ) · ℓ(6z+2)Q,π(Q) (P ) · ℓ(6z+2)Q+π(Q),−π2 (Q) (P )
,
(2)
where ℓA,B denotes the line through points A and B, and where f6z+2,Q and the
line functions are semi-normalized. This pairing is called “optimal” because the
length 6z + 2 of the Miller function appearing in (2) has bitlength roughly onefourth that of the length r of the Miller function in the Tate pairing definition
(1) [45]. The exponent (p12 − 1)/r in (2) can be written as (p6 − 1)(p2 + 1)(p4 −
p2 + 1)/r. Since p-th powering is inexpensive in Fp12 , the exponentiation by
(p6 − 1)(p2 + 1) is said to be the easy part of the final exponentiation in (2); the
exponentiation by (p4 − p2 + 1)/r is called the hard part.
The Eta Pairing. Consider the supersingular elliptic curve E : Y 2 + Y =
X 3 + X defined over F2m , with m odd. For simplicity, we further assume that
m ≡ 7 (mod 8). We have #E(F2m ) = 2m + 1 + 2(m+1)/2 . Let r be a large
prime divisor of #E(F2m ). The embedding degree is k = 4. The extension field
F24m is represented using tower extensions F22m = F2m [s]/(s2 + s + 1) and
F24m = F22m [t]/(t2 + t + s). A distortion map on G1 = E(F2m )[r] is ψ : (x, y) →
280
D.F. Aranha et al.
(x + s2 , y + sx + t). Let GT be the order-r subgroup of F∗24m . The eta pairing of
Barreto et al. [6] is ηT : G1 × G1 → GT defined by
ηT : (P, Q) → fT,−P (ψ(Q))M ,
(3)
where T = 2(m+1)/2 + 1 and M = (22m − 1)(2m − 2(m+1)/2 + 1). Note that if
r ≈ 2m , then the length of the Miller function appearing in (3) is approximately
half that of the length r of the Miller function in the Tate pairing definition
(1). Observe also that the final exponentiation by M can be computed relatively
quickly since squaring in F24m is an inexpensive operation.
3
Parallelizing the Optimal Ate Pairing
This section shows how the Miller function fs,R in the optimal ate pairing can
be split into shorter Miller functions, which can then be computed on separate
cores. We shall assume that all Miller functions and line functions are seminormalized. Equations involving Miller and line functions hold up to multiplication by nonzero constants. The following are two well-known properties of Miller
functions.
Lemma 1 (Miller [36]). Let a and b be non-negative integers, and let R ∈
E(Fqk ). Then
(i) fa+b,R = fa,R ·fb,R ·ℓaR,bR /v(a+b)R , where vP denotes the vertical line through
P ; and
a
· fa,bR .
(ii) fab,R = fb,R
The method of [4] for parallelizing the computation of a Miller function fs,R is
the following. We first write s = 2w s1 + s0 , where s0 < 2w . Applying Lemma 1,
we obtain
w
ℓ2w s1 R,s0 R
.
(4)
fs,R = fs21 ,R · f2w ,s1 R · fs0 ,R ·
vsR
If s0 is small, then the Miller function fs0 ,R can be computed relatively cheaply.
w
Thus the computation of fs,R can be parallelized by computing fs21 ,R on one
processor and f2w ,s1 R on a second processor. The parameter w should be carefully selected in order to balance the time of the two function computations. The
relevant criteria for selecting w include the Hamming weight of s1 (which determines the number of additions in the Miller loop for the first function), and the
cost of the w-fold squaring in the first function relative to the cost of computing
w
s1 R in the second function. The w-fold squaring in fs21 ,R can be sped up by first
(p6 −1)(p2 +1)
(recall that exponentiation by (p6 −1)(p2 +1) is the
computing α = fs1 ,R
w
easy part of the final exponentiation), followed by α2 . The advantage of this
‘delayed squaring’ trick is that α belongs to the order-(p4 − p2 + 1) cyclotomic
subgroup of F∗p12 whence Karabina’s squaring method [31] (see also [23]) can
be deployed at a cost of 12 Fp multiplications plus some small overhead — this
is considerably less than squaring a general element in Fp12 which costs 24 Fp
multiplications.
Parallelizing the Weil and Tate Pairings
281
Each of the two expensive Miller function computations in (4) can be recursively parallelized. For this purpose, one writes s = st 2wt +· · ·+s2 2w2 +s1 2w1 +s0 ,
where si 2wi = (s mod 2wi+1 )−(s mod 2wi ) for some wt > · · · > w2 > w1 > w0 =
0. We also note that the lines that appear in (2) should be scheduled for execution on the processor that handles s0 since this processor does not have any
overhead of computing consecutive squarings.
Remark 1. We were unable to find any effective method for parallelizing the hard
part of the final exponentiation. The exponent (p4 −p2 +1)/r can be decomposed
2
into a multi-addition chain requiring the consecutive z-th powers αz , αz and
z3
α where α ∈ Fp12 [44]. However, the extremely low Hamming weight of z limits
the potential for parallelization. Furthermore, techniques that exploit very fast
squaring (e.g., [20]) and fixed bases (e.g., [35]) are not applicable.
Remark 2. We measured the parallelization overhead in the target platforms using OpenMP Microbenchmarks [40] and observed costs on the order of 1µs for
multi-threading primitives such as creation or synchronization; the costs confirm
those reported in [2]. These overheads also scaled linearly with the number of
threads involved. During 1µs on a 2.53GHz machine, it is possible to perform 6
Fp2 multiplications or 8 Fp2 squarings [3]. On the other hand, the most expensive Fp12 operation within the Miller loop of a pairing computation is a sparse
multiplication costing 13 Fp2 multiplications. Hence, it seems that any potential
speedup in a parallel implementation of Fp12 arithmetic (for example, by assigning each of the three Fp6 multiplications required for an Fp12 multiplication to
different processors), as suggested by Grabher, Großschädl and Page [21], would
be nullified by the overheads. Furthermore, this approach is clearly not scalable
to many processors.
4
Optimal Weil Pairings
This section presents two Weil pairings, called the α (§4.2) and β (§4.3) pairings,
that are well suited for parallelixation. In this section, E is BN curve defined
over Fp with BN parameter z. Unless otherwise stated, all functions are assumed
to be normalized. By a ‘pairing’ we will mean a non-degenerate bilinear pairing
from G1 × G2 to GT . Note that if e is a pairing and gcd(ℓ, r) = 1, then eℓ is also
a pairing. It is understood that pairing values are defined to be 1 if either input
point is equal to ∞.
The classical Weil pairing [36] is eW : G1 × G2 → GT defined by
eW : (P, Q) → −
fr,P (Q)
.
fr,Q (P )
(5)
Note that the Weil pairing does not have a final exponentiation. The two Miller
functions in (5) each have length approximately z 4 and can be independently
computed on two processors.
282
4.1
D.F. Aranha et al.
Hess’s Weil Pairing Construction
Hess [27] developed a framework for designing Tate and Weil-type pairings. Let
d
i
s be an integer. Let h =
i=0 hi x ∈ Z[x] with h(s) ≡ 0 (mod r), and let
m = h(s)/r; we assume that r ∤ m. For R ∈ E[r], Lemma 1(ii) gives
m
= fmr,R = f d
fr,R
i=0
hi si ,R .
As shown by Vercauteren [45, Theorem 1], repeated application of Lemma 1
yields
d
d−1
d
ℓsi+1 R,hi si R
hi
m
fr,R =
(6)
fhi ,si R ·
fsi ,R ·
vsi R
i=0
i=0
i=1
where si = dj=i hj sj . Let fs,h,R denote the expression within the parenthesis
in (6); fs,h,R is called an extended Miller function and can be verified to have
d
divisor i=0 hi ((si R) − (∞)). Note that fs,R = fs,s−x,R . Hess’s result for Weil
pairings specialized to BN curves is the following.
Theorem 1 (Theorem 1 in [27]). Let s be a primitive 6th root of unity modulo
r2 with s ≡ p2 (mod r). Let h ∈ Z[x] with h(s) ≡ 0 (mod r) and h(s) ≡ 0
(mod r2 ). Then es,h : G1 × G2 → GT defined by
es,h : (P, Q) →
fs,h,P (Q)
fs,h,Q (P )
6
(7)
is a pairing. In particular, there exists a sixth root of unity w ∈ Fp such that
taking h(x) = p2 − x gives the eil pairing eeil : G1 × G2 → GT defined by
eeil : (P, Q) → w
4.2
fp2 ,P (Q)
.
fp2 ,Q (P )
(8)
The α Weil Pairing
Taking h(x) = (2z + 1) + (6z 2 + 2z)x in Theorem 1 yields the pairing α′ :
G1 × G2 → GT defined by
α′ : (P, Q) → w
f2z+1,P (Q) f6z2 +2z,p2 P (Q) ℓ(6z2 +2z)p2 P,(2z+1)P (Q)
·
·
f2z+1,Q (P ) f6z2 +2z,p2 Q (P ) ℓ(6z2 +2z)p2 Q,(2z+1)Q (P )
(9)
for some sixth root of unity w ∈ Fp . Since 6 | p6 − 1 and r ∤ (p6 − 1)(p2 + 1), it
follows that the map α : G1 × G2 → GT defined by
6
α = (α′ )(p
−1)(p2 +1)
: (P, Q) →
f2z+1,P (Q) f6z2 +2z,p2 P (Q)
·
f2z+1,Q (P ) f6z2 +2z,p2 Q (P )
(p6 −1)(p2 +1)
(10)
is also a pairing. The advantage of α over α′ is that field elements that lie in a
proper subfield of Fp12 can be ignored. In particular, the four Miller functions in
Parallelizing the Weil and Tate Pairings
283
(10) only need to be semi-normalized, the important denominator elimination
speedup of [7] can be applied, and the two line functions in (9) can be ignored.
Furthermore, the delayed squaring technique of §3 can be employed as described
below. In order to shorten the length of the Miller functions f6z2 +2z,R for R ∈
{p2 P, p2 Q} in (10), we can use Lemma 1(ii) to write f6z2 +2z,R = fz,(6z+2)R ·
z
f6z+2,R
. The revised formula for the α pairing now has 6 Miller functions, and it
may appear that at least 6 processors would be necessary to effectively parallelize
the pairing. However, we observe that
2
p
f6z2 +2z,p2 Q (P ) = f6z
2 +2z,Q (P )
(11)
since p2 Q = π 2 (Q) and π(P ) = P . Lemma 2 shows that an analogous conclusion
can be drawn with P and Q exchanged.
Lemma 2. For all P ∈ G1 and Q ∈ G2 we have
(p6 −1)(p2 +1)
p2 (p6 −1)(p2 +1)
f6z2 +2z,p2 P (Q) = f6z2 +2z,P
(Q).
(12)
Proof. To simplify the notation, we set a = 6z 2 + 2z and c = (p6 − 1)(p2 + 1).
Two applications of Lemma 1(ii) yields
2
cp
c
ac
c
c
fap
2 ,P = fp2 ,P · fa,p2 P = fa,P · fp2 ,aP .
2
(13)
2
Let π̃ : (x, y) → (xp , y p ) be the p2 -power Frobenius acting on the twist Ẽ. Let
G̃1 and G̃2 be the 1-eigenspace and the p2 -eigenspace, respectively, of π̃ acting on
Ẽ[r]. Then Ψ −1 (G1 ) = G̃2 and Ψ −1 (G2 ) = G̃1 , where Ψ : Ẽ → E is the twisting
isomorphism1 . Lemma 6 of [22] applied to Ẽ shows that (P̃ , Q̃) → f˜p2 ,P̃ (Q̃) is a
pairing on G̃2 × G̃1 , where f˜ is a normalized Miller function associated with Ẽ.
Thus
f˜p2 ,aP̃ (Q̃) = f˜pa2 ,P̃ (Q̃).
(14)
It follows from the work in [1] (see also the proof of Theorem 1 in [15]) that
f˜pc2 ,P̃ (Q̃) = fpc2 ,P (Q) where P = Ψ −1 (P̃ ) ∈ G1 and Q = Ψ −1 (Q̃) ∈ G2 . Hence
(14) can be written as
fpc2 ,aP (Q) = fpac2 ,P (Q).
The result now follows from (13).
Using (11) and (12), we obtain the following alternate formulation of the α
pairing:
⎛
α : (P, Q) → ⎝
1
f2z+1,P (Q)
·
f2z+1,Q (P )
z
(Q)
fz,(6z+2)P (Q) · f6z+2,P
z
fz,(6z+2)Q (P ) · f6z+2,Q (P )
p2
⎞(p6 −1)(p2 +1)
⎠
.
(15)
We couldn’t find the statement Ψ (G̃2 ) = G1 in the literature. For completeness, we
include a proof in Appendix A.
284
D.F. Aranha et al.
Now, if fz,R is computed first, then f2z+1,R and f6z+2,R can be computed thereafter with little additional work. Thus, there are effectively only 4 Miller functions in (15). Each of these Miller functions has length approximately z, and
therefore the α pairing is considered optimal in the sense of [45].
Figure 1 illustrates the execution path when the 4 Miller functions in (15)
are computed in parallel using 4 processors. A further optimization is to raise
the Miller functions to the power (p6 − 1)(p2 + 1) as soon as they are computed
z
— this enables the use of Karabina’s fast squaring when computing f6z+2,P
(Q)
z
2
3
and f6z+2,Q (P ). Note also that since (6z + 2) + p − p + p ≡ 0 (mod r), we
have (6z + 2)Q = −π(Q) + π 2 (Q) − π 3 (Q) and thus (6z + 2)Q can be computed
very quickly. The fourth processor in Figure 1 is the bottleneck because of the
exponentiation by z.
1. (6z + 2)P
2. f2z+1,P (Q)
3. (6z + 2)Q
4. f2z+1,Q (P )
2
p
fz,(6z+2)P
(Q)
f6z+2,P (Q)
z
f6z+2,P
(Q)
2
p
fz,(6z+2)Q
(P )
f6z+2,Q (P )
z
f6z+2,Q
(P )
2
p
f2z+1,P (Q) · f6z
2 +2z,P (Q)
2
zp
f2z+1,P (Q) · f6z+2,P
(Q)
2
p
f2z+1,Q (P ) · f6z
2 +2z,Q (P )
2
zp
f2z+1,Q (P ) · f6z+2,Q
(P )
Fig. 1. Execution path for computing the α pairing on 4 processors
In §4.3, we present a variant of the Weil pairing that is slightly faster than
the α pairing.
4.3
The β Weil Pairing
Consider h(x) = (6z + 2) + x − x2 + x3 . From (6) we obtain
−m
−1
fp,h,R
= fr,R
· fp,R · fp−1
2 ,R · fp3 ,R
(16)
and
fp,h,R = f6z+2,R · f−1,p2 R ·
ℓ(6z+2)R,(p−p2 +p3 )R ℓpR,(−p2 +p3 )R ℓ−p2 R,p3 R
·
·
. (17)
v∞
v(p−p2 +p3 )R v(−p2 +p3 )R
Theorem 2. For h(x) = (6z + 2) + x − x2 + x3 , the map β ′ : G1 × G2 → GT
defined by
p
fp,h,P (Q)
fp,h,pP (Q)
′
(18)
β : (P, Q) → w
fp,h,Q (P )
fp,h,Q (pP )
is a pairing, where w ∈ Fp is some sixth root of unity.
Parallelizing the Weil and Tate Pairings
285
Proof. For simplicity, multiplicative factors that are sixth roots of unity will be
omitted in the proof. For y ∈ {r, p, p2 , p3 }, define the functions
γy : (P, Q) →
fy,P (Q)p fy,pP (Q)
·
fy,Q (P )p fy,Q (pP )
−1
−m
· fp,P · fp−1
on G1 × G2 . Since fp,h,P
= fr,P
2 ,P · fp3 ,P , it follows that
β ′ (P, Q)−1 = γr (P, Q)−m · γp (P, Q) · γp2 (P, Q)−1 · γp3 (P, Q)
for all (P, Q) ∈ G1 × G2 . The conclusion that β ′ is a pairing immediately follows
if it can be shown that γr , γp , γp2 and γp3 are pairings.
Now, (P, Q) → fr,P (Q)/fr,Q (P ) and (P, Q) → fp2 ,P (Q)/fp2 ,Q (P ) are, respectively, the classic Weil pairing (5) and the eil pairing (8). It follows that γr and
γp2 are also pairings.
p
Using the facts that fp2 ,R = fp,R
· fp,pR (Lemma 1(ii)) and that (P, Q) →
fp,Q (P ) is a pairing (Lemma 6 of [22]), the eil pairing can be written as
fp2 ,P (Q)
fp,P (Q)p fp,pP (Q)
fp,P (Q)p fp,pP (Q)
=
=
= γp (P, Q),
·
·
p
fp2 ,Q (P )
fp,Q (P ) fp,pQ (P )
fp,Q (P )p fp,Q (pP )
and hence γp is a pairing.
p
Finally, we note that (P, Q) → fp2 ,Q (P ) is a pairing since fp2 ,Q = fp,Q
· fp,pQ
and (P, Q) → fp,Q is a pairing. Using this observation and the fact that fp3 ,R =
2
p
fp,R
· fp2 ,pR , one can check that
2
γp3 (P, Q) = γp (P, Q)p · γp2 (pP, Q).
Hence γp3 is a pairing.
Remark 3. The proof of Theorem 2 can easily be modified for all polynomials
h(x) ∈ Z[x] that satisfy h(p) ≡ 0 (mod r).
Since 6 | p6 − 1 and r ∤ (p6 − 1)(p2 + 1), the map β : G1 × G2 → GT defined by
′ (p6 −1)(p2 +1)
β = (β )
: (P, Q) →
fp,h,P (Q)
fp,h,Q (P )
p
fp,h,pP (Q)
fp,h,Q (pP )
(p6 −1)(p2 +1)
(19)
is also a pairing. Since each extended Miller function in (19) is essentially a
Miller function of length approximately z (see (17)), the β pairing is considered
optimal. As was the case with the α pairing, the exponentiation by (p6 −1)(p2 +1)
means that the four extended Miller functions in (19) only need to be seminormalized and denominator elimination can be applied. Moreover, the vertical
lines v(p−p2 +p3 )R , v(−p2 +p3 )R , f−1,p2 R and ℓ(6z+2)R,(p−p2 +p3 )R for R ∈ {P, pP, Q}
in (17) can be ignored. Once pP has been computed, the remaining line functions
ℓpR,(−p2 +p3 )R and ℓ−p2 R,p3 R for R ∈ {P, pP, Q} can be computed at very little
additional cost since p2 P = φ(P ), p3 P = φ(pP ), and pQ = π(Q). Furthermore,
286
D.F. Aranha et al.
the delayed squaring technique of §3 can be employed if the extended Miller
functions are divided using (4).
Figure 2 illustrates the execution path when the 4 extended Miller functions
in (19) are computed in parallel using 4 processors. The fourth processor is the
bottleneck, and thus it is desirable to accelerate the computation of pP . To this
effect, we observe that p ≡ 2z(p2 − 2) + p2 − 1 (mod r), whence
pP = 2z(p2 − 2)P + p2 P − P = 2z(φ(P ) − 2P ) + φ(P ) − P.
(20)
Thus, computing pP has roughly the same cost as zP .
p
1. f6z+2,P
(Q)
2. pP
p
3. f6z+2,Q
(P )
4. pP
p
f6z+2,P
(Q) · f6z+2,pP (Q)
f6z+2,pP (Q)
p
f6z+2,Q
(P ) · f6z+2,Q (pP )
f6z+2,Q (pP )
Fig. 2. Execution path for computing the β pairing on 4 processors
5
Parallel Implementation of the BN Pairings
The parallelization approaches described in §3 and §4 were implemented on top
of the state-of-the-art implementation of an optimal ate pairing at the 128-bit
security level described in [3]. The underlying elliptic curve is a BN curve with
parameter z = −(262 + 255 + 1) [41].
Let π denote the number of available processors on the target platform. To
select parameters (wπ−1 , . . . , w2 , w1 , w0 ) that split the Miller loop, we employ the
load balancing scheme suggested in [4] with fine granularity, taking into account
the relative cost of inversions, multiplications, squarings, additions and modular
reductions on the target platform. The following parameters wi for splitting the
Miller loop were selected for the optimal ate pairing:
without delayed squaring
2 cores
30, 0
4 cores
46, 28, 12, 0
8 cores
54, 42, 32, 24, 16, 9, 4, 0
with delayed squaring
2 cores
35, 0
4 cores
52, 37, 19, 0
8 cores
61, 56, 49, 42, 33, 23, 12, 0
Parallelizing the Weil and Tate Pairings
287
With the optimal parameters determined, elementary operation counting makes
it possible to estimate the performance improvement of the corresponding implementation. Figure 3 presents the estimated speedups for the parallelization
approaches discussed in this work in comparison with the optimal ate serial implementation of [3]. Notice how the performance of the α and β Weil pairings
scales better with the number of processing cores. Scaling stills suffers from the
same saturation effect experienced by the ate pairing variants, but at a higher
number of cores.
!"#
$ % "#
& % "#
Fig. 3. Estimated speedups for parallelization techniques for BN pairings. Speedups
are computed in relation to a serial implementation of the optimal ate pairing.
The parallel implementation was realized on platforms — a 2-core Intel Core i5
Westmere 540M 32nm 2.53GHz machine (“Platform 1”) and a 4-core Intel
Core i7 Sandy Bridge 2630QM 32nm 2.0GHz machine (“Platform 2”), using
GCC v4.6.0 as compiler with optimization flags -O3 -funroll-loops. Parallel sections were implemented through the compiler’s native OpenMP support.
The same split in the Miller loop was used in both machines, as they have similar field operation costs. Table 1 presents the experimental results, including
both the speedups estimated by operation counting and actual timings. For the
optimal ate pairing, the table confirms that delayed squaring yields a slightly
better scaling, and that the increase in overall speedup starts to stagnate at 8
cores. Even with these obstacles, the pairing latency of the optimal ate pairing
is reduced by 18-20% when using 2 processor cores, a significant improvement
over the 10% reported as a preliminary result for the R-ate pairing in [4]. The
measured timings are compatible with the estimated speedups — the differences
are due to the parallelization overheads that are not accounted for in the model
for speedup estimation. The gap between the estimated and measured timings
increases with the number of cores due to the linear increase in overhead.
The performance of the β Weil pairing is generally superior to the α Weil
pairing due to the difference in the cost of computing zP in the former and an
288
D.F. Aranha et al.
Table 1. Experimental results for serial/parallel executions of BN pairings. Times are
presented in thousands of clock cycles and the speedups are computed as the ratio of
the execution time of serial and parallel implementations. The dashes represent data
points where there is no expected improvement over the serial implementation. The
columns marked with (*) present estimates based on per-thread data. Timings were
obtained with the Turbo Boost feature turned off, and therefore are compatible with
the timings in Table 4 of the extended version of [3].
Estimated speedup
Optimal ate
Optimal ate with delayed squaring
α Weil
β Weil
Platform 1: Intel Core i5 Nehalem 32nm
Optimal ate – latency
Optimal ate – speedup
Optimal ate with delayed squaring – latency
Optimal ate with delayed squaring – speedup
α Weil – latency
α Weil – speedup
β Weil – latency
β Weil – speedup
Platform 2: Intel Core i7 Sandy Bridge 32nm
Optimal ate – latency
Optimal ate – speedup
Optimal ate with delayed squaring – latency
Optimal ate with delayed squaring – speedup
α Weil – latency
α Weil – speedup
β Weil – latency
β Weil – speedup
Number of threads
1
2
4
8
1.00 1.24 1.40 1.48
1.00 1.27 1.47 1.58
0.43 0.80 1.33 1.84
0.44 0.86 1.48 2.05
1
2 4*
8*
2038 1682 1513 1453
1.00 1.21 1.35 1.40
– 1655 1435 1389
– 1.23 1.42 1.47
–
– 1590 1214
–
– 1.28 1.68
–
– 1481 1104
–
– 1.38 1.84
1
2
4
8*
1562 1287 1137 1107
1.00 1.21 1.37 1.41
– 1260 1080 1056
– 1.24 1.45 1.48
–
– 1272 936
–
– 1.23 1.67
–
– 1104 840
–
– 1.41 1.86
exponentiation by z in the cyclotomic subgroup in the latter (see Figures 1 and
2). It is important to observe that since the β pairing is tailored for parallel
execution, any future improvements in the parallelization of the Miller loop
in the optimal ate variants can be directly applied to the β pairing. In the
columns of Table 1 marked with (*), we present estimates for machines with
higher numbers of cores. These estimates were obtained by running multiples
threads per core and then measuring the cost of the most expensive thread.
This serves as an accurate prediction of performance scaling in future machines,
assuming that critical platform characteristics such as the memory organization
and multi-threading overhead will not change dramatically.
6
Parallel Implementation of the Eta Pairing
Four approaches to parallelizing the eta pairing are outlined next, followed by a
description of our implementation.
Parallelizing the Weil and Tate Pairings
289
Algorithm 1. Aranha et al. [4] applied the parallelization method given in (4)
to the reverse-loop eta pairing algorithm presented in [9], obtaining Algorithm 1.
The Miller loop is split across π processors inside a parallel execution section. The
intermediate results Fi , 0 ≤ i ≤ π − 1, from each core are multiplied together
in parallel in step 17 at a cost of ⌈log2 π⌉ F24m -multiplications. The starting
points (w0 , w1 , . . . , wπ−1 ) can be determined so that the precomputation cost of
wi
wi
wi
wi
processor i to compute (xQ )2 , (yQ )2 , (xP )1/2 , (yP )1/2 in step 10 can be
balanced with the number of iterations, wi+1 −wi , so that all processors incur the
same cost [4]. This balancing can be deduced statically with a simple platformdependent operation count. The main advantage of Algorithm 1 is the negligible
storage requirements, which makes it well suited for embedded platforms.
Algorithm 1. For parallelization of the eta pairing on π processors
Input: P = (xP , yP ), Q = (xQ , yQ ) ∈ E(F2m )[r], starting point wi for processor
i ∈ [0, π − 1].
Output: ηT (P, Q) ∈ F∗24m .
1: yP ← yP + 1
2: parallel section (processor i)
3: if i = 0 then
4:
ui ← xP , vi ← xQ
5:
g0 i ← ui · vi + yP + yQ + 1, g1 i ← ui + xQ , g2 i ← vi + x2P
6:
Gi ← g0 i + g1 i s + t, Li ← (g0 i + g2 i ) + (g1 i + 1)s + t, Fi ← Li · Gi
7: else
8:
Fi ← 1
9: end if
wi
wi
wi
wi
10: xQ i ← (xQ )2 , yQ i ← (yQ )2 , xP i ← (xP )1/2 , yP i ← (yP )1/2
11: for j ← wi to wi+1 − 1 do
√
√
12:
xP i ← xP i , yP i ← yP i , xQ i ← xQ 2i , yQ i ← yQ 2i
13:
ui ← xP i , vi ← xQ i
14:
g0 i ← ui · vi + yP i + yQ i + 1, g1 i ← ui + xQ i
15:
Gi ← g0 i + g1 i s + t, Fi ← Fi · Gi
16: end for
17: F ← π−1
i=0 Fi
18: end parallel
19: return F M
Algorithm 2. An alternate parallelization strategy proposed in [11] is to prej
j
j
j
compute all the squares and square-roots (xQ )2 , (yQ )2 , (xP )1/2 , (yP )1/2 for
0 ≤ j < (m − 1)/2 and split the Miller loop in equal parts so that wi+1 − wi =
⌈(m − 1)/2π⌉ without the need for any static scheduling. This strategy requires
storage capacity for 4 · (m − 1)/2 ≈ 2m field elements, and therefore is more
useful for the desktop platform scenario where storage is abundant. However,
Algorithm 2 was found to be slower that Algorithm 1 because the precomputation cost is higher. Since this precomputation step is executed serially in Algorithm 2, it can be viewed as the equivalent of all the processors incurring a
higher precomputation cost compared with Algorithm 1.
290
D.F. Aranha et al.
Algorithm 3. A straightforward improvement to Algorithm 2 is to further parallelize the precomputation step using at most 4 processors, where each processor
takes care of one iterated squaring or square-root routine.
Algorithm 4. An improvement to Algorithm 3 is to utilize a recently-introduced
time-memory trade-off for computing consecutive squares [14]. Since squaring
and square-root are linear maps on F2m = F2 [x]/(f (x)), for any fixed power 2k ,
a table T of 16⌈m/4⌉ F2m elements can be precomputed so that
k
T [j, i0 + 2i1 + 4i2 + 8i3 ] = (i0 x4j + i1 x4j+1 + i2 x4j+2 + i3 x4j+3 )2
(21)
for all 0 ≤ j < ⌈m/4⌉ and i0 , i1 , i2 , i3 ∈ {0, 1}. After the table has been comk
puted, any power α2 can be quickly computed as
⌈m/4⌉
k
α2 =
T [j, ⌊a/24j ⌋ mod 16],
j=0
where a denotes the integer whose base-2 digits are the coefficients of the polynomial α ∈ F2 [x] (which has degree at most m-1). A similar table can be computed
for 2−k powers. The objective of this technique is to permit the computation of
any 2k or 2−k -powers in constant time independent of how large k is. When applied to the precomputation step required for parallelization, this in turn allows
each processor to use the same partition size at a storage cost of 2·16·m/4 = 8m
field elements per processor, divided into two tables for squarings and squareroots. This has the effect of making load balancing trivial while speeding up
the precomputation step. The technique has further application in reducing an
important part of the serial cost of parallel pairing computation — the final exponentiation by M . The time-memory trade-off can also be used to accelerate the
2(m+1)/2 -th power in F24m and inversion with the Itoh-Tsujii algorithm [29,11].
Implementation Report. For concreteness, we focus our attention on the
supersingular elliptic curve E1 : Y 2 + Y = X 3 + X over F21223 . This curve
is a candidate for the 128-bit security level [26]. We have #E1 (F21223 ) = 5r
where r is a 1221-bit prime. The representation chosen for the underlying field
is F21223 = F2 [x]/(x1223 + x255 + 1).
We implemented Algorithms 1–4 on an Intel Core i5 Westmere 32nm processor (“Platform 1”) and an Intel Core i7 Sandy Bridge 32nm processor (“Platform 2”), both equipped with the new carry-less multiplier [24]. The finite field
implementation closely followed [5] with the exception of multiplication, where
a multi-level Karatsuba multiplier of 128-bit granularity that exploits the new
native multiplication instruction was deployed. The first level is the traditional
2-way Karatsuba, where each of the three multiplications is solved by a 5-way
Karatsuba formula [37]. Each of the final 128-digit multiplications is computed
by another Karatsuba instance requiring 3 executions of the carry-less multiplication instruction.
Let M , S, R, T , I be the respective costs of multiplication, squaring, squareroot, repeated squaring or square-root computation using a precomputed table
Parallelizing the Weil and Tate Pairings
291
as given in (21), and inversion in F2m . The following ratios were obtained from
our implementation of arithmetic in F21223 on Platform 1: M = 14S and S ≈ R;
for Algorithm 1 we have I = 95M , and for Algorithm 4 we have T = 73S and
I = 51M . Using these ratios, the speedups for the eta pairing of the four parallelization approaches over a serial implementation were estimated; the results
are depicted in Figure 4. It can be observed from Figure 4 that the proposed parallelization is expected to be faster and more scalable, at the expense of storage
requirements for 8πm field elements. On Platform 2, the ratios were: M = 18S,
S ≈ R, I = 78M for Algorithm 1, and T = 81S and I = 38M for Algorithm 4.
Fig. 4. Estimated speedups of the eta pairing on Platform 1 for four parallelization
algorithms using up to π = 8 processors
Table 2 presents experimental results for this parallelization and updates results from [5] by accounting for the native support for binary field multiplication.
The speedups obtained by the actual implementation closely match what was
estimated in Figure 4. The one exception is for Algorithm 4 running in 8 threads
on Platform 1 where the competition for cache occupancy among the processors for quick access to the precomputed tables degrades performance. One can
expect that this effect will be reduced in a proper 8-core machine because the
memory hierarchy would be better prepared for parallel access. The native support for binary field multiplication has two effects: dramatically improving the
serial pairing computation time from 17.4 million cycles in [4] to 7.2 million
cycles (Platform 1) and 6.7 million cycles (Platform 2) in Algorithm 1; and reducing the relative cost of the Miller loop compared to the final exponentiation,
which further reduces the 8-core estimated speedups for Platform 1 from 6.11
in [4] to 4.96 in Algorithm 1. The improvements in Algorithm 4 eliminate an
important serial cost in the final exponentiation. For Platform 1, this resulted in
an increase in the 8-core estimated speedup to 6.40, and an acceleration of the
serial pairing latency by about 4.5%.
292
D.F. Aranha et al.
Table 2. Experimental results for serial/parallel executions the eta pairing. Times are
presented in thousands of clock cycles and the speedups are computed as the ratio of
the execution time of serial and parallel implementations. Timings were obtained with
the Turbo Boost feature turned off. The columns marked with (*) present estimates
based on per-thread data.
Platform
Algorithm
Algorithm
Algorithm
Algorithm
Algorithm
Algorithm
Platform
Algorithm
Algorithm
Algorithm
Algorithm
Algorithm
Algorithm
7
1: Intel Core i5 Nehalem 32nm
1 – estimated speedup
1 – latency
1 – speedup
4 – estimated speedup
4 – latency
4 – speedup
2: Intel Core i7 Sandy Bridge 32nm
1 – estimated speedup
1 – latency
1 – speedup
4 – estimated speedup
4 – latency
4 – speedup
Number of threads
1
2 4*
8*
1.00 1.85 3.18 4.96
8687 4707 2766 1818
1.00 1.84 3.14 4.77
1.00 1.92 3.62 6.40
8284 4331 2393 1575
1.00 1.91 3.46 5.26
1
2
4
8*
1.00 1.87 3.29 5.26
6648 3622 2072 1284
1.00 1.84 3.21 5.17
1.00 1.94 3.67 6.54
6455 3370 1794 1034
1.00 1.92 3.60 6.24
Concluding Remarks
Our work has demonstrated that asymmetric pairings derived from BN curves
can be significantly accelerated on multi-processor machines. Furthermore, our
experiments suggest that there are variants of the Weil pairing that are more
amenable to parallelization than the optimal ate pairing. Unlike the case with
asymmetric pairings, our parallel implementations of the eta pairing on k = 4
supersingular elliptic curves come close to achieving the ideal parallelization
factor. Nevertheless, we found that asymmetric pairings derived from BN curves
are faster than the eta pairing on multi-processor machines. Of course, these
conclusions are heavily dependent on the characteristics of the hardware we
employed and may change on future generations of multi-core architectures.
References
1. Akane, M., Nogami, Y., Morikawa, Y.: Fast ate pairing computation of embedding
degree 12 using subfield-twisted elliptic curve. IEICE Trans. Fundam. Electron.
Commun. Comput. Sci E92.A, 508–516 (2009)
2. Akhter, S., Roberts, J.: Multi-Core Programming: Increasing Performance through
Software Multi-threading. Intel Press (2006)
3. Aranha, D.F., Karabina, K., Longa, P., Gebotys, C.H., López, J.: Faster Explicit
Formulas for Computing Pairings over Ordinary Curves. In: Paterson, K.G. (ed.)
EUROCRYPT 2011. LNCS, vol. 6632, pp. 48–68. Springer, Heidelberg (2011),
http://eprint.iacr.org/2010/526
Parallelizing the Weil and Tate Pairings
293
4. Aranha, D.F., López, J., Hankerson, D.: High-Speed Parallel Software Implementation of the ηT Pairing. In: Pieprzyk, J. (ed.) CT-RSA 2010. LNCS, vol. 5985, pp.
89–105. Springer, Heidelberg (2010)
5. Aranha, D.F., López, J., Hankerson, D.: Efficient Software Implementation of Binary Field Arithmetic using Vector Instruction Sets. In: Abdalla, M., Barreto,
P.S.L.M. (eds.) LATINCRYPT 2010. LNCS, vol. 6212, pp. 144–161. Springer, Heidelberg (2010)
6. Barreto, P., Galbraith, S., Eigeartaigh, C.Ó., Scott, M.: Efficient pairing computation on supersingular abelian varieties. Designs, Codes and Cryptography 42,
239–271 (2007)
7. Barreto, P., Lynn, B., Scott, M.: On the Selection of Pairing-friendly Group.
In: Matsui, M., Zuccherato, R.J. (eds.) SAC 2003. LNCS, vol. 3006, pp. 17–25.
Springer, Heidelberg (2004)
8. Barreto, P.S.L.M., Naehrig, M.: Pairing-Friendly Elliptic Curves of Prime Order. In: Preneel, B., Tavares, S. (eds.) SAC 2005. LNCS, vol. 3897, pp. 319–331.
Springer, Heidelberg (2006)
9. Beuchat, J.-L., Brisebarre, N., Detrey, J., Okamoto, E., Rodrı́guez-Henrı́quez, F.:
A Comparison Between Hardware Accelerators for the Modified Tate Pairing over
F2m and F3m . In: Galbraith, S.D., Paterson, K.G. (eds.) Pairing 2008. LNCS,
vol. 5209, pp. 297–315. Springer, Heidelberg (2008)
10. Beuchat, J.-L., González-Dı́az, J.E., Mitsunari, S., Okamoto, E., Rodrı́guezHenrı́quez, F., Teruya, T.: High-Speed Software Implementation of the Optimal
ate Pairing over Barreto-Naehrig Curves. In: Joye, M., Miyaji, A., Otsuka, A.
(eds.) Pairing 2010. LNCS, vol. 6487, pp. 21–39. Springer, Heidelberg (2010)
11. Beuchat, J.-L., López-Trejo, E., Martı́nez-Ramos, L., Mitsunari, S., Rodrı́guezHenrı́quez, F.: Multi-core Implementation of the Tate Pairing over Supersingular
Elliptic Curves. In: Garay, J.A., Miyaji, A., Otsuka, A. (eds.) CANS 2009. LNCS,
vol. 5888, pp. 413–432. Springer, Heidelberg (2009)
12. Boneh, D., Boyen, X.: Short Signatures without Random Oracles. In: Cachin, C.,
Camenisch, J.L. (eds.) EUROCRYPT 2004. LNCS, vol. 3027, pp. 56–73. Springer,
Heidelberg (2004)
13. Boneh, D., Franklin, M.: Identity-Based Encryption from the Weil Pairing. In:
Kilian, J. (ed.) CRYPTO 2001. LNCS, vol. 2139, pp. 213–229. Springer, Heidelberg
(2001)
14. Bos, J.W., Kleinjung, T., Niederhagen, R., Schwabe, P.: ECC2K-130 on Cell CPUs.
In: Bernstein, D.J., Lange, T. (eds.) AFRICACRYPT 2010. LNCS, vol. 6055, pp.
225–242. Springer, Heidelberg (2010)
15. Costello, C., Lange, T., Naehrig, M.: Faster Pairing Computations on Curves with
High-Degree Twists. In: Nguyen, P.Q., Pointcheval, D. (eds.) PKC 2010. LNCS,
vol. 6056, pp. 224–242. Springer, Heidelberg (2010)
16. Duquesne, S., Guillermin, N.: A FPGA pairing implementation using the Residue
Number System. Cryptology ePrint Archive, Report 176 (2011)
17. Estibals, N.: Compact Hardware for Computing the Tate Pairing over 128-BitSecurity Supersingular Curves. In: Joye, M., Miyaji, A., Otsuka, A. (eds.) Pairing
2010. LNCS, vol. 6487, pp. 397–416. Springer, Heidelberg (2010)
18. Fan, J., Vercauteren, F., Verbauwhede, I.: Faster Fp -Arithmetic for Cryptographic
Pairings on Barreto-Naehrig Curves. In: Clavier, C., Gaj, K. (eds.) CHES 2009.
LNCS, vol. 5747, pp. 240–253. Springer, Heidelberg (2009)
19. Galbraith, S., Paterson, K., Smart, N.: Pairings for cryptographers. Discrete
Applied Mathematics 156, 3113–3121 (2008)
294
D.F. Aranha et al.
20. von zur Gathen, J.: Efficient and optimal exponentiation in finite fields. Computational Complexity 1, 360–394 (1991)
21. Grabher, P., Großschädl, J., Page, D.: On Software Parallel Implementation of
Cryptographic Pairings. In: Avanzi, R.M., Keliher, L., Sica, F. (eds.) SAC 2008.
LNCS, vol. 5381, pp. 35–50. Springer, Heidelberg (2009)
22. Granger, R., Hess, F., Oyono, R., Thériault, N., Vercauteren, F.: Ate Pairing on
Hyperelliptic Curves. In: Naor, M. (ed.) EUROCRYPT 2007. LNCS, vol. 4515, pp.
430–447. Springer, Heidelberg (2007)
23. Granger, R., Scott, M.: Faster Squaring in the Cyclotomic Subgroup of Sixth
Degree Extensions. In: Nguyen, P.Q., Pointcheval, D. (eds.) PKC 2010. LNCS,
vol. 6056, pp. 209–223. Springer, Heidelberg (2010)
24. Gueron, S., Kounavis, M.: Carry-less multiplication and its usage for computing
the GCM mode, Intel white paper (2010)
25. Güneysu, T.: Utilizing hard cores for modern FPGA devices for high-performance
cryptography. Journal of Cryptographic Engineering 1, 37–55 (2011)
26. Hankerson, D., Menezes, A., Scott, M.: Software implementation of pairings. In:
Joye, M., Neven, G. (eds.) Identity-Based Cryptography. IOS Press (2008)
27. Hess, F.: Pairing Lattices. In: Galbraith, S.D., Paterson, K.G. (eds.) Pairing 2008.
LNCS, vol. 5209, pp. 18–38. Springer, Heidelberg (2008)
28. Hess, F., Smart, N., Vercauteren, F.: The eta pairing revisited. IEEE Trans. Inf.
Theory 52, 4595–4602 (2006)
29. Itoh, T., Tsujii, S.: A fast algorithm for computing multiplicative inverses in
GF (2m ) using normal bases. Information and Computation 78, 171–177 (1988)
30. Kammler, D., et al.: Designing an ASIP for Cryptographic Pairings over BarretoNaehrig Curves. In: Clavier, C., Gaj, K. (eds.) CHES 2009. LNCS, vol. 5747, pp.
254–271. Springer, Heidelberg (2009)
31. Karabina, K.: Squaring in cyclotomic subgroups. Cryptology ePrint Archive, Report 542 (2010)
32. Koblitz, N., Menezes, A.: Pairing-Based Cryptography at High Security Levels. In:
Smart, N.P. (ed.) Cryptography and Coding 2005. LNCS, vol. 3796, pp. 13–36.
Springer, Heidelberg (2005)
33. Lee, E.: The problem with threads. Computer 39, 33–42 (2006)
34. Lee, E., Lee, H., Park, C.: Efficient and generalized pairing computation on abelian
varieties. IEEE Trans. Inf. Theory 55, 1793–1803 (2009)
35. Lim, C.H., Lee, P.J.: More Flexible Exponentiation with Precomputation. In:
Desmedt, Y.G. (ed.) CRYPTO 1994. LNCS, vol. 839, pp. 95–107. Springer,
Heidelberg (1994)
36. Miller, V.: The Weil pairing, and its efficient calculation. Journal of Cryptology 17,
235–261 (2004)
37. Montgomery, P.: Five, six, and seven-term Karatsuba-like formulae. IEEE Trans.
Comput. 54, 362–369 (2005)
38. Naehrig, M., Niederhagen, R., Schwabe, P.: New Software Speed Records for Cryptographic Pairings. In: Abdalla, M., Barreto, P.S.L.M. (eds.) LATINCRYPT 2010.
LNCS, vol. 6212, pp. 109–123. Springer, Heidelberg (2010)
39. Nogami, Y., Akane, M., Sakemi, Y., Kato, H., Morikawa, Y.: Integer Variable χBased ate Pairing. In: Galbraith, S.D., Paterson, K.G. (eds.) Pairing 2008. LNCS,
vol. 5209, pp. 178–191. Springer, Heidelberg (2008)
40. OpenMP Microbenchmarks v2.0,
http://www2.epcc.ed.ac.uk/computing/research activities/openmpbench/
openmp index.html
Parallelizing the Weil and Tate Pairings
295
41. Pereira, G., Simplı́cio Jr., M., Naehrig, M., Barreto, P.: A family of implementationfriendly BN elliptic curves. Journal of Systems and Software 84, 1319–1326 (2011)
42. Sakai, R., Ohgishi, K., Kasahara, M.: Cryptosystems based on pairing. In: Proc.
2000 Symp. on Cryptography and Information Security, Okinawa (2000)
43. Scott, M.: Authenticated ID-based key exchange and remote log-in with simple
token and PIN number. Cryptology ePrint Archive, Report 164 (2002)
44. Scott, M., Benger, N., Charlemagne, M., Dominguez Perez, L.J., Kachisa, E.J.: On
the Final Exponentiation for Calculating Pairings on Ordinary Elliptic Curves. In:
Shacham, H., Waters, B. (eds.) Pairing 2009. LNCS, vol. 5671, pp. 78–88. Springer,
Heidelberg (2009)
45. Vercauteren, F.: Optimal pairings. IEEE Trans. Inf. Theory 56, 455–461 (2010)
46. Yao, G., Fan, J., Cheung, R., Verbauwhede, I.: A high speed pairing coprocessor
using RNS and lazy reduction. Cryptology ePrint Archive, Report 258 (2011)
47. Zhao, C., Zhang, F., Xie, D.: Reducing the complexity of the Weil pairing computation. Cryptology ePrint Archive, Report 212 (2008)
A
Relationship between G1 , G2 , G̃1 and G̃2
Let E : Y 2 = X 3 + b be a BN curve defined over Fp with r = #E(Fp ). We
have Fp12 = Fp2 [w]/(w6 − ξ) where ξ ∈ Fp2 is a non-square and a non-cube [8].
Without loss of generality, we can assume that the equation for the degree-6 twist
Ẽ over Fp2 for which r | #Ẽ(Fp2 ) is Y 2 = X 3 + b/ξ. The twisting isomorphism
is Ψ : Ẽ → E, (x, y) → (w2 x, w3 y).
Lemma 3. Let G1 and G2 be the 1- and p-eigenspaces of the p-power Frobenius
π acting on E[r], and let G̃1 and G̃2 be the 1- and p2 -eigenspaces of the p2 -power
Frobenius π̃ acting on Ẽ[r]. Then Ψ (G̃1 ) = G2 and Ψ (G̃2 ) = G1 .
Proof. Ψ (G̃1 ) = G2 was proven in [8]. We shall show that Ψ (G̃2 ) = G1 .
2
2
2
2
2
We have wp = w(w6 )(p −1)/6 = w · ξ (p −1)/6 . Let c = ξ (p −1)/6 so wp = cw.
2
Then c6 = ξ p −1 ∈ F∗p2 , so c is a 6th root of unity and in fact is in Fp since p ≡ 1
(mod 6). Furthermore, c has order 6 since ξ is a non-square non-cube in Fp2 .
Let P = (w2 x, w3 y) ∈ G1 and P̃ = Ψ −1 (P ) = (x, y). Then
2
2
2
2
2
2
2
2
Ψ (π̃(P̃ )) = (w2 xp , w3 y p ) = (w2p c−2p xp , w3p c−3p y p )
= π 2 (c−2 w2 x, c−3 w3 y) = π 2 (χ(P )) = χ(P )
where χ : (x, y) → (c−2 x, c−3 y) is an order-6 automorphism of E defined over
Fp and thus satisfies χ(P ) = p2 P or χ(P ) = p10 P . If χ(P ) = p10 P , then we
have Ψ (π̃(P̃ )) = p10 P , and applying Ψ −1 to both sides gives π̃(P̃ ) = p10 P̃ —
this is impossible since p10 is not an eigenvalue of π̃ acting on Ẽ[r]. Hence we
must have χ(P ) = p2 P , whence Ψ (π̃(P̃ )) = p2 P . Applying Ψ −1 to both sides
gives π̃(P̃ ) = p2 P̃ , so P̃ ∈ G̃2 . We have established that Ψ −1 (P ) ∈ G̃2 , so we
conclude that Ψ (G̃2 ) = G1 .