36. HPCS2012
Test system: SiNW with 19,848 atoms; number of grid points: 320 × 320 × 120; number of bands: 41,472.
The total number of parallel processes is fixed at 12,288. Space-only mapping: space division of 12,288. Space + Orbital mapping: space division of 2,048 and band division of 6.
[Bar chart: time per SCF iteration (sec., 0 to 120) for GS, CG, MatE/SD and RotV/SD, each shown for the Space + Orbital mapping and the Space-only mapping. Stacked components: computation, adjacent communication/space, global communication/space, global communication/orbital, and wait/orbital. Annotations mark the rotation-matrix generation step and reductions of 78-79% with the Space + Orbital mapping.]
The global communication time is reduced substantially.
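To make the mapping concrete, the sketch below splits a fixed pool of MPI processes into a "space" communicator and an "orbital" communicator in the same 2,048 × 6 shape as above. This is a minimal illustration, not the actual RSDFT implementation; NSPACE, NORBITAL and the rank-ordering convention are assumptions chosen for clarity.

```c
/* Minimal sketch (not the RSDFT source): split MPI_COMM_WORLD into a
 * "space" communicator and an "orbital" communicator, mirroring the
 * 2,048 x 6 mapping above. NSPACE and NORBITAL are illustrative
 * constants, not RSDFT parameters. */
#include <mpi.h>
#include <stdio.h>

#define NSPACE   2048   /* parallel tasks in real-space grids */
#define NORBITAL 6      /* parallel tasks in orbitals (bands) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* expected: 12,288 */

    int space_id   = rank % NSPACE;     /* which spatial sub-domain */
    int orbital_id = rank / NSPACE;     /* which orbital group      */

    /* Ranks in the same orbital group exchange grid data
     * (adjacent/global communication in "space"). */
    MPI_Comm space_comm;
    MPI_Comm_split(MPI_COMM_WORLD, orbital_id, space_id, &space_comm);

    /* Ranks that own the same sub-domain but different orbital blocks
     * perform the global communication in "orbitals". */
    MPI_Comm orbital_comm;
    MPI_Comm_split(MPI_COMM_WORLD, space_id, orbital_id, &orbital_comm);

    if (rank == 0)
        printf("%d processes = %d (space) x %d (orbital)\n",
               size, NSPACE, NORBITAL);

    MPI_Comm_free(&space_comm);
    MPI_Comm_free(&orbital_comm);
    MPI_Finalize();
    return 0;
}
```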
Highly parallel RSDFT: effect of mapping
May 8, 2014, CMSI Advanced Lecture on Computational Science and Technology B
37.
... time as a result of keeping the block data on the L1 cache manually decreased by 12% compared with the computation time for the usual data replacement operations of the L1 cache. This DGEMM tuned for the K computer was also used for the LINPACK benchmark program.

5.2 Scalability

We measured the computation time for the SCF iterations with [...]. On the other hand, the global communication time for the parallel tasks in orbitals was supposed to increase as the number of parallel tasks in orbitals increased. The number of MPI processes required for communications for the parallel tasks in orbitals, however, was actually restricted to a relatively small number of compute nodes, and therefore, the wall clock time for global communications of the parallel tasks in orbitals was small. This means we succeeded in decreasing time for global communication by the combination [...]
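As an illustration of why the global communication among the parallel tasks in orbitals stays cheap, the following sketch (not taken from RSDFT; orbital_comm, partial and total are hypothetical names) performs the "global/orbital" reduction over a communicator that contains only the few ranks sharing one spatial sub-domain:

```c
/* Illustrative sketch only: the "global/orbital" step reduces partial
 * sums over the small orbital communicator (a handful of ranks), not
 * over all ranks, which is why its wall-clock time stays small.
 * orbital_comm is the communicator from the mapping sketch above;
 * partial[]/total[] are hypothetical work arrays of length n. */
#include <mpi.h>

void reduce_over_orbitals(const double *partial, double *total, int n,
                          MPI_Comm orbital_comm)
{
    /* Only the ranks that share a spatial sub-domain participate, so
     * the message count and latency are bounded by the (small) number
     * of parallel tasks in orbitals. */
    MPI_Allreduce(partial, total, n, MPI_DOUBLE, MPI_SUM, orbital_comm);
}
```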
Figure 6. Computation and communication time of (a) GS, (b) CG, (c) MatE/SD and (d)
RotV/SD for different numbers of cores.
[Figure 6: four panels plotting time per procedure (sec.) against the number of cores (0 to 80,000). (a) Time per GS (0 to 400 s): theoretical computation, computation, global/space, global/orbital, wait/orbital. (b) Time per CG (0 to 160 s): theoretical computation, computation, adjacent/space, global/space, global/orbital. (c) Time per MatE/SD (0 to 200 s) and (d) time per RotV/SD (0 to 300 s): same series as (b).]
Highly parallel RSDFT: scalability
38.
... confinement becomes prominent. The quantum effects, which depend on the crystallographic directions of the nanowire axes and on the cross-sectional shapes of the nanowires, result in substantial modifications to the energy-band structures and the transport characteristics of SiNW FETs. However, knowledge of the effect of the structural morphology on the energy bands of SiNWs is lacking. In addition, actual nanowires have side-wall roughness. The effects of such imperfections on the energy bands are ...
Table 2. Distribution of computational costs for an iteration of the SCF calculation of the modified code.

Procedure      Execution  Computation  ------------- Communication time (s) -------------   Performance
block          time (s)   time (s)     Adjacent/grids  Global/grids  Global/orbitals  Wait/orbitals   (PFLOPS/%)
SCF            2903.10    1993.89      61.73           823.02        12.57            11.89           5.48/51.67
SD             1796.97    1281.44      13.90           497.36        4.27             –               5.32/50.17
MatE/SD        525.33     363.18       13.90           143.98        4.27             –               6.15/57.93
EigenSolve/SD  492.56     240.66       –               251.90        –                –               0.01/1.03
RotV/SD        779.08     677.60       –               101.48        –                –               8.14/76.70
CG             159.97     43.28        47.83           68.85         0.01             –               0.06/0.60
GS             946.16     669.17       –               256.81        8.29             11.89           6.70/63.10

The test model was a SiNW with 107,292 atoms. The numbers of grids and orbitals were 576 × 576 × 180 and 230,400, respectively. The numbers of parallel tasks in grids and orbitals were 27,648 and three, respectively, using 82,944 compute nodes. Each parallel task had 2160 grids and 76,800 orbitals.
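The per-task sizes quoted in the caption, and the efficiency implied by the Performance column, can be reproduced with a quick arithmetic check. The sketch below is illustrative only; the 128 GFLOPS per-node peak of the K computer (8 cores × 16 GFLOPS) is an assumption taken from the machine's specification, not from the table.

```c
/* Sanity check of Table 2's caption (illustrative only; not part of
 * RSDFT). Assumes the K computer's per-node peak of 128 GFLOPS. */
#include <stdio.h>

int main(void)
{
    long long grids    = 576LL * 576LL * 180LL;   /* 59,719,680 grid points */
    long long orbitals = 230400LL;
    int tasks_grids    = 27648;                   /* parallel tasks in grids    */
    int tasks_orbitals = 3;                       /* parallel tasks in orbitals */

    printf("grids per task    : %lld\n", grids / tasks_grids);        /* 2,160  */
    printf("orbitals per task : %lld\n", orbitals / tasks_orbitals);  /* 76,800 */
    printf("compute nodes     : %d\n", tasks_grids * tasks_orbitals); /* 82,944 */

    double peak_pflops = 82944 * 128.0 / 1.0e6;   /* about 10.6 PFLOPS peak */
    printf("SCF efficiency    : %.1f%%\n",
           5.48 / peak_pflops * 100.0);           /* close to the 51.67% in Table 2 */
    return 0;
}
```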
Yukihiro Hasegawa, Jun-Ichi Iwata, Miwako Tsuji, Daisuke Takahashi, Atsushi Oshiyama, Kazuo Minami, Taisuke Boku, Hikaru Inoue, Yoshito Kitazawa, Ikuo Miyoshi and Mitsuo Yokokawa, "Performance evaluation of ultra-large-scale first-principles electronic structure calculation code on the K computer," The International Journal of High Performance Computing Applications, 1–21, published online 17 October 2013, DOI: 10.1177/1094342013508163, http://hpc.sagepub.com/content/early/2013/10/16/1094342013508163
Highly parallel RSDFT: overall performance
43.
Parallelization of PHASE
i is the energy-band quantum number.
The code is basically parallelized over the energy bands.
Parts of it are also parallelized over the wavenumber G.
Before the G-parallel part, a transpose (data redistribution) transfer occurs so that the wave functions, which are distributed over energy bands, can be handled in a G-parallel layout.
After the G-parallel part, another transpose transfer occurs to return the G-distributed wave functions to the energy-band-parallel layout.
The cost of these transpose transfers is large (see the sketch after the equation below).
$H\,\phi_{i\mathbf{k}}(\mathbf{G}) = \varepsilon_i\,\phi_{i\mathbf{k}}(\mathbf{G})$   (i: energy-band index, G: wavenumber)
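The transpose transfers described above amount to an all-to-all redistribution of the wave-function coefficients between a band-distributed layout and a G-distributed layout. The sketch below shows one way such a transfer could be written with MPI_Alltoall; it is not the PHASE implementation, and the array layout, the function band_to_g, and the divisibility assumptions are all illustrative.

```c
/* Sketch (assumed layout, not the PHASE source): redistribute wave
 * functions between a band-parallel layout (each rank holds all G
 * components of a few bands) and a G-parallel layout (each rank holds
 * a slice of G components of all bands) with MPI_Alltoall. nb and ng
 * are assumed divisible by the number of ranks for brevity. */
#include <mpi.h>
#include <complex.h>
#include <stdlib.h>

/* psi_band : nb_local x ng       (bands distributed, G complete)
 * psi_g    : nb x ng_local       (bands complete, G distributed;
 *                                 bands ordered by source rank)      */
void band_to_g(const double complex *psi_band, double complex *psi_g,
               int nb, int ng, MPI_Comm comm)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);
    int nb_local = nb / nprocs;
    int ng_local = ng / nprocs;

    /* Pack: for each destination rank p, the ng_local G components of
     * my nb_local bands that p will own after the transpose. */
    double complex *sendbuf = malloc((size_t)nb_local * ng * sizeof *sendbuf);
    for (int p = 0; p < nprocs; ++p)
        for (int b = 0; b < nb_local; ++b)
            for (int g = 0; g < ng_local; ++g)
                sendbuf[((size_t)p * nb_local + b) * ng_local + g]
                    = psi_band[(size_t)b * ng + (size_t)p * ng_local + g];

    /* The costly all-to-all exchange highlighted on the slide. After
     * it, block q of psi_g holds my G slice of the bands owned by
     * rank q, i.e. band index = q * nb_local + b. */
    MPI_Alltoall(sendbuf, nb_local * ng_local, MPI_C_DOUBLE_COMPLEX,
                 psi_g,   nb_local * ng_local, MPI_C_DOUBLE_COMPLEX, comm);
    free(sendbuf);
}
```

The reverse transfer (G-parallel back to band-parallel) is the mirror image: the same MPI_Alltoall followed by the inverse unpacking, which is why the slide counts two such transfers per G-parallel section.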