Poor Man's Computing Revisited: Alexander Shchepetkin, I.G.P.P. UCLA
• 80-bit long registers (!) allowing 32-, 64-, and 80-bit single-, double-,
or extended-precision arithmetic. In the last case results are
truncated to 64 bits when leaving the registers (see the precision
sketch after this list).
• GNU gcc, g77 compilers (F77 but not F90); all FREE
• Intel icc and ifc compilers: F95 with OpenMP support; FREE for
Linux only
• Lahey (Fujitsu) lf95 compiler, stripped-down version (no OpenMP
support) $240; complete version $640
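The following is a minimal Fortran sketch of the extended-precision point above (not part of ROMS; it assumes a compiler that exposes an 80-bit real kind via selected_real_kind(18), as on x86 with gfortran-style compilers):

    program extended_precision_sketch
    ! Minimal sketch: 1 + 2**(-60) rounds away in a 53-bit double mantissa,
    ! but survives in the 64-bit mantissa of the 80-bit extended format
    ! held in the x86 FPU registers.
      implicit none
      integer, parameter :: dp = kind(1.d0)
      integer, parameter :: ep = selected_real_kind(18)  ! 80-bit extended, if the compiler provides it
      real(dp) :: x_dp
      real(ep) :: x_ep
      x_dp = 1.0_dp + 2.0_dp**(-60)
      x_ep = 1.0_ep + 2.0_ep**(-60)
      print *, 'double:   ', x_dp - 1.0_dp   ! prints 0.0
      print *, 'extended: ', x_ep - 1.0_ep   ! prints approx. 8.7E-19
    end program extended_precision_sketch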
Test Problem:
ROMS code, same as above (relevant features);
1/2-degree Pacific model, 384 × 224 × 30 grid ⇒ 800 MB+ problem
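As a rough check on that footprint: the grid has 384 × 224 × 30 ≈ 2.6 million points, so 800 MB works out to roughly 310 bytes, i.e. about forty 8-byte (double-precision) values, per grid point.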
Performance of the 1/2-degree, 384 × 224 × 32 grid Pacific model on a dual
2.4 GHz Xeon machine: overall, with a proper choice of partitions the dual-
Xeon machine runs as fast as 12 195 MHz R10k CPUs of an Origin
2000. It takes 10 hours of computing (wall clock) to get one model
year of simulation, which makes it a viable choice (time step 7200 sec;
mode-splitting ratio ndtfast=78; FB barotropic mode).
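As a quick arithmetic check: one model year at a 7200-second time step is 365 × 86400 / 7200 ≈ 4380 baroclinic steps, so 10 wall-clock hours corresponds to roughly 4380 / 600 ≈ 7.3 time steps per minute on the 2-CPU machine.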
What did we learn:
• Cache utilization has a major effect. The dual-Xeon machine's CPUs
are in a more "data-hungry" situation than the PIIIs above: per-CPU
computational speed has increased by a factor of 4+, while clock
speed has grown only by a factor of 2.5 and memory bandwidth by a
factor of 3.
• With the introduction of the P4, vector length is back under
consideration (the P4 has a 4× longer cache line and a pipelined regime).
• The best policy results in a subdomain size of 96×4 points: try MPI
code with 4-point-wide subdomains + 2 ghost points on each side!
Given that the total storage size for this problem is 850 MBytes, an
intuitive "rule of thumb" suggests that 1700 subdomains are needed to
fit into a 512 KB cache. In practice the number is significantly smaller,
about 160. This is because not all model arrays are used in each
parallel region, and, more importantly, the best effort is already
made to organize loops within subdomains in a cache-friendly manner:
the operations are arranged to sweep either in horizontal x-y or
vertical x-z planes, with intermediate results placed in two-dimensional
(almost never three-dimensional) tile-sized scratch arrays, so that
successful cache management is achieved when a sufficient number of
these rather small scratch arrays fit into cache.
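A minimal Fortran sketch of this loop organization (hypothetical array names and a simple vertical difference as the stand-in operation; it is not the actual ROMS code): the horizontal grid is covered by 96×4 tiles, and within each tile the work sweeps one x-z plane (one j-row) at a time through a tile-sized 2D scratch array.

    subroutine tile_sweep_sketch(N_x, N_y, N_z, t, dz)
    ! Cache-friendly tiling sketch: loop over 96x4 tiles of the horizontal
    ! grid; inside each tile sweep j-row by j-row in x-z planes, keeping
    ! intermediate results in a tile-sized 2D scratch array.
      implicit none
      integer, intent(in) :: N_x, N_y, N_z
      real(kind=8), intent(in)  :: t(N_x,N_y,N_z)   ! e.g. a tracer field
      real(kind=8), intent(out) :: dz(N_x,N_y,N_z)  ! e.g. its vertical difference
      integer, parameter :: tile_x=96, tile_y=4     ! subdomain (tile) size
      real(kind=8) :: scratch(tile_x,N_z)           ! 2D, tile-sized scratch array
      integer :: i0, j0, i, j, k, ie

      do j0=1,N_y,tile_y                            ! loop over tiles in y
        do i0=1,N_x,tile_x                          ! loop over tiles in x
          ie=min(i0+tile_x-1,N_x)
          do j=j0,min(j0+tile_y-1,N_y)              ! one x-z plane at a time
            do k=1,N_z
              do i=i0,ie
                scratch(i-i0+1,k)=t(i,j,k)          ! intermediate result stays in cache
              enddo
            enddo
            do i=i0,ie
              dz(i,j,1)=0.d0
            enddo
            do k=2,N_z
              do i=i0,ie
                dz(i,j,k)=scratch(i-i0+1,k)-scratch(i-i0+1,k-1)
              enddo
            enddo
          enddo
        enddo
      enddo
    end subroutine tile_sweep_sketch

The working set of the inner sweeps is then the scratch array, 96 × N_z × 8 bytes ≈ 24 KB for N_z = 32, rather than the full 3D fields.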
number of CPUs              4      8     12     16     24     32     48     64     96
partition                 2×2    2×4    3×4    4×4    3×8    4×8   3×16    8×8   6×16
run time, seconds        4130   2154   1118    868    521    413    241    173    139
time steps/minute        7.44   14.2   27.5   35.4   58.9   74.4  127.5  177.5  221.0
time steps/minute/node*  3.71   3.57   4.57   4.42   4.91   4.65   5.29   5.53   4.60
The term "node", marked by the asterisk (*) in the last line, means a whole
dual-CPU PowerEdge 1750 node, hence node = 2 CPUs.
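The table entries appear consistent with a fixed-length benchmark of about 512 time steps per run (e.g. for the 2×2 partition, 7.44 steps/min × 4130 s / 60 ≈ 512), so time steps/minute ≈ 512 × 60 / (run time in seconds) across all columns.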
I/O during these runs is done from/into separate files, individually for
each MPI node, using the standard (non-parallel) netCDF library. This
leads to scalable I/O; however, pre- and post-processing is required to
convert the data into a more conventional form.
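A minimal Fortran sketch of this per-node output scheme (hypothetical file name, assuming the netCDF Fortran 90 interface and MPI; it is not the ROMS output code): each MPI rank simply creates and writes its own file, and the pieces are merged in post-processing.

    program per_rank_output_sketch
    ! Each MPI rank creates and writes its own (non-parallel) netCDF file;
    ! the per-rank pieces are stitched together afterwards.
      use mpi
      use netcdf
      implicit none
      integer :: ierr, rank, ncid, status
      character(len=64) :: fname

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

      write(fname,'(A,I3.3,A)') 'ocean_his.', rank, '.nc'   ! e.g. ocean_his.012.nc on rank 12

      status = nf90_create(trim(fname), NF90_CLOBBER, ncid)
      ! ... define the dimensions/variables of this rank's subdomain
      !     and write its portion of the fields here ...
      status = nf90_close(ncid)

      call MPI_Finalize(ierr)
    end program per_rank_output_sketch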
The machinery and the code exhibit good overall scaling, which is
manifested in a nearly proportional decrease of run time with the number of
CPUs, as well as non-degrading per-node computational performance,
shown in the bottom line of the table. However, the per-node performance
above is significantly lower than the 8.9 time steps/minute of the
properly optimized OpenMP code running on 2 CPUs (just one node
of the Linux cluster) using a fine 3 × 56 partitioning to better utilize
the processor's cache. The mild super-linear scaling and partial recovery of
per-node performance on 48 and 64 CPUs above (bottom line, highlighted)
is attributed to cache effects due to the effective reduction of the problem
size solved on each node.