ELPA Library
For static calculations, we recommend the ELPA library, which performs better than ScaLAPACK. In particular, ELPA can use GPUs, which significantly speeds up the diagonalization. To activate ELPA, set the following in predefines.h:
// select diagonalization routine
#define DIAGONALIZATION_ROUTINE ELPA
You also need to inspect the following part carefully:
// ---------------------- ELPA SETTINGS ---------------------------
// Fill this part only if ELPA library is used for diagonalization
// uncomment it if you want to activate GPU for diagonalizations
#define ELPA_USE_GPU
// Select ELPA kernels
#define ELPS_USE_SOLVER ELPA_SOLVER_1STAGE
#define ELPA_USE_COMPLEX_KERNEL ELPA_2STAGE_COMPLEX_DEFAULT
#define ELPA_USE_REAL_KERNEL ELPA_2STAGE_REAL_DEFAULT
// Fraction of eigenvectors to be extracted in each cycle.
// 1.0 corresponds to extraction of all eigenvectors (USE IT IF YOU ARE NOT SURE)
// NOTE: value of this parameter should ensure that all eigenstates below the requested Ec are extracted.
// NOTE: For 3D case this value typically can be set to 0.78
#define ELPA_NEV_FRACTION 1.0
Documentation
- Eigenvalue SoLvers for Petaflop-Applications (ELPA)
- Wiki: Eigenvalue SoLvers for Petaflop-Applications (ELPA)
- ELPA installation guide
Publications about ELPA performance
ScaLAPACK library
If the target system does not provide the ELPA library, the standard diagonalization library ScaLAPACK can be used instead. The W-SLDA Toolkit supports the following ScaLAPACK diagonalization engines:
#define DIAGONALIZATION_ROUTINE PZHEEVR
or
#define DIAGONALIZATION_ROUTINE PZHEEVD
PZHEEVR is the recommended choice. This engine takes advantage of the fact that typically only a fraction of the eigenstates is extracted. However, in some rare, system-dependent cases this routine does not work correctly. In such cases, use PZHEEVD instead.
Benchmarks & Scalings
All tests correspond to the extraction of all eigenvectors.
Table
matrix size | p | q | mb | nb | prec. | routine | system | time [sec] | cost |
---|---|---|---|---|---|---|---|---|---|
32,768 = 2x128^2 | 6 | 8 | 16 | 16 | real | ELPA (2-GPU) | Cygnus | 93 | 0.052 nh |
45,000 = 2x150^2 | 6 | 8 | 16 | 16 | real | ELPA (2-GPU) | Cygnus | 217 | 0.12 nh |
45,000 = 2x150^2 | 6 | 8 | 16 | 16 | complex | ELPA (2-GPU) | Cygnus | 860 | 0.478 nh |
65,536 = 2x32^3 | 24 | 28 | 32 | 32 | complex | ELPA (2-GPU) | Summit | 115 | 0.52 nh |
65,536 = 2x32^3 | 24 | 28 | 8 | 8 | complex | ELPA (1-GPU) | Summit | 118 | 0.52 nh |
128,000 = 2x40^3 | 24 | 28 | 8 | 8 | complex | ELPA (1-GPU) | Summit | 435 | 1.93 nh |
128,000 | 24 | 28 | 32 | 32 | complex | ELPA (2-GPU) | Summit | 511 | 2.27 nh |
128,000 = 2x40^3 | 20 | 20 | 32 | 32 | complex | ELPA (1-GPU) | Daint | 220 | 24.4 nh |
128,000 | 54 | 64 | 32 | 32 | complex | ELPA (2-CPU) | Daint | 677 | 54.1 nh |
128,000 | 54 | 64 | 32 | 32 | complex | PZHEEVR | Daint | 945 | 75.6 nh |
147,456 = 4x64x24^2 | 24 | 25 | 32 | 32 | complex | ELPA (1-GPU) | Daint | 375 | 62.5 nh |
147,456 = 2x768x96 | 18 | 18 | 16 | 16 | double | ELPA (1-GPU) | Daint | 395 | 35.6 nh |
221,184 = 2x48^3 | 46 | 84 | 32 | 32 | complex | ELPA (2-GPU) | Summit | 603 | 15.4 nh |
221,184 | 46 | 84 | 16 | 16 | complex | ELPA (1-GPU) | Summit | 736 | 18.8 nh |
221,184 | 46 | 84 | 16 | 16 | complex | ELPA (2-GPU) | Summit | 3098 | 79.2 nh |
221,184 | 46 | 84 | 16 | 16 | complex | PZHEEVD | Summit | 5995 | 153.2 nh |
500,000 = 2x50^2x100 | 96 | 112 | 16 | 16 | complex | ELPA (1-GPU) | Summit | 2,109 | 150.0 nh |
524,288 = 2x64^3 | 96 | 112 | 16 | 16 | complex | ELPA (1-GPU) | Summit | 2,217 | 157.7 nh |
746,496 = 2x72^3 | 112 | 192 | 16 | 16 | complex | ELPA (1-GPU) | Summit | 3,436 | 488.7 nh |
1,769,472 = 2x96^3 | 300 | 560 | 32 | 32 | complex | ELPA (1-GPU) | Summit | 52,024 | 57,804 nh |
- (1-GPU): ELPA_SOLVER_1STAGE, ELPA_2STAGE_COMPLEX_GPU or ELPA_2STAGE_REAL_GPU
- (1-CPU): ELPA_SOLVER_1STAGE, ELPA_2STAGE_COMPLEX_DEFAULT or ELPA_2STAGE_REAL_DEFAULT
- (2-GPU): ELPA_SOLVER_2STAGE, ELPA_2STAGE_COMPLEX_GPU or ELPA_2STAGE_REAL_GPU
- (2-CPU): ELPA_SOLVER_2STAGE, ELPA_2STAGE_COMPLEX_DEFAULT or ELPA_2STAGE_REAL_DEFAULT
Plots
These scalings are derived empirically: points correspond to actual measurements on the target system, while the line shows a fit of the ideal scaling for level-3 routines ($\sim N^3$).
Summit
The scaling was derived within the ALCC grant Quantum Turbulence in Fermi Superfluids.
- Raw data: summit-scaling.txt
- Gnuplot script: summit-scaling.gp