General info
The spirit of the SLDA method is to exploit only local densities during the computation. This feature makes the method an excellent candidate for massively multithreaded computing units like GPUs. Instead of iterating over all lattice points (as in a CPU implementation), we can create NX x NY x NZ concurrent and independent threads, one per lattice point, and assign to each thread all operations related to that point, either in position or in momentum space, see Fig. below. Switching between the spaces is performed by the parallel cuFFT implementation, which is more than 100 times faster than a CPU implementation (like FFTW).
MPI space and GPUs
Time-dependent codes evolve quasiparticle wave functions (qpwfs), whose number depends mainly on the lattice size and the value of the cut-off energy `ec`. The number of wave functions is printed under the name `nwf`, for example:
# INIT2: nwf=46032 wave-functions to scatter
Quasiparticle wave functions are distributed uniformly among `np` MPI processes. For example, if the case above is executed with `np=16`, then each process is responsible for evolving `nwfip = 46032/16 = 2877` of them. Qpwfs are evolved by GPUs, which requires that a GPU be assigned to each MPI process. Suppose that the code is executed on 4 nodes, each equipped with 4 GPUs. Consider the following execution command:
mpiexec -ppn 4 -np 16 ./td-wslda-2d input.txt
where:
- `ppn`: processes per node,
- `np`: number of processes,
- in the input file: `gpuspernode 4`.

When executing the code, the following mapping `MPI Process <--> GPU` will be applied:
In this case each MPI process is connected to one GPU.
Alternatively, one can use:
mpiexec -ppn 8 -np 32 ./td-wslda-2d input.txt
and the distribution will be as follows:
In this case each GPU evolves qpwfs of two MPI processes.
To learn more about the `MPI <--> GPU` mapping see: Configuring GPU machine.
Number of MPI processes per GPU: performance notes
If the number of lattice points:
- 3d code: `N = NX x NY x NZ`,
- 2d code: `N = NX x NY`,
- 1d code: `N = NX`,

satisfies the following criterion: `N >> number_of_CUDA_cores`,
then it is recommended to run the code with the number of MPI processes equal to the number of GPUs, i.e. each GPU is assigned to exactly one MPI process. The number of CUDA cores depends on the GPU type, but typically it is of the order of a few thousand. If the condition is not satisfied, the user may consider assigning many MPI processes to a single GPU, as this can provide better performance.