# Parallelization scheme of time-dependent codes
Authored Feb 20, 2026 by Gabriel Wlazłowski.
# General info
The spirit of the SLDA method is to exploit only local densities during computation. This feature makes the method an excellent candidate for multithreaded computing units such as GPUs. Instead of iterating over all lattice points (as in a CPU implementation), we can create `NX x NY x NZ` concurrent and independent threads, one per lattice point, and assign to each thread all operations related to its point, either in position or momentum space; see the figure below. Switching between spaces is performed by a parallel cuFFT implementation, which can be more than 100 times faster than a CPU implementation (such as FFTW).
![td-pscheme](uploads/b20710aa7608e7240a6f36aa32ac5659/td-pscheme.png)
# MPI space and GPUs
Time-dependent codes evolve quasiparticle wave functions (qpwfs), whose number depends mainly on the lattice size and the value of the cutoff energy `ec`. The number of wave functions is printed under the name `nwf`, for example:
```
# INIT2: nwf=46032 wave-functions to scatter
```
Quasiparticle wave functions are distributed uniformly among the `np` MPI processes. For example, if the above example is executed with `np=16`, then each process is responsible for evolving `nwfip = 46032/16 = 2877` qpwfs. Qpwfs are evolved by GPUs, which requires that each MPI process be assigned a GPU. Suppose the code is executed on `4 nodes`, each equipped with `4 GPUs`. Consider the following execution command:
```bash
mpiexec -ppn 4 -np 16 ./td-wslda-2d input.txt
```
When executing the code, the following mapping `MPI Process <--> GPU` will be applied:
![td-scheme-2](uploads/392f47833c1d499edf7ce504cfbf277f/td-scheme-2.png)
In this case, each MPI process is connected to one GPU.
Alternatively, one can use:
```bash
mpiexec -ppn 8 -np 32 ./td-wslda-2d input.txt
```
and the distribution will be as follows:
![td-scheme](uploads/c6f4cd2418855a360754cb5baaaf6f0c/td-scheme.png)
In this case, each GPU evolves the qpwfs of two MPI processes.
To learn more about the `MPI <--> GPU` mapping, see: [Configuring GPU machine](Configuring-GPU-machine).
# Number of MPI processes per GPU: performance notes
If the number of lattice points:
* 3d code: `N = NX x NY x NZ`,
* 2d code: `N = NX x NY`,
* 1d code: `N = NX`,
satisfies the criterion `N >> number_of_CUDA_cores`, then it is recommended to run the code with the number of MPI processes equal to the number of GPUs, i.e., each GPU is assigned to only one MPI process. The number of CUDA cores depends on the GPU type, but it is typically of the order of a few thousand. If the condition is not satisfied, the user may consider assigning multiple MPI processes to a single GPU, as this can improve performance.