Last edited by Gabriel Wlazłowski Feb 20, 2026
Parallelization scheme of time-dependent codes

General info

The spirit of the SLDA method is to exploit only local densities during computation. This feature makes the method an excellent candidate for multithreaded computing units like GPUs. Instead of iterating over all lattice points (as in a CPU implementation), we can create NX x NY x NZ concurrent and independent threads, one per lattice point, and assign to each thread all operations related to that point, either in position or momentum space; see the figure below. Switching between the two spaces is performed by the parallel cuFFT implementation, which can be more than 100 times faster than a CPU implementation (such as FFTW).

[Figure: td-pscheme]
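To illustrate the one-thread-per-lattice-point idea, here is a minimal Python sketch (illustrative only, not the actual wslda source) of the index mapping a GPU kernel performs when flattening a lattice point (ix, iy, iz) into a linear thread id and back:

```python
# Sketch: one GPU thread per lattice point, identified by a flat index.
NX, NY, NZ = 4, 4, 4  # small example lattice

def thread_index(ix, iy, iz):
    # Linear thread id for lattice point (ix, iy, iz), analogous to
    # blockIdx.x * blockDim.x + threadIdx.x in a CUDA kernel.
    return ix + NX * (iy + NY * iz)

def lattice_point(tid):
    # Inverse mapping: recover (ix, iy, iz) from the thread id.
    ix = tid % NX
    iy = (tid // NX) % NY
    iz = tid // (NX * NY)
    return ix, iy, iz

# Every lattice point gets exactly one thread id, and the mapping is invertible.
assert all(lattice_point(thread_index(x, y, z)) == (x, y, z)
           for x in range(NX) for y in range(NY) for z in range(NZ))
```

Each thread then performs all operations for its point, whether the current representation lives in position or momentum space.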

MPI space and GPUs

Time-dependent codes evolve quasiparticle wave functions (qpwfs), whose number depends mainly on the lattice size and the cutoff energy ec. The number of wave-functions is printed under the name nwf, for example:

# INIT2: nwf=46032 wave-functions to scatter
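The uniform split of these wave functions among MPI processes can be illustrated with a short arithmetic sketch (hypothetical Python, using the nwf value from the log line above):

```python
# Illustrative sketch: splitting nwf quasiparticle wave functions
# uniformly among np MPI processes (not the actual wslda source).
nwf = 46032   # from the INIT2 log line above
np_ = 16      # number of MPI processes in the example

# Per-process share; here nwf happens to be divisible by np.
nwfip = nwf // np_
assert nwfip == 2877
```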

Quasiparticle wave functions are distributed uniformly among np MPI processes. For example, if the above run is executed with np=16, then each process is responsible for evolving nwfip = 46032/16 = 2877 wave functions. The qpwfs are evolved on GPUs, which requires that each MPI process be assigned a GPU. Suppose the code is executed on 4 nodes, each with 4 GPUs. Consider the following execution command:

mpiexec -ppn 4 -np 16 ./td-wslda-2d input.txt

where:

  • ppn: processes per node,
  • np: number of processes,
  • in input file: gpuspernode 4. When executing the code, the following mapping MPI Process <--> GPU will be applied:

[Figure: td-scheme-2]

In this case, each MPI process is connected to one GPU.
Alternatively, one can use:

mpiexec -ppn 8 -np 32 ./td-wslda-2d input.txt

and the distribution will be as follows:

[Figure: td-scheme]

In this case, each GPU evolves qpwfs of two MPI processes.
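One plausible round-robin mapping consistent with both pictures can be sketched as follows (the function and its modulo scheme are an assumption for illustration; the actual mapping is documented in Configuring GPU machine):

```python
# Hypothetical sketch of an MPI rank -> GPU mapping; not the actual
# wslda implementation, which is described in "Configuring GPU machine".
def gpu_for_rank(rank, ppn, gpuspernode):
    local_rank = rank % ppn          # rank within its node
    return local_rank % gpuspernode  # GPU assigned on that node

# Case 1: ppn=4, np=16, gpuspernode=4 -> one MPI process per GPU
assert [gpu_for_rank(r, 4, 4) for r in range(4)] == [0, 1, 2, 3]

# Case 2: ppn=8, np=32, gpuspernode=4 -> two MPI processes per GPU
assert [gpu_for_rank(r, 8, 4) for r in range(8)] == [0, 1, 2, 3, 0, 1, 2, 3]
```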

To learn more about MPI <--> GPU mapping, see: Configuring GPU machine.

Number of MPI processes per GPU: performance notes

If the number of lattice points:

  • 3d code: N = NX x NY x NZ,
  • 2d code: N = NX x NY,
  • 1d code: N = NX,

satisfies the criterion N >> number_of_CUDA_cores, then it is recommended to run the code with the number of MPI processes equal to the number of GPUs, i.e., with each GPU assigned to only one MPI process. The number of CUDA cores depends on the GPU model but is typically a few thousand. If the condition is not satisfied, the user may consider assigning multiple MPI processes to a single GPU, as this can improve performance.
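The criterion can be sketched as a simple check (the factor of 10 used here to interpret ">>" is an illustrative assumption, not a value from the wslda documentation):

```python
# Illustrative check of the N >> number_of_CUDA_cores criterion.
def one_process_per_gpu(n_lattice, cuda_cores, factor=10):
    # Interpret "N >> cores" as N exceeding the core count by `factor`
    # (the factor is an assumption made for this sketch).
    return n_lattice >= factor * cuda_cores

# 3d example: 64^3 lattice on a GPU with ~5000 CUDA cores
N3 = 64 * 64 * 64                       # 262144 lattice points
assert one_process_per_gpu(N3, 5000)    # one MPI process per GPU

# 2d example: 64x64 lattice -> criterion not satisfied
assert not one_process_per_gpu(64 * 64, 5000)  # consider several processes per GPU
```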
