Time-dependent codes evolve quasiparticle wave functions (qpwfs), whose number depends on the size of the studied problem.
|
|
|
Quasiparticle wave functions are distributed uniformly among the `np` MPI processes. For example, if a run that evolves `46032` qpwfs is executed with `np=32`, then each process is responsible for evolving `nwfip = 46032/32 = 1438.5` qpwfs; in practice, each process evolves either 1438 or 1439 qpwfs.
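A minimal sketch of such a uniform split is given below. It is an illustration only (the actual bookkeeping inside the code may differ); the names `nwf` and `nwfip` are reused from the text:

```c
#include <stdio.h>

/* Illustration: distribute nwf qpwfs uniformly among np MPI processes.
 * Here the first (nwf % np) processes simply receive one extra qpwf. */
int main(void)
{
    const int nwf = 46032;   /* total number of qpwfs (example above) */
    const int np  = 32;      /* number of MPI processes               */

    for (int rank = 0; rank < np; rank++)
    {
        int nwfip = nwf / np + (rank < nwf % np ? 1 : 0);
        printf("process %2d evolves nwfip = %d qpwfs\n", rank, nwfip);
    }
    return 0;
}
```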
|
|
Qpwfs are evolved by GPUs, which requires that a GPU be assigned to each MPI process. Suppose that the code is executed on `4 nodes`, and each node is equipped with `4 GPUs`. Consider the following execution command:

```bash
mpiexec -ppn 4 -np 16 ./td-wslda-2d input.txt
```
|
|
where:

* `ppn`: number of processes per node,
* `np`: total number of processes,
* `gpuspernode 4` in the input file: number of GPUs available on each node (see the fragment below).
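For reference, the corresponding input-file entry would look as follows (an illustrative fragment; only the `gpuspernode` setting is taken from the setup above, and the exact formatting of your `input.txt` may differ):

```
gpuspernode    4
```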
|
|
|
|
When executing the code, the following mapping `MPI Process <--> GPU` will be applied:
|
|
|
|
|
|
|
|
![td-scheme-2](uploads/99212e2b948a20702ce6c6937c176b8e/td-scheme-2.png)
|
|
|
|
|
|
|
|
In this case each MPI process is connected to one GPU.
|
|
|
|
Alternatively, one can use:
|
|
|
|
```bash
|
|
|
|
mpiexec -ppn 8 -np 32 ./td-wslda-2d input.txt
|
|
|
|
```
|
|
|
|
and the distribution will be as follows:
|
|
|
|
![td-scheme](uploads/c6f4cd2418855a360754cb5baaaf6f0c/td-scheme.png)
|
|
|
|
|
|
|
|
In this case each GPU evolves qpwfs of two MPI processes.
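As an illustration only (this generic sketch is not the actual td-wslda implementation; the round-robin choice and names such as `local_rank` are assumptions), such a mapping can be realized by letting each process select a GPU according to its rank within the node:

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

/* Sketch of a generic "MPI process <--> GPU" round-robin mapping:
 * processes sharing a node pick GPUs 0, 1, ..., gpuspernode-1 in turn. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* rank of this process among the processes running on the same node */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    const int gpuspernode = 4;              /* as set in input.txt        */
    int device = local_rank % gpuspernode;  /* round-robin GPU assignment */
    cudaSetDevice(device);

    printf("rank %d (local rank %d) -> GPU %d\n", rank, local_rank, device);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```

With `ppn = gpuspernode` (first command) every process gets its own GPU; with `ppn = 2*gpuspernode` (second command) two processes share each GPU.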
|
|
|
|
|
|
|
|
To learn more about the `MPI <--> GPU` mapping, see [Configuring GPU machine](Configuring GPU machine).
|
|
|
|
|
|
|
|
# Number of MPI processes per GPU: performance notes
|
|
|
|
If the number of lattice points:

* 3d code: `N = NX x NY x NZ`,
* 2d code: `N = NX x NY`,
* 1d code: `N = NX`,

satisfies the criterion `N >> number_of_CUDA_cores`, then it is recommended to run the code with the number of MPI processes equal to the number of GPUs, i.e. each GPU is assigned to exactly one MPI process. The number of CUDA cores depends on the GPU type, but typically it is of the order of a few thousand. If the condition is not satisfied, the user may consider assigning several MPI processes to a single GPU, as this can provide better performance.
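As a rough worked example (the lattice sizes and the core count below are illustrative only, not taken from a specific machine):

```c
#include <stdio.h>

/* Illustration of the N >> number_of_CUDA_cores criterion. */
int main(void)
{
    const double cuda_cores = 5000.0;      /* typical order of magnitude for a modern GPU */

    const double N3d = 100.0 * 100 * 100;  /* 3d lattice, NX = NY = NZ = 100 */
    const double N1d = 1024.0;             /* 1d lattice, NX = 1024          */

    /* ratio >> 1: one MPI process per GPU; ratio of order 1 or less:
     * several MPI processes per GPU may perform better */
    printf("3d: N / cores = %.1f\n", N3d / cuda_cores);  /* 200.0 */
    printf("1d: N / cores = %.1f\n", N1d / cuda_cores);  /* 0.2   */
    return 0;
}
```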
|
|
|