Gabriel Wlazłowski · 90129b7d
--- a/Parallelization-scheme-of-static-codes.md
+++ b/Parallelization-scheme-of-static-codes.md
+The most time-consuming part of the static codes is the diagonalization of the BdG matrix. To use the `st-wslda` codes efficiently, it matters how the BdG matrix is decomposed among the computing processes. The decomposition depends on the dimensionality variant of the code.
+ 
+# st-wslda-3d
+## Block-cyclic decomposition of BdG matrix 
+
+The code diagonalizes one matrix of size `matrix_size=2*NX*NY*NZ` per iteration. The matrix is distributed among the MPI processes in a block-cyclic (bc) fashion. To understand the idea of the bc decomposition, let us suppose that our (artificial) BdG matrix has size 9x9 (in practice this cannot happen, since the size is always an even number). The code will be executed on `np=6` processes:
+```bash
+mpirun -np 6 ./st-wslda-3d input.txt
+```
+In the input file we set:
+```
+# BLACS grid 
+p                       2
+q                       3
+
+# data block size 
+mb                      2
+nb                      2
+```
+Then the matrix will be distributed in the following fashion:  
+![bc-decomp](uploads/2cffebe8b09bbb6fef13c909065b3899/bc-decomp.png)
+where the colors indicate the matrix elements handled by the different MPI processes (6 different colors):
+![bc-decomp-2](uploads/d5d9bf64c421faefecd6967fde32dd66/bc-decomp-2.png)  
+
+To learn more about bc decomposition see [here](http://wlazlowski.fizyka.pw.edu.pl/pdfs/dydaktyka/NTO/meeting-7.pdf). 
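
The mapping from a matrix element to the process that owns it can be sketched in a few lines of Python. This is a minimal illustration of the standard two-dimensional block-cyclic rule (zero-based indices, distribution starting at process row 0 and process column 0), not an excerpt from `st-wslda`; the function name `owner` is hypothetical.

```python
from collections import Counter


def owner(i, j, p, q, mb, nb):
    """Return the grid coordinates (process row, process column) owning element (i, j)."""
    # Block index along each direction, wrapped cyclically over the process grid.
    return (i // mb) % p, (j // nb) % q


# Settings of the example: 9x9 matrix, 2x3 process grid, 2x2 blocks.
n, p, q, mb, nb = 9, 2, 3, 2, 2
counts = Counter(owner(i, j, p, q, mb, nb) for i in range(n) for j in range(n))

# Number of matrix elements stored by each process of the grid.
for coords in sorted(counts):
    print(coords, counts[coords])
```

For this small example the load is uneven (the processes hold between 8 and 20 elements), because the matrix size is not a multiple of the block size times the grid dimension. For realistic matrix sizes with small blocks the relative imbalance becomes small.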
+
+## Selecting p, q, mb and nb
+By construction, the following constraint must be satisfied: `np=p*q`. When selecting the decomposition parameters, the user should take the following into account:
+* Typically, the best performance is achieved for `p=q`, so it is recommended to select `p` and `q` as close to each other as possible. If `p` and `q` are commented out, the code selects their values automatically to satisfy this requirement.  
+*Note*: Often the constraint `p=q` cannot be satisfied. In that case we found empirically that settings with `p<q` give better performance than settings with `p>q`. For example, if we run the code with `np=24`, then the two options closest to a square grid are `(p,q)=(4,6)` and `(p,q)=(6,4)`. Based on our experience, we expect the setting `(p,q)=(4,6)` to perform better. 
+* By construction, the block sizes must satisfy the following constraints: `mb<=matrix_size/p` and `nb<=matrix_size/q`. For decomposition routines (such as diagonalization), setting the block sizes to their maximal values does not provide good performance. We find that the best performance is obtained when `mb` and `nb` are much smaller than their maximal allowed values and, at the same time, the number of matrix elements in a block, `mb*nb`, is significant (of the order of a hundred or more). Empirically, good performance is typically obtained for block sizes of 16, 32, or 64 (powers of 2). We recommend that the user try these values and, based on the results, decide whether a further increase or decrease is profitable. 
+* If `p=q=1`, no parallelization is applied. This corresponds to the single-CPU version of the code. The code should work with these settings as well (but do not expect to be able to solve large problems). 
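
The rules above can be condensed into a small helper. The following Python sketch is only an illustration: `choose_grid` and `blocks_allowed` are hypothetical names, and the selection rule (the most square factorization with `p<=q`) is an assumption based on the recommendations above, not the algorithm actually implemented in the code.

```python
import math


def choose_grid(np_total):
    """Pick (p, q) with p*q == np_total, p <= q, and p as close to q as possible."""
    p = math.isqrt(np_total)
    # Walk down from floor(sqrt(np)) until a divisor is found; p = 1 always works.
    while np_total % p != 0:
        p -= 1
    return p, np_total // p


def blocks_allowed(matrix_size, p, q, mb, nb):
    """Check the constraints mb <= matrix_size/p and nb <= matrix_size/q."""
    return mb <= matrix_size / p and nb <= matrix_size / q


print(choose_grid(24))                  # (p, q) for np=24
print(blocks_allowed(9, 2, 3, 2, 2))    # settings of the 9x9 example
```

The check only tells whether the block sizes are allowed; whether they perform well still has to be determined by trying a few values such as 16, 32, and 64.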
+
+# st-wslda-2d
+In this variant the code assumes that the quasi-particle wave functions have the form:
+```math
+\psi(x,y,z)=\varphi(x,y)\frac{1}{\sqrt{L_z}}e^{ik_z z}
+```
+where
+```math
+k_z = 0, \pm 1 \frac{2\pi}{L_z}, \pm 2 \frac{2\pi}{L_z}, \ldots , +(N_z-1) \frac{2\pi}{L_z}
+```
+and $`L_z = NZ*DZ`$ is the box length along the z-direction. From a physical point of view, this means that we impose translational symmetry along the z-direction. Under this assumption the BdG matrix acquires a block-diagonal form:
+![HBdG-2d](uploads/9659540e2c865b7a6736e55cec10aa50/HBdG-2d.png)  
+and the diagonalization of the matrix is equivalent to diagonalizations of the submatrices $`H(k_{z,i})`$, each of size `matrix_size=2*NX*NY`. Moreover, the translational symmetry imposes that $`H(k_{z})=H(-k_{z})`$, so in practice it is sufficient to diagonalize only the submatrices for positive $`k_z`$, which takes $`NZ/2`$ values. The submatrices can be diagonalized simultaneously.  
+
+To demonstrate the parallelization scheme in the 2D case, let us consider the following lattice:
+```c
+#define NX 8
+#define NY 10
+#define NZ 12
+```
+As in the previous example (3D case), in the input file we set:
+```
+# BLACS grid 
+p                       2
+q                       3
+```
+and we execute the code with `np=24` processes:
+```bash
+mpirun -np 24 ./st-wslda-2d input.txt
+```
+For these settings, a single iteration requires $`NZ/2=6`$ diagonalizations. The full set of processes will be divided into subgroups, each of size `p*q=6`. Thus the number of subgroups will be `24/6=4`. Each submatrix will be decomposed in block-cyclic fashion among `p*q` processes, as in the 3D case.
+This information is provided in the code output:
+```
+# CODE: ST-WSLDA-2D
+# LATTICE: 8 x 10 x 12
+ ...
+# NUMBER OF PLAN WAVES TO CONSIDER: 6
+# SETTINGS 4 KZGROUPS, EACH WITH GRID PROCESSES OF SIZE [2 x 3]
+# GROUP 0 WITH 6 PROCESSES HAS BEEN SUCCESSFULLY CREATED.
+# GROUP 1 WITH 6 PROCESSES HAS BEEN SUCCESSFULLY CREATED.
+# GROUP 2 WITH 6 PROCESSES HAS BEEN SUCCESSFULLY CREATED.
+# GROUP 3 WITH 6 PROCESSES HAS BEEN SUCCESSFULLY CREATED.
+# GROUP 0 COMPUTES FOR 2 k-values [0,2)
+# GROUP 1 COMPUTES FOR 2 k-values [2,4)
+# GROUP 2 COMPUTES FOR 1 k-values [4,5)
+# GROUP 3 COMPUTES FOR 1 k-values [5,6)
+# HAMILTONIAN SIZE: 160 x 160
+# HAMILTONIAN TOTAL STORAGE: 0.39MB
+# CREATING CBLACS GRIDs OF SIZE (pzheev): [2 x 3]
+```
+Note that here the Hamiltonian size means the size of a submatrix: `160=2*8*10`. 
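
The assignment of k-values to groups shown in the output can be reproduced with a simple contiguous split in which the first groups receive one extra value when the division is not even. The Python sketch below is an illustration consistent with the output above, not the actual implementation; the function name `split_k_values` is hypothetical, and the storage estimate assumes 16-byte double-precision complex matrix elements.

```python
def split_k_values(nk, ngroups):
    """Split indices 0..nk-1 into contiguous half-open ranges [start, stop)."""
    base, extra = divmod(nk, ngroups)
    ranges, start = [], 0
    for g in range(ngroups):
        # The first `extra` groups take one additional k-value.
        stop = start + base + (1 if g < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


NX, NY, NZ = 8, 10, 12
np_total, p, q = 24, 2, 3

ngroups = np_total // (p * q)   # number of subgroups
nk = NZ // 2                    # number of submatrices per iteration
for g, (start, stop) in enumerate(split_k_values(nk, ngroups)):
    print(f"group {g}: {stop - start} k-values [{start},{stop})")

matrix_size = 2 * NX * NY
storage_mb = matrix_size**2 * 16 / 2**20   # assumed 16 bytes per element
print(f"{matrix_size} x {matrix_size}, {storage_mb:.2f}MB")
```

With these settings groups 0 and 1 each process two submatrices one after the other, while groups 2 and 3 process one submatrix each.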
+The computation process for a single iteration is presented schematically in the figure below:
+![pscheme-2d](uploads/4792eecc243913bfd2e18d0fef09c90e/pscheme-2d.png)  
+ 
+and it is reflected in the code output:
+```
+# DIAGONALIZATION 1 2...
+# DIAGONALIZATION 1 4...
+# DIAGONALIZATION 1 5...
+# DIAGONALIZATION 1 0...
+# DIAGONALIZATION 1 5 DONE [0 sec] (EXTRACTED 38 STATES)
+# DIAGONALIZATION 1 4 DONE [0 sec] (EXTRACTED 64 STATES)
+# DIAGONALIZATION 1 2 DONE [0 sec] (EXTRACTED 106 STATES)
+# DIAGONALIZATION 1 0 DONE [0 sec] (EXTRACTED 120 STATES)
+# DIAGONALIZATION 1 1...
+# DIAGONALIZATION 1 3...
+# DIAGONALIZATION 1 3 DONE [0 sec] (EXTRACTED 102 STATES)
+# DIAGONALIZATION 1 1 DONE [0 sec] (EXTRACTED 118 STATES)
+# NWF=976
+# Number of nwf in [-ecut,+ecut] to be extracted is: 976 (50.8% of total number of states)
+```
+## Selecting p, q, and np
+By construction, the following constraint must be satisfied: `np/(p*q) = i`, where `i` is an integer (the number of subgroups) and `i<=NZ/2`. If `p` and `q` are not specified in the input file, the code will select them in such a way as to maximize the number of simultaneous diagonalizations. Concerning the matrix distribution, all requirements described for the 3D case apply here as well.
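
A quick consistency check of `np`, `p`, `q` and `NZ` before submitting a job can be written as follows. This is a hedged Python sketch of the constraint stated above; the function `number_of_groups` is hypothetical and is not part of `st-wslda`.

```python
def number_of_groups(np_total, p, q, nz):
    """Return the number of subgroups i = np/(p*q), or raise if the constraints fail."""
    group_size = p * q
    if np_total % group_size != 0:
        raise ValueError(f"np={np_total} is not a multiple of p*q={group_size}")
    groups = np_total // group_size
    if groups > nz // 2:
        raise ValueError(f"{groups} subgroups exceed NZ/2={nz // 2}")
    return groups


# Settings of the 2D example: np=24, p=2, q=3, NZ=12.
print(number_of_groups(24, 2, 3, 12))
```

For instance, requesting `np=48` with the same `p`, `q` and `NZ` would give 8 subgroups, which violates `i<=NZ/2` and is rejected by the check.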
+
+# st-wslda-1d
+The diagonalization scheme for the 1D code is the same as for the 2D code, with the modification that the submatrices now depend on two wave vectors: $`H(k_{y,i}, k_{z,j})`$.  
+
+In the 1D case we typically need to diagonalize a large number of small matrices of size `matrix_size=2*NX`. For small sizes, of the order of a thousand, the best performance is typically obtained for `p=q=1`, i.e. without block-cyclic distribution of the submatrices. 
\ No newline at end of file