Parallelization scheme of static codes

Authored Feb 20, 2026 by Gabriel Wlazłowski
[[_TOC_]]
The most time-consuming part of the static codes is BdG matrix diagonalization. It is essential to understand how the BdG matrix is decomposed during computation to use `st-wslda` codes efficiently. The decomposition depends on the code's dimensionality.
# st-wslda-3d
## Block-cyclic decomposition of BdG matrix
The code diagonalizes one matrix per iteration of size `matrix_size=2*NX*NY*NZ`. The matrix is decomposed between MPI processes in a block-cyclic (bc) fashion. To understand the concept of bc decomposition, let us suppose that our (artificial) BdG matrix is 8x8 (in practice, this is not possible, since the smallest matrix for a 3D problem is $`2\cdot 2^3=16`$). The code will be executed on `np=6` processes:
```bash
mpirun -np 6 ./st-wslda-3d input.txt
```
Suppose that in the input file we set:
```
# BLACS grid
p 2
q 3
mb 2
nb 2
```
Then the matrix will be distributed in the following fashion:
![bc_decomp-1](uploads/161392e9b3ec35a880a90ae9adbcaae4/bc_decomp-1.png)
whereby colors indicate matrix elements that are handled by different MPI processes (6 different colors). Note that different processes handle different numbers of matrix elements. To learn more about bc decomposition, see [here](http://wlazlowski.fizyka.pw.edu.pl/pdfs/dydaktyka/NTO/meeting-8.pdf).
## Selecting p, q, mb, and nb
By construction, the following constraint must be satisfied: `np=p*q`. When selecting decomposition parameters, the user should take into account:
* Typically, the best performance is achieved for `p=q`; thus, it is recommended to select `p` and `q` to be as close as possible. If `p` and `q` are commented out, then the algorithm will automatically select their values to satisfy this requirement.
*Note*: Typically, the constraint `p=q` cannot be satisfied. We have found empirically that settings with `p<q` yield better performance than those with `p>q`. For example, if we run the code with `np=24`, then we have two options: `(p,q)=(4,6)` or `(p,q)=(6,4)`. Based on our experience, we expect the setting `(p,q)=(4,6)` to yield better performance in the computation process.
* By construction, block sizes must satisfy the following constraints: `mb<=matrix_size/p` and `nb<=matrix_size/q`. In decomposition codes (such as diagonalization), setting block sizes to their maximum values does not perform well. We find that the best performance is obtained if `mb` and `nb` are much smaller than their maximal allowed values, and at the same time, the number of matrix elements in the block `mb*nb` is significant (of the order of a hundred or higher). Empirically, we find that good performance is typically obtained for block sizes of 16, 32, and 64 (powers of 2). We recommend that the user try these values and, based on the results, decide whether further increases or decreases are profitable.
* If `p=q=1`, then the parallelization is not applied. It corresponds to the single CPU version of the code. The code should work for these settings as well (do not expect to be able to solve large problems then).
# st-wslda-2d
In this variant, the code assumes that the quasi-particle wave functions have the form:
where
```math
k_z = 0, \pm 1 \frac{2\pi}{L_z}, \pm 2 \frac{2\pi}{L_z}, \ldots , +\frac{N_z}{2} \frac{2\pi}{L_z}
```
and $`L_z = NZ*DZ`$ is the box length along the z-direction. From the physical point of view, it means that we impose translation symmetry along the z-direction. Under this assumption, the BdG matrix acquires a block-diagonal form:
![HBdG-2d](uploads/9659540e2c865b7a6736e55cec10aa50/HBdG-2d.png)
and diagonalization of the matrix is equivalent to diagonalizations of submatrices $`H(k_{z,i})`$, each of them of size `matrix_size=2*NX*NY`. Moreover, the translation symmetry imposes that $`H(k_{z})=H(-k_{z})`$ and in practice it is sufficient to diagonalize only submatrices for positive $`k_z`$, which takes `NZ/2` values. Submatrices can be diagonalized simultaneously.
To demonstrate the parallelization scheme in the 2D case, let us consider the following lattice:
```c
#define NX 8
#define NY 10
#define NZ 12
```
As in the previous example (3D case), in the input file, we set:
```
# BLACS grid
p 2
q 3
```

and we execute the code with `np=24` processes:
```bash
mpirun -np 24 ./st-wslda-2d input.txt
```
For these settings, the single iteration requires `NZ/2=6` diagonalizations. The total set of processes will be divided into subgroups, each of size `p*q=6`. Thus, the number of subgroups will be `24/6=4`. Each submatrix will be decomposed in the block-cyclic fashion among `p*q` processes as in the 3D case.
This information is provided in the code output:
```
# CODE: ST-WSLDA-2D
# ...
# HAMILTONIAN TOTAL STORAGE: 0.39MB
# CREATING CBLACS GRIDs OF SIZE (pzheev): [2 x 3]
```
Note that here, Hamiltonian size means the size of the submatrix `160=2*8*10`.
The computation process for a single iteration is presented schematically in the figure below:
![pscheme-2d](uploads/4792eecc243913bfd2e18d0fef09c90e/pscheme-2d.png)
... and it is reflected in the code output:

```
# Number of nwf in [-ecut,+ecut] to be extracted is: 976 (50.8% of total number of states)
```
## Selecting p, q, and np
By construction, the following constraint must be satisfied: `np/(p*q) = i` where `i` is an integer number (number of subgroups) and `i<=NZ/2`. If `p` and `q` are not specified in the input file, the code will select them in such a way as to maximize the number of simultaneous diagonalizations. Concerning matrix distribution, all requirements as described for the 3D case apply here as well.
# st-wslda-1d
The diagonalization scheme for the 1D code is the same as for the 2D code, with the modification that now submatrices depend on two wave-vectors $`H(k_{y,i}, k_{z,j})`$.
In the 1D case, we typically need to diagonalize a large number of small matrices of size `matrix_size=2*NX`. For small matrices (of the order of a thousand), the best performance is typically obtained for `p=q=1`, i.e., no block-cyclic distribution of the submatrices.