Configuring GPU machine

Introduction

The td codes require machines equipped with GPUs. The standard scenario assumes that the number of parallel MPI processes equals the number of GPUs. The user must provide the correct prescription that uniquely assigns GPU devices to MPI processes. This step depends on the architecture of the target machine. To set up the correct profile of the target machine, you need to modify machine.h. To print the applied mapping of MPI processes to GPUs to the screen, use:

/**
 * Activate this flag in order to print to stdout
 * the applied mapping mpi-process <==> device-id.
 * */
#define PRINT_GPU_DISTRIBUTION

Machine with uniformly distributed GPU cards of the same type

This is the most common case. Here it is sufficient to use the default settings by commenting out:

/**
 * Activate this flag if the target machine has a non-standard distribution of GPUs. 
 * In such a case you need to provide the body of the function `assign_deviceid_to_mpi_process`.
 * If this flag is commented out, it is assumed that the code is running on a machine 
 * with uniformly distributed GPU cards across the nodes, 
 * and each node has `gpuspernode` (input file parameter) cards.
 * */
// #define CUSTOM_GPU_DISTRIBUTION

and setting, in the input file, the number of GPUs that each node is equipped with:

gpuspernode             1        # number of GPUs per node (resource set), default=1

You need to execute the code with the number of MPI processes equal to the number of GPUs. For example, if each node is equipped with one GPU and you plan to run the code on 512 nodes, you should call it as (schematic notation):

mpirun -n 512 --ntasks-per-node=1 ./td-wslda-3d input.txt
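
For reference, below is a minimal sketch of the logic behind such a uniform mapping. It is only an illustration under the assumption that MPI ranks are placed consecutively on the nodes, not the code's actual implementation; the helper name uniform_deviceid is hypothetical:

#include <mpi.h>

/* Illustration: on a uniform machine with `gpuspernode` GPUs per node and
 * consecutive rank placement, the node-local GPU index of a process is
 * simply its global rank modulo `gpuspernode`. */
static int uniform_deviceid(MPI_Comm comm, int gpuspernode)
{
    int ip;
    MPI_Comm_rank(comm, &ip);   /* global rank of this process          */
    return ip % gpuspernode;    /* device id in range 0..gpuspernode-1  */
}

Analogously, for a machine with 4 GPUs per node you would set gpuspernode to 4 in the input file and launch 4 MPI processes per node, e.g. (schematic notation) mpirun -n 512 --ntasks-per-node=4 ./td-wslda-3d input.txt for 128 nodes.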

Machine with non-uniform distribution of GPUs

In such a case, you need to define the GPU distribution yourself. For example, consider a machine that has 7 nodes, with cards distributed as follows (the content of the file nodes.txt):

node2061.grid4cern.if.pw.edu.pl slots=8
node2062.grid4cern.if.pw.edu.pl slots=8
node2063.grid4cern.if.pw.edu.pl slots=4
node2064.grid4cern.if.pw.edu.pl slots=4
node2065.grid4cern.if.pw.edu.pl slots=4
node2066.grid4cern.if.pw.edu.pl slots=4
node2067.grid4cern.if.pw.edu.pl slots=8

The GPU distribution is then defined in machine.h as follows:

/**
 * Activate this flag if the target machine has a non-standard distribution of GPUs. 
 * In such a case you need to provide the body of the function `assign_deviceid_to_mpi_process`.
 * If this flag is commented out, it is assumed that the code is running on a machine 
 * with uniformly distributed GPU cards across the nodes, 
 * and each node has `gpuspernode` (input file parameter) cards.
 * */
#define CUSTOM_GPU_DISTRIBUTION

/**
 * This function is used to assign a unique device-id to each MPI process.
 * @param comm MPI communicator
 * @return device-id assigned to the process identified by MPI_Comm_rank(...)
 * DO NOT REMOVE THE `#if ...` STATEMENT BELOW !!!
 * */
#if defined(CUSTOM_GPU_DISTRIBUTION) && defined(TDWSLDA_MAIN)
int assign_deviceid_to_mpi_process(MPI_Comm comm)
{
    int np, ip;
    MPI_Comm_size(comm, &np);
    MPI_Comm_rank(comm, &ip);
    
    // assign here the device id to the process with rank ip
    int deviceid=0;
    
    if(ip==0) printf("# CUSTOM GPU DISTRIBUTION FOR MACHINE: DWARF\n");
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);
    // gather each rank's processes-per-node value, then convert the array
    // in place into node-local ranks that serve as device ids
    int *ompi_local_rank;
    ompi_local_rank = (int *)malloc(sizeof(int)*np);
    int ompi_ppn=4;   // default: nodes with slots=4
    if(strcmp (processor_name,"node2061.grid4cern.if.pw.edu.pl")==0) ompi_ppn=8;
    if(strcmp (processor_name,"node2062.grid4cern.if.pw.edu.pl")==0) ompi_ppn=8;
    if(strcmp (processor_name,"node2067.grid4cern.if.pw.edu.pl")==0) ompi_ppn=8;
    MPI_Allgather(&ompi_ppn,1,MPI_INT,ompi_local_rank,1,MPI_INT,comm);
    int ompi_i=0, ompi_j;
    while(ompi_i<np)
    {
        if(ompi_local_rank[ompi_i]==8)
        {
            for(ompi_j=0; ompi_j<8; ompi_j++) ompi_local_rank[ompi_i+ompi_j]=ompi_j;
            ompi_i+=8;
        }
        else
        {
            for(ompi_j=0; ompi_j<4; ompi_j++) ompi_local_rank[ompi_i+ompi_j]=ompi_j;
            ompi_i+=4;
        }
    }

    deviceid=ompi_local_rank[ip];
    free(ompi_local_rank);
    
    return deviceid;
}
#endif
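
To launch the run on such a machine, start one MPI process per GPU slot, i.e. 40 processes in total for the hostfile above (schematic notation, assuming an Open MPI-style --hostfile option):

mpirun -n 40 --hostfile nodes.txt ./td-wslda-3d input.txt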