Topology in SMP

Presenter Notes

Summary:

  • False sharing
  • Distances
  • Placement
  • Impact

Nicolás Wolovick 20180515

Presenter Notes

False sharing

Cache (Coherency) Line Size (CLS)

zx81:~$ cat /sys/devices/system/cpu/cpu*/cache/index*/coherency_line_size
64
...
64
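
The same value can be read from a program. A minimal sketch, assuming glibc's _SC_LEVEL1_DCACHE_LINESIZE sysconf() extension is available (not POSIX; it returns 0 when the value is unknown):

#include <stdio.h>
#include <unistd.h> // sysconf()

int main(void)
{
    // glibc extension: L1 data cache line size in bytes
    long cls = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1d cache line size: %ld bytes\n", cls);
    return 0;
}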

Presenter Notes

Exposing false sharing

#pragma omp parallel sections
{
    #pragma omp section
    {
        for (size_t i=0; i<N/2; ++i) {
            sum[0] += i;
        }
    }
    #pragma omp section
    {
        for (size_t i=N/2; i<N; ++i) {
            sum[OFFSET] += i;
        }
    }
}
  • Vary OFFSET from 1 to 15 and see what happens (a complete, compilable sketch follows below).
  • Always compile with -O0; otherwise it accumulates in registers and the false sharing is gone.
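
A complete, compilable sketch of the experiment. The declarations are not shown in the slide, so N, OFFSET and the type of sum are assumptions: here sum is an array of 8-byte accumulators aligned to a 64-byte cache line, so sum[0] and sum[OFFSET] share a line exactly while OFFSET*8 < 64.

#include <stdio.h>
#include <stddef.h> // size_t

#define N (1L<<28)
#ifndef OFFSET
#define OFFSET 1    // rebuild with -DOFFSET=1..15 and compare run times
#endif

// 16 longs = 128 bytes = two 64-byte cache lines (GCC/Clang alignment attribute)
long sum[16] __attribute__((aligned(64))) = {0};

int main(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            for (size_t i=0; i<N/2; ++i)
                sum[0] += i;
        }
        #pragma omp section
        {
            for (size_t i=N/2; i<N; ++i)
                sum[OFFSET] += i;
        }
    }
    printf("%ld %ld\n", sum[0], sum[OFFSET]);
    return 0;
}

Build with something like gcc -fopenmp -O0 -DOFFSET=1 fs.c (file name hypothetical); the run time should drop once the two accumulators land on different cache lines.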

Presenter Notes

Topology

  • Compute
    • Shared units.
    • Virtual threads (SMT, Hyperthreading®)
  • Memory
    • Cache
    • RAM

Presenter Notes

Simple topologies

Core 2 Duo (Penryn)

  • Nothing complex here ... nothing.

Presenter Notes

Simple topologies

i7 980 (Gulftown, 32 nm shrink of Nehalem)

  • Still nothing complex ... nothing.

Presenter Notes

Two NUMA nodes (nabu)

2 * E5-2680 v2

Presenter Notes

Two NUMA nodes (2 * E5-2620 v3, zx81)

Presenter Notes

Hyperthreading

Note the "Intel" numbering: first the physical cores, then the logical ones.

Presenter Notes

4 NUMA nodes (AMD Bulldozer)

  • Adjacent cores share L1i and L2.
  • In fact, every two Bulldozer cores form a Clustered Multi-Threading (CMT) module: two 256-bit ALUs, one 256-bit FPU.
  • "AMD" numbering: physical-logical-physical-logical-...

Presenter Notes

Bulldozer, compute nodes

Presenter Notes

KNL (7210, Eulogia)

Presenter Notes

Too big, let's navigate it

$ lstopo -i topo_7210_knl.xml

They are 32 modules (tiles) of two cores, each core with 2 VPUs (vector processing units).

Presenter Notes

NUMA

How do we turn a NUMA machine into a UMA one?

Presenter Notes

NUMA on PSG16

A machine with 4 AMD Bulldozer packages. Note the complexity of the topology!

$ numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32765 MB
node 0 free: 2833 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node 1 free: 20462 MB
...
node 7 cpus: 56 57 58 59 60 61 62 63
node 7 size: 32752 MB
node 7 free: 25099 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  22  16  22  16  22
  1:  16  10  22  16  22  16  22  16
  2:  16  22  10  16  16  22  16  22
  3:  22  16  16  10  22  16  22  16
  4:  16  22  16  22  10  16  16  22
  5:  22  16  22  16  16  10  22  16
  6:  16  22  16  22  16  22  10  16
  7:  22  16  22  16  22  16  16  10
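
The same distance matrix can be queried from C through libnuma. A minimal sketch, assuming libnuma is installed (link with -lnuma); numa_distance() returns the values numactl prints, 10 meaning local:

#include <stdio.h>
#include <numa.h> // numa_available(), numa_max_node(), numa_distance()

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }
    int nodes = numa_max_node() + 1;
    for (int i = 0; i < nodes; ++i) {
        for (int j = 0; j < nodes; ++j)
            printf("%4d", numa_distance(i, j));
        printf("\n");
    }
    return 0;
}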

Presenter Notes

Tools

$ numactl --show
$ numactl --hardware
$ numastat
$ numatop
$ tiptop    # while we're at it!

Presenter Notes

Experiments

numa.c

#include <stddef.h> // size_t
#include <stdlib.h> // malloc()
#include <omp.h>

#define N (1L<<34)

int main(int argc, char **argv)
{
    float *a = NULL;
    a = malloc(N*sizeof(float)); // 64 GiB
loop:   a = a; // no-op, just so the label has a statement
    #pragma omp parallel for
    for (size_t i=0L; i<N; ++i)
        a[i] = (float)i;
    goto loop;
}

We monitor how the NUMA nodes fill up.

$ watch -n 1 numactl --hardware
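
Besides watching numactl --hardware, the kernel can be asked on which node each page actually landed. A minimal sketch using the move_pages() wrapper from libnuma (<numaif.h>, link with -lnuma); with nodes == NULL nothing is moved, the status array just reports the node of each page:

#include <stdio.h>
#include <stdlib.h> // posix_memalign(), free()
#include <numaif.h> // move_pages()

#define PAGES 8
#define PAGE  4096

int main(void)
{
    char *a = NULL;
    if (posix_memalign((void **)&a, PAGE, PAGES * PAGE) != 0)
        return 1;
    void *pages[PAGES];
    int status[PAGES];
    for (size_t i = 0; i < PAGES; ++i) {
        a[i*PAGE] = 1;         // first touch: the page gets placed now
        pages[i] = a + i*PAGE; // page-aligned addresses
    }
    // nodes == NULL: query only, status[i] receives the node of page i
    if (move_pages(0, PAGES, pages, NULL, status, 0) == 0)
        for (size_t i = 0; i < PAGES; ++i)
            printf("page %zu -> node %d\n", i, status[i]);
    free(a);
    return 0;
}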

Presenter Notes

Experiments

numa-trash.c

#include <stddef.h> // size_t
#include <stdlib.h> // malloc()
#include <omp.h>

#define N (1L<<34)

int main(int argc, char **argv)
{
    float *a = NULL;
    a = malloc(N*sizeof(float)); // 64 GiB
    for (size_t i=0L; i<N; i+=4096/sizeof(float))
        a[i] = 1.0f; // touch one float per page, single-threaded
loop:   a = a; // no-op, just so the label has a statement
    #pragma omp parallel for
    for (size_t i=0L; i<N; ++i)
        a[i] = (float)i;
    goto loop;
}

Same as before, but now we see that something balances it automagically.

$ watch -n 1 numactl --hardware

Presenter Notes

NUMA balancing

root@zx81:~# dmesg | grep -i NUMA
[    0.000000] NUMA: Initialized distance table, cnt=2
[    0.000000] NUMA: Node 0 [mem 0x00000000-0x7fffffff] + [mem 0x100000000-0x107fffffff] -> [mem 0x00000000-0x107fffffff]
[    0.000000] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[    0.248548] pci_bus 0000:00: on NUMA node 0
[    0.251704] pci_bus 0000:80: on NUMA node 1
root@zx81:~# sysctl kernel.numa_balancing
kernel.numa_balancing = 1

Presenter Notes

Pinning, moving

We can pin it to cores and to NUMA nodes.

$ numactl --cpunodebind=1 --membind=1 ./a.out

We can also ask it to change the allocation policy.

$ numactl --interleave=all ./a.out

With interleave we solve the numa-trash.c problem without using NUMA balancing.
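
The interleave policy can also be requested from inside the program through libnuma. A minimal sketch, assuming libnuma is available (compile with -fopenmp, link with -lnuma); numa_alloc_interleaved() spreads the pages of the allocation round-robin over all nodes, so a single-threaded first touch no longer concentrates the memory on one node:

#include <stdio.h>
#include <stddef.h> // size_t
#include <numa.h>   // numa_available(), numa_alloc_interleaved(), numa_free()

#define N (1L<<34)

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }
    float *a = numa_alloc_interleaved(N * sizeof(float)); // 64 GiB, pages interleaved
    #pragma omp parallel for
    for (size_t i = 0L; i < N; ++i)
        a[i] = (float)i;
    numa_free(a, N * sizeof(float));
    return 0;
}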

Presenter Notes

In OpenMP

GOMP_CPU_AFFINITY

$ GOMP_CPU_AFFINITY="0 3 1-2 4-15:2" ./a.out

OMP_PLACES

threads, cores, sockets

$ OMP_PLACES=sockets ./a.out

OMP_PROC_BIND

true, master, close, spread

$ OMP_PROC_BIND=spread ./a.out

OMP_MAX_ACTIVE_LEVELS

$ OMP_NESTED=1 OMP_MAX_ACTIVE_LEVELS=2 ./a.out
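
To check where the threads actually ended up, the mapping can be printed from inside the program. A minimal sketch, assuming OpenMP >= 4.5 (for omp_get_place_num()) and glibc's sched_getcpu():

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h> // sched_getcpu()
#include <omp.h>

int main(void)
{
    printf("places: %d\n", omp_get_num_places());
    #pragma omp parallel
    {
        // each thread reports the place it is bound to and the CPU it is running on
        printf("thread %2d -> place %2d, cpu %3d\n",
               omp_get_thread_num(), omp_get_place_num(), sched_getcpu());
    }
    return 0;
}

Run it under different OMP_PLACES / OMP_PROC_BIND combinations to see close vs. spread.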

Presenter Notes

Gráficamente Close

Presenter Notes

Graphically: ?

Presenter Notes

Graphically: Spread

Presenter Notes

Example

To run on KNL

OMP_NESTED=1 OMP_MAX_ACTIVE_LEVELS=2 OMP_NUM_THREADS=256 \
OMP_PROC_BIND=spread,spread OMP_PLACES=cores ./time_dgemm_icc_mkl 8192 8192 8192

Presenter Notes

Alternatives

Presenter Notes

On GPUs

@dgxpascal:~$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_2  mlx5_1  mlx5_3  CPU Affinity
GPU0     X      NV1     NV1     NV1     NV1     SOC     SOC     SOC     PIX     SOC     PHB     SOC     0-19,40-59
GPU1    NV1      X      NV1     NV1     SOC     NV1     SOC     SOC     PIX     SOC     PHB     SOC     0-19,40-59
GPU2    NV1     NV1      X      NV1     SOC     SOC     NV1     SOC     PHB     SOC     PIX     SOC     0-19,40-59
GPU3    NV1     NV1     NV1      X      SOC     SOC     SOC     NV1     PHB     SOC     PIX     SOC     0-19,40-59
GPU4    NV1     SOC     SOC     SOC      X      NV1     NV1     NV1     SOC     PIX     SOC     PHB     20-39,60-79
GPU5    SOC     NV1     SOC     SOC     NV1      X      NV1     NV1     SOC     PIX     SOC     PHB     20-39,60-79
GPU6    SOC     SOC     NV1     SOC     NV1     NV1      X      NV1     SOC     PHB     SOC     PIX     20-39,60-79
GPU7    SOC     SOC     SOC     NV1     NV1     NV1     NV1      X      SOC     PHB     SOC     PIX     20-39,60-79
mlx5_0  PIX     PIX     PHB     PHB     SOC     SOC     SOC     SOC      X      SOC     PHB     SOC
mlx5_2  SOC     SOC     SOC     SOC     PIX     PIX     PHB     PHB     SOC      X      SOC     PHB
mlx5_1  PHB     PHB     PIX     PIX     SOC     SOC     SOC     SOC     PHB     SOC      X      SOC
mlx5_3  SOC     SOC     SOC     SOC     PHB     PHB     PIX     PIX     SOC     PHB     SOC      X

Legend:
X    = Self
SOC  = Connection traversing PCIe as well as the SMP link between CPU sockets (e.g. QPI)
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX  = Connection traversing a single PCIe switch
NV#  = Connection traversing a bonded set of # NVLinks
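
The NV# entries are what makes direct GPU-to-GPU traffic fast. Whether two devices can talk peer-to-peer can be checked with the CUDA runtime API; a minimal sketch in plain C (link with -lcudart):

#include <stdio.h>
#include <cuda_runtime.h> // cudaGetDeviceCount(), cudaDeviceCanAccessPeer()

int main(void)
{
    int n = 0;
    cudaGetDeviceCount(&n);
    // n x n matrix: 1 if GPU i can access GPU j's memory directly
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            int can = 0;
            if (i != j)
                cudaDeviceCanAccessPeer(&can, i, j);
            printf("%2d", can);
        }
        printf("\n");
    }
    return 0;
}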

Presenter Notes

On GPUs

Presenter Notes

Bibliography

Presenter Notes

Presenter Notes

Next class

  • Scaling (or not)

Presenter Notes