Topology in SMP

Presenter Notes

Summary:

  • False sharing
  • Distances
  • Placement
  • Impact

Nicolás Wolovick 20180515

Presenter Notes

False sharing

Cache (Coherency) Line Size (CLS)

zx81:~$ cat /sys/devices/system/cpu/cpu*/cache/index*/coherency_line_size
64
...
64
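
The same value can be read from a program. A minimal sketch, assuming glibc's _SC_LEVEL1_DCACHE_LINESIZE sysconf() extension is available (not POSIX; it returns 0 when the value is unknown):

#include <stdio.h>
#include <unistd.h> // sysconf()

int main(void)
{
    // glibc extension: L1 data cache line size in bytes
    long cls = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1d cache line size: %ld bytes\n", cls);
    return 0;
}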

Presenter Notes

Exposing false sharing

#pragma omp parallel sections
{
    #pragma omp section
    {
        for (size_t i=0; i<N/2; ++i) {
            sum[0] += i;
        }
    }
    #pragma omp section
    {
        for (size_t i=N/2; i<N; ++i) {
            sum[OFFSET] += i;
        }
    }
}
  • Vary OFFSET from 1 to 15 and see what happens (a complete, compilable sketch follows below).
  • Always compile with -O0; otherwise it accumulates in registers and the false sharing is gone.
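
A complete, compilable sketch of the experiment. The declarations are not shown in the slide, so N, OFFSET and the type of sum are assumptions: here sum is an array of 8-byte accumulators aligned to a 64-byte cache line, so sum[0] and sum[OFFSET] share a line exactly while OFFSET*8 < 64.

#include <stdio.h>
#include <stddef.h> // size_t

#define N (1L<<28)
#ifndef OFFSET
#define OFFSET 1    // rebuild with -DOFFSET=1..15 and compare run times
#endif

// 16 longs = 128 bytes = two 64-byte cache lines (GCC/Clang alignment attribute)
long sum[16] __attribute__((aligned(64))) = {0};

int main(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            for (size_t i=0; i<N/2; ++i)
                sum[0] += i;
        }
        #pragma omp section
        {
            for (size_t i=N/2; i<N; ++i)
                sum[OFFSET] += i;
        }
    }
    printf("%ld %ld\n", sum[0], sum[OFFSET]);
    return 0;
}

Build with something like gcc -fopenmp -O0 -DOFFSET=1 fs.c (file name hypothetical); the run time should drop once the two accumulators land on different cache lines.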

Presenter Notes

Topology

  • Compute
    • Shared units.
    • Virtual threads (SMT, Hyperthreading®)
  • Memory
    • Cache
    • RAM

Presenter Notes

Simple topologies

Core 2 Duo (Penryn)

  • Nothing complex here ... nothing.

Presenter Notes

Simple topologies

i7 980 (Gulftown, 32 nm shrink of Nehalem)

  • Still nothing complex ... nothing.

Presenter Notes

Two NUMA nodes (nabu)

2 * E5-2680 v2

Presenter Notes

Two NUMA nodes (2 * E5-2620 v3, zx81)

Presenter Notes

Hyperthreading

Note the "Intel" numbering: first the physical cores, then the logical ones.

Presenter Notes

4 NUMA nodes (AMD Bulldozer)

  • Adjacent cores share L1i and L2.
  • In fact, every two Bulldozer cores form a Clustered Multi-Threading (CMT) module: two 256-bit ALUs, one 256-bit FPU.
  • "AMD" numbering: physical-logical-physical-logical-...

Presenter Notes

Bulldozer, compute nodes

Presenter Notes

KNL (7210, Eulogia)

Presenter Notes

Too big, let's navigate it

$ lstopo -i topo_7210_knl.xml

They are 32 modules (tiles) of two cores, each core with 2 VPUs (vector processing units).

Presenter Notes

NUMA

How do we turn a NUMA machine into a UMA one?

Presenter Notes

NUMA on PSG16

A machine with 4 AMD Bulldozer packages. Note the complexity of the topology!

$ numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32765 MB
node 0 free: 2833 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node 1 free: 20462 MB
...
node 7 cpus: 56 57 58 59 60 61 62 63
node 7 size: 32752 MB
node 7 free: 25099 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  22  16  22  16  22
  1:  16  10  22  16  22  16  22  16
  2:  16  22  10  16  16  22  16  22
  3:  22  16  16  10  22  16  22  16
  4:  16  22  16  22  10  16  16  22
  5:  22  16  22  16  16  10  22  16
  6:  16  22  16  22  16  22  10  16
  7:  22  16  22  16  22  16  16  10
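
The same distance matrix can be queried from C through libnuma. A minimal sketch, assuming libnuma is installed (link with -lnuma); numa_distance() returns the values numactl prints, 10 meaning local:

#include <stdio.h>
#include <numa.h> // numa_available(), numa_max_node(), numa_distance()

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }
    int nodes = numa_max_node() + 1;
    for (int i = 0; i < nodes; ++i) {
        for (int j = 0; j < nodes; ++j)
            printf("%4d", numa_distance(i, j));
        printf("\n");
    }
    return 0;
}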

Presenter Notes

Tools

$ numactl --show
$ numactl --hardware
$ numastat
$ numatop
$ tiptop    # while we're at it!

Presenter Notes

Experiments

numa.c

#include <stddef.h> // size_t
#include <stdlib.h> // malloc()
#include <omp.h>

#define N (1L<<34)

int main(int argc, char **argv)
{
    float *a = NULL;
    a = malloc(N*sizeof(float)); // 64 GiB
loop:   a = a; // no-op, just so the label has a statement
    #pragma omp parallel for
    for (size_t i=0L; i<N; ++i)
        a[i] = (float)i;
    goto loop;
}

We monitor how the NUMA nodes fill up.

$ watch -n 1 numactl --hardware
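
Besides watching numactl --hardware, the kernel can be asked on which node each page actually landed. A minimal sketch using the move_pages() wrapper from libnuma (<numaif.h>, link with -lnuma); with nodes == NULL nothing is moved, the status array just reports the node of each page:

#include <stdio.h>
#include <stdlib.h> // posix_memalign(), free()
#include <numaif.h> // move_pages()

#define PAGES 8
#define PAGE  4096

int main(void)
{
    char *a = NULL;
    if (posix_memalign((void **)&a, PAGE, PAGES * PAGE) != 0)
        return 1;
    void *pages[PAGES];
    int status[PAGES];
    for (size_t i = 0; i < PAGES; ++i) {
        a[i*PAGE] = 1;         // first touch: the page gets placed now
        pages[i] = a + i*PAGE; // page-aligned addresses
    }
    // nodes == NULL: query only, status[i] receives the node of page i
    if (move_pages(0, PAGES, pages, NULL, status, 0) == 0)
        for (size_t i = 0; i < PAGES; ++i)
            printf("page %zu -> node %d\n", i, status[i]);
    free(a);
    return 0;
}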

Presenter Notes

Experiments

numa-trash.c

#include <stddef.h> // size_t
#include <stdlib.h> // malloc()
#include <omp.h>

#define N (1L<<34)

int main(int argc, char **argv)
{
    float *a = NULL;
    a = malloc(N*sizeof(float)); // 64 GiB
    for (size_t i=0L; i<N; i+=4096/sizeof(float))
        a[i] = 1.0f; // touch one float per page, single-threaded
loop:   a = a; // no-op, just so the label has a statement
    #pragma omp parallel for
    for (size_t i=0L; i<N; ++i)
        a[i] = (float)i;
    goto loop;
}

Same as before, but now we see that something balances it automagically.

$ watch -n 1 numactl --hardware

Presenter Notes

NUMA balancing

root@zx81:~# dmesg | grep -i NUMA
[    0.000000] NUMA: Initialized distance table, cnt=2
[    0.000000] NUMA: Node 0 [mem 0x00000000-0x7fffffff] + [mem 0x100000000-0x107fffffff] -> [mem 0x00000000-0x107fffffff]
[    0.000000] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[    0.248548] pci_bus 0000:00: on NUMA node 0
[    0.251704] pci_bus 0000:80: on NUMA node 1
root@zx81:~# sysctl kernel.numa_balancing
kernel.numa_balancing = 1

Presenter Notes

Pinning, moving

We can pin it to cores and to NUMA nodes.

$ numactl --cpunodebind=1 --membind=1 ./a.out

We can also ask it to change the allocation policy.

$ numactl --interleave=all ./a.out

With interleave we solve the numa-trash.c problem without using NUMA balancing.
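
The interleave policy can also be requested from inside the program through libnuma. A minimal sketch, assuming libnuma is available (compile with -fopenmp, link with -lnuma); numa_alloc_interleaved() spreads the pages of the allocation round-robin over all nodes, so a single-threaded first touch no longer concentrates the memory on one node:

#include <stdio.h>
#include <stddef.h> // size_t
#include <numa.h>   // numa_available(), numa_alloc_interleaved(), numa_free()

#define N (1L<<34)

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }
    float *a = numa_alloc_interleaved(N * sizeof(float)); // 64 GiB, pages interleaved
    #pragma omp parallel for
    for (size_t i = 0L; i < N; ++i)
        a[i] = (float)i;
    numa_free(a, N * sizeof(float));
    return 0;
}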

Presenter Notes

In OpenMP

GOMP_CPU_AFFINITY

$ GOMP_CPU_AFFINITY="0 3 1-2 4-15:2" ./a.out

OMP_PLACES

threads, cores, sockets

$ OMP_PLACES=sockets ./a.out

OMP_PROC_BIND

true, master, close, spread

$ OMP_PROC_BIND=spread ./a.out

OMP_MAX_ACTIVE_LEVELS

$ OMP_NESTED=1 OMP_MAX_ACTIVE_LEVELS=2 ./a.out
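
To check where the threads actually ended up, the mapping can be printed from inside the program. A minimal sketch, assuming OpenMP >= 4.5 (for omp_get_place_num()) and glibc's sched_getcpu():

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h> // sched_getcpu()
#include <omp.h>

int main(void)
{
    printf("places: %d\n", omp_get_num_places());
    #pragma omp parallel
    {
        // each thread reports the place it is bound to and the CPU it is running on
        printf("thread %2d -> place %2d, cpu %3d\n",
               omp_get_thread_num(), omp_get_place_num(), sched_getcpu());
    }
    return 0;
}

Run it under different OMP_PLACES / OMP_PROC_BIND combinations to see close vs. spread.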

Presenter Notes

Gráficamente Close

Presenter Notes

Graphically: ?

Presenter Notes

Graphically: Spread

Presenter Notes

Example

To run on KNL

OMP_NESTED=1 OMP_MAX_ACTIVE_LEVELS=2 OMP_NUM_THREADS=256 \
OMP_PROC_BIND=spread,spread OMP_PLACES=cores ./time_dgemm_icc_mkl 8192 8192 8192

Presenter Notes

Alternatives

Presenter Notes

On GPUs

@dgxpascal:~$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_2  mlx5_1  mlx5_3  CPU Affinity
GPU0     X      NV1     NV1     NV1     NV1     SOC     SOC     SOC     PIX     SOC     PHB     SOC     0-19,40-59
GPU1    NV1      X      NV1     NV1     SOC     NV1     SOC     SOC     PIX     SOC     PHB     SOC     0-19,40-59
GPU2    NV1     NV1      X      NV1     SOC     SOC     NV1     SOC     PHB     SOC     PIX     SOC     0-19,40-59
GPU3    NV1     NV1     NV1      X      SOC     SOC     SOC     NV1     PHB     SOC     PIX     SOC     0-19,40-59
GPU4    NV1     SOC     SOC     SOC      X      NV1     NV1     NV1     SOC     PIX     SOC     PHB     20-39,60-79
GPU5    SOC     NV1     SOC     SOC     NV1      X      NV1     NV1     SOC     PIX     SOC     PHB     20-39,60-79
GPU6    SOC     SOC     NV1     SOC     NV1     NV1      X      NV1     SOC     PHB     SOC     PIX     20-39,60-79
GPU7    SOC     SOC     SOC     NV1     NV1     NV1     NV1      X      SOC     PHB     SOC     PIX     20-39,60-79
mlx5_0  PIX     PIX     PHB     PHB     SOC     SOC     SOC     SOC      X      SOC     PHB     SOC
mlx5_2  SOC     SOC     SOC     SOC     PIX     PIX     PHB     PHB     SOC      X      SOC     PHB
mlx5_1  PHB     PHB     PIX     PIX     SOC     SOC     SOC     SOC     PHB     SOC      X      SOC
mlx5_3  SOC     SOC     SOC     SOC     PHB     PHB     PIX     PIX     SOC     PHB     SOC      X

Legend:
X    = Self
SOC  = Connection traversing PCIe as well as the SMP link between CPU sockets (e.g. QPI)
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX  = Connection traversing a single PCIe switch
NV#  = Connection traversing a bonded set of # NVLinks
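
The NV# entries are what makes direct GPU-to-GPU traffic fast. Whether two devices can talk peer-to-peer can be checked with the CUDA runtime API; a minimal sketch in plain C (link with -lcudart):

#include <stdio.h>
#include <cuda_runtime.h> // cudaGetDeviceCount(), cudaDeviceCanAccessPeer()

int main(void)
{
    int n = 0;
    cudaGetDeviceCount(&n);
    // n x n matrix: 1 if GPU i can access GPU j's memory directly
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            int can = 0;
            if (i != j)
                cudaDeviceCanAccessPeer(&can, i, j);
            printf("%2d", can);
        }
        printf("\n");
    }
    return 0;
}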

Presenter Notes

On GPUs

Presenter Notes

Bibliography

Presenter Notes

Presenter Notes

Next class

  • Scaling (or not)

Presenter Notes