Affinity

Presenter Notes

Resumen:

algo

Presenter Notes

Placement

mapeo de hilos a núcleos físicos

No todos los núcleos son iguales

Diferente distancia a de comunicación entre ellos.
- Nodos NUMA.
- LLC compartida (típicamente L3).
- SMT.
Diferente velocidad.
- Defectos de fabricación.
- Núcleos lentos vs. rápidos (diferente presupuesto de potencia, aka BIG.litte-alike).

Afinidad

Presenter Notes

Simétrico, no hay afinidad alguna

No SMT, No NUMA, No LLC entre algunos núcleos.

Da lo mismo poner los hilos en cualquier lugar.

¡Ojo! Esto no implica que no programe locality-aware, es decir pirámide registros-L1-L2-DRAM para que la localidad sea máxima.

Presenter Notes

Afinidad por NUMA, 2sock

LLC y NUMA nodes coinciden en afinidad, asi que hay 1 nivel de placement y 2 posibilidades

NUMA node 0
NUMA node 1

Prog-wise tengo que tener en cuenta la NUMA distance al implementar.

Presenter Notes

Afinidad por SMT, Core i5 HT

Con numeración Intel, cores 0..N-1 son reales, N..2N-1 son virtuales.

¡Poner parejas {0,N}, {1,N+1}, ..., {N-1, 2N-1} de hilos que compartan el estado microaquitectural, la L1 y la L2!

{{0,2},{1,3}}

Presenter Notes

2 niveles de Afinidad, KNL

SMT, LLC.

{{{0,64,128,192},{1,65,129,193}},..., {{{62,126,190,254},{63,127,191,255}}}

Presenter Notes

3 niveles de Afinidad, 4sock, Bulldozer

SMT, NUMA.

Tengo que poder agrupar hilos de a pares que compartan L1i y L2.
Tengo que poder agrupar pares de hilos para que compartan L3 y NUMA node nivel 1.
Tengo que poder agrupar hilos que compartan NUMA node nivel 2.

{{{0,1},{2,3},{4,5},{6,7}}, ..., {{24,25},{26,27},{28,29},{30,31}}}

Presenter Notes

3 niveles de Afinidad, 2sock EPYC Rome

Presenter Notes

3 niveles de Afinidad, 2sock EPYC Rome

Presenter Notes

3 niveles de Afinidad, 2sock EPYC Rome

NUMA, CCD=CCX(LLC), SMT.

SUSE, Optimizing Linux for AMD EPYC™ 7002 Series Processors with SUSE Linux Enterprise 15 SP1, 2019

{ { {{0,128},{1,129},{2,130},{3,131}}, ..., {{60,188},{61,189},{62,190},{63,191}} }, { {{64,192},{65,193},{66,194},{67,195}}, ..., {{124,252},{125,253},{126,254},{126,255}} } }

Presenter Notes

4 niveles de Afinidad, 2sock EPYC Rome

Creo que si tengo 2 CCX por CCD (64 core EPYC), tengo un nivel intermedio más.

¡En EPYC1, eran 4 niveles tb!

Presenter Notes

Placement & OS

Presenter Notes

Hilos danzantes

El SO tiene que asegurar buen servicio a todos los procesos y esto implica decisiones raras como CPU Migrations.

 1 #include <stddef.h> // size_t
 2 #include <stdlib.h> // malloc()
 3 #include <omp.h>
 4 
 5 #define N (1L<<34)
 6 
 7 int main(int argc, char ** argv)
 8 {
 9     float *a = NULL;
10     a = malloc(N*sizeof(float)); // 64 GiB
11 loop:   a = a;
12     #pragma omp parallel for
13     for (size_t i=0L; i<N; ++i)
14         a[i]=(float)i;
15     goto loop;
16 }

"A mover el bote, el bote ..."

1 OMP_NUM_THREADS=14 perf stat ./a.out &
2 htop

Presenter Notes

NUMA en Ambroggio Racing

Máquina de 4 pastillas AMD Bulldozer. ¡Notar la complejidad de la topología!

 1 $ numactl --hardware
 2 available: 8 nodes (0-7)
 3 node 0 cpus: 0 1 2 3 4 5 6 7
 4 node 0 size: 32765 MB
 5 node 0 free: 2833 MB
 6 node 1 cpus: 8 9 10 11 12 13 14 15
 7 node 1 size: 32768 MB
 8 node 1 free: 20462 MB
 9 ...
10 node 7 cpus: 56 57 58 59 60 61 62 63
11 node 7 size: 32752 MB
12 node 7 free: 25099 MB
13 node distances:
14 node   0   1   2   3   4   5   6   7·
15   0:  10  16  16  22  16  22  16  22·
16   1:  16  10  22  16  22  16  22  16·
17   2:  16  22  10  16  16  22  16  22·
18   3:  22  16  16  10  22  16  22  16·
19   4:  16  22  16  22  10  16  16  22·
20   5:  22  16  22  16  16  10  22  16·
21   6:  16  22  16  22  16  22  10  16·
22   7:  22  16  22  16  22  16  16  10·

Presenter Notes

Herramientas

1 $ numactl --show
2 $ numactl --hardware
3 $ numastat
4 $ numatop
5 $ tiptop    # ya que estamos!
6 $ lstopo --distances
7 $ lstopo --merge

Presenter Notes

Experimentos

`numa.c`

 1 #include <stddef.h> // size_t
 2 #include <stdlib.h> // malloc()
 3 #include <omp.h>
 4 
 5 #define N (1L<<34)
 6 
 7 int main(int argc, char ** argv)
 8 {
 9     float *a = NULL;
10     a = malloc(N*sizeof(float)); // 64 GiB
11 loop:   a = a;
12     #pragma omp parallel for
13     for (size_t i=0L; i<N; ++i)
14         a[i]=(float)i;
15     goto loop;
16 }

Monitoreamos como van ocupando los nodos NUMA.

1 $ watch -n 1 numactl --hardware
2 $ numatop

Presenter Notes

Experimentos

`numa-trash.c`

 1 #include <stddef.h> // size_t
 2 #include <stdlib.h> // malloc()
 3 #include <omp.h>
 4 
 5 #define N (1L<<34)
 6 
 7 int main(int argc, char ** argv)
 8 {
 9     float *a = NULL;
10     a = malloc(N*sizeof(float));  // 64 GiB
11     for (size_t i=0L; i<N; i+=4096/sizeof(float))
12         a[i] = 1.0f; // toco un float por página en single-thread
13 loop:   a = a; //skip
14     #pragma omp parallel for
15     for (size_t i=0L; i<N; ++i)
16         a[i]=(float)i;
17     goto loop;
18 }

Idem, pero vemos que ahora hay algo que balancea automágicamente.

1 $ watch -n 1 numactl --hardware
2 $ numatop

Presenter Notes

NUMA balancing

1 root@zx81:~# dmesg | grep -i NUMA
2 [    0.000000] NUMA: Initialized distance table, cnt=2
3 [    0.000000] NUMA: Node 0 [mem 0x00000000-0x7fffffff] + [mem 0x100000000-0x107fffffff] -> [mem 0x00000000-0x107fffffff]
4 [    0.000000] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
5 [    0.248548] pci_bus 0000:00: on NUMA node 0
6 [    0.251704] pci_bus 0000:80: on NUMA node 1
7 root@zx81:~# sysctl kernel.numa_balancing
8 kernel.numa_balancing = 1

Circa 2012, kernel ~3.8

Presenter Notes

Pinning, moving

Puedo agarrarla a cores y a NUMA nodes.

1 $ numactl --cpunodebind=1 --membind=1 ./a.out

Puedo pedirle que cambie la política

1 $ numactl --interleave=all ./a.out

Con el interleave soluciono el problema de numa-trash.c sin usar NUMA balancing.

También con `taskset`

1 $ taskset 0x00000FFF ./a.out

Donde 0x00000FFF es 0000 0000 0000 0000 0000 1111 1111 1111

Máscara binaria que habilita solo los cores del 0 a 11.

Presenter Notes

OMP Affinity

Presenter Notes

OMP es muy usado

Para mostrar todas las variables de entorno que el runtime de OpenMP usa antes de lanzar los hilos:

1 export OMP_DISPLAY_ENV=true

Ejecutamos comandos comunes como

Imagemagick y su super convert.
FFmpeg de Fabrice.
mplayer.

Y vamos a ver que son OMP-enabled.

¿Cómo controlamos afinidad de esta lingua franca del TLP?

Presenter Notes

OpenMP Places

Donde va a ejecutar los hilos.

Correr en los 4 primeros núcleos impares

1 OMP_PLACES={1,3,5,7} ./a.out

Correr en los 4 primeros núcleos pares {init:count:step}.

1 OMP_PLACES={0:4:2} ./a.out

Ojo, esto limita donde se corren los hilos, pero no cuantos se lanzan!
En zx81 lanzará 28 hilos corriendo en 4 cores! Oversubscription.

Esto si asegura que no haya process migration.

Presenter Notes

Lugares conocidos para `OMP_PLACES`

sockets: pastillas físicas.
cores: núcleos reales.
threads: núcleos virtuales, aka SMT.

Veamos que significa en zx81

1 OMP_PLACES=sockets
2   OMP_PLACES = '{0:14},{14:14}'
3 OMP_PLACES=cores
4   OMP_PLACES = '{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15},{16},{17},{18},{19},{20},{21},{22},{23},{24},{25},{26},{27}'
5 OMP_PLACES=threads
6   OMP_PLACES = '{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15},{16},{17},{18},{19},{20},{21},{22},{23},{24},{25},{26},{27}'

Ahora con HT

1 OMP_PLACES=sockets
2   OMP_PLACES = '{0:14,28:14},{14:14,42:14}'
3 OMP_PLACES=cores
4   OMP_PLACES = '{0,28},{1,29},{2,30},{3,31},{4,32},{5,33},{6,34},{7,35},{8,36},{9,37},{10,38},{11,39},{12,40},{13,41},{14,42},{15,43},{16,44},{17,45},{18,46},{19,47},{20,48},{21,49},{22,50},{23,51},{24,52},{25,53},{26,54},{27,55}'
5 OMP_PLACES=threads
6   OMP_PLACES = '{0},{28},{1},{29},{2},{30},{3},{31},{4},{32},{5},{33},{6},{34},{7},{35},{8},{36},{9},{37},{10},{38},{11},{39},{12},{40},{13},{41},{14},{42},{15},{43},{16},{44},{17},{45},{18},{46},{19},{47},{20},{48},{21},{49},{22},{50},{23},{51},{24},{52},{25},{53},{26},{54},{27},{55}'

for p in sockets cores threads; do echo OMP_PLACES=$p; OMP_DISPLAY_ENV=true OMP_PLACES=$p ./a.out 2>&1 | grep PLACES; done

Presenter Notes

OpenMP Affinity Policy

Como se distribuyen los trabajos en los PLACES.

Activar la política y fijar hilos a núcleos.

1 OMP_PROC_BIND=true ./a.out

Políticas:

close: mantenerlas cerca.
spread: dispersarlas.
master: en el mismo lugar donde está el hilo maestro.

Presenter Notes

Niveles

`OMP_MAX_ACTIVE_LEVELS`

1 $ OMP_NESTED=1 OMP_MAX_ACTIVE_LEVELS=2 ./a.out

Victor Eijkhout, Parallel Programming in MPI and OpenMP, 2018.
R. van der Pas, E. Stotzer, C. Terboven, Using OpenMP-The Next Step : Affinity, Accelerators, Tasking, and SIMD, MIT Press, 2017.

Presenter Notes

Gráficamente Close

Presenter Notes

Gráficamente Spread

Presenter Notes

Intel Compiler y MKL

Usa extensiones de variables de entorno KMP_*

Define cosas más finas.

Presenter Notes

Ejemplo

Para ejecutar en KNL

1 OMP_NESTED=1 OMP_MAX_ACTIVE_LEVELS=2 OMP_NUM_THREADS=256 \
2 OMP_PROC_BIND=spread,spread OMP_PLACES=cores ./time_dgemm_icc_mkl 8192 8192 8192;

Presenter Notes

Alternativas

cpusets.
cgroups.

Presenter Notes

En GPU

 1 @dgxpascal:~$ nvidia-smi topo -m
 2         GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_2  mlx5_1  mlx5_3  CPU Affinity
 3 GPU0     X      NV1     NV1     NV1     NV1     SOC     SOC     SOC     PIX     SOC     PHB     SOC     0-19,40-59
 4 GPU1    NV1      X      NV1     NV1     SOC     NV1     SOC     SOC     PIX     SOC     PHB     SOC     0-19,40-59
 5 GPU2    NV1     NV1      X      NV1     SOC     SOC     NV1     SOC     PHB     SOC     PIX     SOC     0-19,40-59
 6 GPU3    NV1     NV1     NV1      X      SOC     SOC     SOC     NV1     PHB     SOC     PIX     SOC     0-19,40-59
 7 GPU4    NV1     SOC     SOC     SOC      X      NV1     NV1     NV1     SOC     PIX     SOC     PHB     20-39,60-79
 8 GPU5    SOC     NV1     SOC     SOC     NV1      X      NV1     NV1     SOC     PIX     SOC     PHB     20-39,60-79
 9 GPU6    SOC     SOC     NV1     SOC     NV1     NV1      X      NV1     SOC     PHB     SOC     PIX     20-39,60-79
10 GPU7    SOC     SOC     SOC     NV1     NV1     NV1     NV1      X      SOC     PHB     SOC     PIX     20-39,60-79
11 mlx5_0  PIX     PIX     PHB     PHB     SOC     SOC     SOC     SOC      X      SOC     PHB     SOC     
12 mlx5_2  SOC     SOC     SOC     SOC     PIX     PIX     PHB     PHB     SOC      X      SOC     PHB     
13 mlx5_1  PHB     PHB     PIX     PIX     SOC     SOC     SOC     SOC     PHB     SOC      X      SOC     
14 mlx5_3  SOC     SOC     SOC     SOC     PHB     PHB     PIX     PIX     SOC     PHB     SOC      X
15 
16 Legend:
17 X   = Self
18 SOC  = Connection traversing PCIe as well as the SMP link between CPU sockets(e.g. QPI)
19 PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
20 PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
21 PIX  = Connection traversing a single PCIe switch
22 NV#  = Connection traversing a bonded set of # NVLinks

Mark Harris, NVIDIA DGX-1: The Fastest Deep Learning System, 2017.

Presenter Notes

En GPU

Presenter Notes

Bibliografía

Presenter Notes

Bibliografía

Keld Helsgaun, How to Get Good Performance by Using OpenMP, 2010.
Gregg S., Process and Thread Affinity for Intel® Xeon Phi™ Processors, 2016.
Victor Eijkhout, Parallel Programming in MPI and OpenMP, 2018.
Christoph Lameter, NUMA (Non-Uniform Memory Access): An Overview, ACM Queue, 2013.
R. van der Pas, E. Stotzer, C. Terboven, Using OpenMP-The Next Step : Affinity, Accelerators, Tasking, and SIMD, MIT Press, 2017.

Presenter Notes

La clase que viene

Scaling (or not)

Table of Contents	t
Exposé	ESC
Full screen slides	e
Presenter View	p
Source Files	s
Slide Numbers	n
Toggle screen blanking	b
Show/hide slide context	c
Notes	2
Help	h

No todos los núcleos son iguales

Afinidad

numa.c

numa-trash.c

También con taskset

¿Cómo controlamos afinidad de esta lingua franca del TLP?

OMP_MAX_ACTIVE_LEVELS

Table of Contents

Help

`numa.c`

`numa-trash.c`

También con `taskset`

`OMP_MAX_ACTIVE_LEVELS`