Nicolás Wolovick 20180515
Cache (Coherency) Line Size (CLS)
zx81:~$ cat /sys/devices/system/cpu/cpu*/cache/index*/coherency_line_size
64
...
64
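The line size can also be queried from a program. A minimal sketch (not in the original slides), assuming glibc, whose sysconf() accepts the nonstandard _SC_LEVEL1_DCACHE_LINESIZE name:

#include <stdio.h>
#include <unistd.h> // sysconf()

int main(void)
{
    // glibc extension: L1 data cache line size in bytes (may be 0 if unknown)
    long cls = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1d coherency line size: %ld bytes\n", cls);
    return 0;
}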
#pragma omp parallel sections
{
    #pragma omp section
    {
        for (size_t i=0; i<N/2; ++i) {
            sum[0] += i;      // this section hammers sum[0]...
        }
    }
    #pragma omp section
    {
        for (size_t i=N/2; i<N; ++i) {
            sum[OFFSET] += i; // ...while this one hammers sum[OFFSET]:
                              // same cache line for small OFFSET -> false sharing
        }
    }
}
Vary OFFSET from 1 to 15 and see what happens. Compile with -O0: with optimizations on, the compiler accumulates in registers and bye-bye false sharing. A padded variant that avoids the problem even at -O0 is sketched below.
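A minimal sketch of the classic fix (not from the original slides; N and the element type are assumptions): pad each accumulator onto its own 64-byte cache line with C11 alignas, so the two sections never invalidate each other's line. Compile with gcc -fopenmp -O0.

#include <stdalign.h> // alignas (C11)
#include <stddef.h>   // size_t
#include <stdio.h>

#define N (1L<<30)    // assumed iteration count

// One accumulator per 64-byte cache line: no false sharing.
struct padded { alignas(64) long val; };
static struct padded sum[2];

int main(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        for (size_t i=0; i<N/2; ++i)
            sum[0].val += i;
        #pragma omp section
        for (size_t i=N/2; i<N; ++i)
            sum[1].val += i;
    }
    printf("%ld %ld\n", sum[0].val, sum[1].val);
    return 0;
}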
Core 2 Duo (Penryn)
i7 980 (Gulftown, 32 nm shrink of Nehalem)
2 × Xeon E5-2680 v2
Note the "Intel" numbering: the physical cores come first, then the logical ones.
$ lstopo -i topo_7210_knl.xml
These are 32 tiles of two cores each, every core with 2 VPUs (vector processing units).
How do you turn a NUMA machine into a UMA one?
A machine with 4 AMD Bulldozer packages. Note the complexity of the topology!
$ numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32765 MB
node 0 free: 2833 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node 1 free: 20462 MB
...
node 7 cpus: 56 57 58 59 60 61 62 63
node 7 size: 32752 MB
node 7 free: 25099 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  22  16  22  16  22
  1:  16  10  22  16  22  16  22  16
  2:  16  22  10  16  16  22  16  22
  3:  22  16  16  10  22  16  22  16
  4:  16  22  16  22  10  16  16  22
  5:  22  16  22  16  16  10  22  16
  6:  16  22  16  22  16  22  10  16
  7:  22  16  22  16  22  16  16  10
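The same distance table can be read programmatically. A minimal sketch (an assumption, not from the slides), using libnuma; compile with -lnuma:

#include <stdio.h>
#include <numa.h> // libnuma: numa_available(), numa_max_node(), numa_distance()

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }
    int nodes = numa_max_node() + 1;
    // Print the node-to-node distance matrix, like numactl --hardware does
    for (int i = 0; i < nodes; ++i) {
        for (int j = 0; j < nodes; ++j)
            printf("%4d", numa_distance(i, j));
        printf("\n");
    }
    return 0;
}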
$ numactl --show
$ numactl --hardware
$ numastat
$ numatop
$ tiptop # while we're at it!
numa.c
#include <stddef.h> // size_t
#include <stdlib.h> // malloc()
#include <omp.h>

#define N (1L<<34)

int main(int argc, char **argv)
{
    float *a = NULL;
    a = malloc(N*sizeof(float)); // 64 GiB
loop: a = a; // no-op, just something for the label to attach to
    #pragma omp parallel for
    for (size_t i=0L; i<N; ++i)
        a[i] = (float)i; // first touch in parallel: pages spread across nodes
    goto loop;
}
We monitor how the NUMA nodes fill up.
$ watch -n 1 numactl --hardware
numa-trash.c
#include <stddef.h> // size_t
#include <stdlib.h> // malloc()
#include <omp.h>

#define N (1L<<34)

int main(int argc, char **argv)
{
    float *a = NULL;
    a = malloc(N*sizeof(float)); // 64 GiB
    for (size_t i=0L; i<N; i+=4096/sizeof(float))
        a[i] = 1.0f; // touch one float per page, single-threaded:
                     // first touch places every page on one node
loop: a = a; // skip
    #pragma omp parallel for
    for (size_t i=0L; i<N; ++i)
        a[i] = (float)i;
    goto loop;
}
Same as before, but now we can see that something rebalances the pages automagically.
$ watch -n 1 numactl --hardware
root@zx81:~# dmesg | grep -i NUMA
[    0.000000] NUMA: Initialized distance table, cnt=2
[    0.000000] NUMA: Node 0 [mem 0x00000000-0x7fffffff] + [mem 0x100000000-0x107fffffff] -> [mem 0x00000000-0x107fffffff]
[    0.000000] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[    0.248548] pci_bus 0000:00: on NUMA node 0
[    0.251704] pci_bus 0000:80: on NUMA node 1
root@zx81:~# sysctl kernel.numa_balancing
kernel.numa_balancing = 1
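To repeat the page-placement experiments without the automatic balancer interfering, the same sysctl knob mentioned in dmesg can be switched off (as root):

root@zx81:~# sysctl -w kernel.numa_balancing=0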
We can pin it to cores and to NUMA nodes:
$ numactl --cpunodebind=1 --membind=1 ./a.out
We can ask it to change the allocation policy:
$ numactl --interleave=all ./a.out
With interleave we solve the numa-trash.c problem without using NUMA balancing.
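The same policy can be requested from inside the program. A minimal sketch (an assumption, not from the slides) using libnuma, linked with -lnuma; numa_alloc_interleaved() spreads the pages round-robin over all nodes at allocation time:

#include <stddef.h> // size_t
#include <stdio.h>
#include <numa.h>   // libnuma: numa_alloc_interleaved(), numa_free()

#define N (1L<<34)

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }
    // Pages are interleaved over all nodes, so the single-threaded
    // first touch of numa-trash.c no longer concentrates them on one node.
    float *a = numa_alloc_interleaved(N*sizeof(float)); // 64 GiB
    for (size_t i=0L; i<N; i+=4096/sizeof(float))
        a[i] = 1.0f;
    numa_free(a, N*sizeof(float));
    return 0;
}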
GOMP_CPU_AFFINITY

GCC's libgomp binds OpenMP threads to the listed CPUs, in order; entries can be single CPUs, ranges, or strided ranges. Here thread 0 goes to CPU 0, thread 1 to CPU 3, threads 2-3 to CPUs 1-2, and the rest to the even CPUs 4, 6, ..., 14:

$ GOMP_CPU_AFFINITY="0 3 1-2 4-15:2" ./a.out
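A minimal sketch (an assumption, not from the slides) to check where each thread actually landed, using glibc's sched_getcpu():

#define _GNU_SOURCE
#include <sched.h> // sched_getcpu() (glibc)
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        // Each thread reports the CPU it is currently running on
        printf("thread %d on cpu %d\n", omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}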
OMP_PLACES: threads, cores, sockets

$ OMP_PLACES=sockets ./a.out
OMP_PROC_BIND: true, master, close, spread

$ OMP_PROC_BIND=spread ./a.out
OMP_MAX_ACTIVE_LEVELS

Caps how many levels of nested parallel regions may actually be active (fork threads):

$ OMP_NESTED=1 OMP_MAX_ACTIVE_LEVELS=2 ./a.out
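A minimal sketch (an assumption, not from the slides) showing two active levels; run it with the environment above:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel num_threads(2) // level 1
    {
        #pragma omp parallel num_threads(2) // level 2: forks only if nesting is active
        {
            printf("level %d: thread %d of outer thread %d\n",
                   omp_get_level(), omp_get_thread_num(),
                   omp_get_ancestor_thread_num(1));
        }
    }
    return 0;
}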
To run on KNL (note that OMP_PROC_BIND takes one value per nesting level):

OMP_NESTED=1 OMP_MAX_ACTIVE_LEVELS=2 OMP_NUM_THREADS=256 \
OMP_PROC_BIND=spread,spread OMP_PLACES=cores ./time_dgemm_icc_mkl 8192 8192 8192
@dgxpascal:~$ nvidia-smi topo -m
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  mlx5_0  mlx5_2  mlx5_1  mlx5_3  CPU Affinity
GPU0     X    NV1   NV1   NV1   NV1   SOC   SOC   SOC   PIX     SOC     PHB     SOC     0-19,40-59
GPU1    NV1    X    NV1   NV1   SOC   NV1   SOC   SOC   PIX     SOC     PHB     SOC     0-19,40-59
GPU2    NV1   NV1    X    NV1   SOC   SOC   NV1   SOC   PHB     SOC     PIX     SOC     0-19,40-59
GPU3    NV1   NV1   NV1    X    SOC   SOC   SOC   NV1   PHB     SOC     PIX     SOC     0-19,40-59
GPU4    NV1   SOC   SOC   SOC    X    NV1   NV1   NV1   SOC     PIX     SOC     PHB     20-39,60-79
GPU5    SOC   NV1   SOC   SOC   NV1    X    NV1   NV1   SOC     PIX     SOC     PHB     20-39,60-79
GPU6    SOC   SOC   NV1   SOC   NV1   NV1    X    NV1   SOC     PHB     SOC     PIX     20-39,60-79
GPU7    SOC   SOC   SOC   NV1   NV1   NV1   NV1    X    SOC     PHB     SOC     PIX     20-39,60-79
mlx5_0  PIX   PIX   PHB   PHB   SOC   SOC   SOC   SOC    X      SOC     PHB     SOC
mlx5_2  SOC   SOC   SOC   SOC   PIX   PIX   PHB   PHB   SOC      X      SOC     PHB
mlx5_1  PHB   PHB   PIX   PIX   SOC   SOC   SOC   SOC   PHB     SOC      X      SOC
mlx5_3  SOC   SOC   SOC   SOC   PHB   PHB   PIX   PIX   SOC     PHB     SOC      X

Legend:
  X   = Self
  SOC = Connection traversing PCIe as well as the SMP link between CPU sockets (e.g. QPI)
  PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX = Connection traversing a single PCIe switch
  NV# = Connection traversing a bonded set of # NVLinks