SIMD

Summary:

SIMD.
Operation.
Data types.
Operations.
Handling divergence across lanes.
How to generate SIMD code:
Intrinsics.
Optimizing compiler.

Nicolás Wolovick 20200413

SIMD

Single Instruction Multiple Data

SIMD in Flynn's taxonomy

We will focus on one particular version.

`AVX2`

Circa 2013, on Haswell and later.
256-bit vector registers YMM0, YMM1, ...
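
As a rough illustration of what 256 bits buy, the sketch below counts how many elements of each size fit in one YMM register. The names `YMM_BITS`, `lanes_per_ymm` and `ymm_view` are made up for this example; they are not part of any Intel API.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bits in one AVX/AVX2 vector register (YMM0, YMM1, ...). */
enum { YMM_BITS = 256 };

/* Hypothetical helper: how many elements of the given size fit in one YMM. */
unsigned lanes_per_ymm(unsigned element_bytes) {
    return YMM_BITS / (8u * element_bytes);
}

/* The same 32 bytes, viewed as different packed data types. */
typedef union {
    float   f32[8];
    double  f64[4];
    int32_t i32[8];
    int16_t i16[16];
    int8_t  i8[32];
} ymm_view;

static_assert(sizeof(ymm_view) == YMM_BITS / 8, "a YMM register holds 32 bytes");
```

For example, `lanes_per_ymm(sizeof(float))` gives 8 lanes and `lanes_per_ymm(sizeof(double))` gives 4.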

Vector operations

SIMD is ...

Element-wise operations on vectors

Architecturally, this means
amortizing one control unit over many ALUs.
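
A minimal sketch of an element-wise vector operation, assuming the GCC/Clang vector extensions (not ISO C). The type `v8sf` and the function `mul8` are illustrative names; which machine instructions end up implementing the multiplication depends on the compiler and on `-march`.

```c
#include <assert.h>
#include <string.h>

static_assert(sizeof(float) == 4, "the example assumes 32-bit floats");

/* GCC/Clang extension: a vector of 8 packed floats (32 bytes). */
typedef float v8sf __attribute__((vector_size(32)));

/* dst[j] = a[j] * b[j] for j in 0..7, written as ONE vector expression. */
void mul8(float dst[8], const float a[8], const float b[8]) {
    v8sf va, vb, vc;
    memcpy(&va, a, sizeof va);   /* load 8 floats into a vector value */
    memcpy(&vb, b, sizeof vb);
    vc = va * vb;                /* a single operation over all 8 lanes */
    memcpy(dst, &vc, sizeof vc); /* store the 8 results */
}
```

With `-march=haswell` one would expect the product to become a single `vmulps` on YMM registers, but that is up to the compiler.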

Data types

Every architecture has SIMD

Intel: MMX (64 bits), SSE (128 bits), AVX (256 bits), AVX-512 (512 bits).
AMD: SSE (128 bits), AVX (256 bits).
ARM: Neon (128 bits); SVE (variable width, from 128 up to 2048 bits).
PowerPC: VSU (128 bits).
NVIDIA: SM (1024 bits).
AMD Radeon: CU (1024 bits or 2048 bits).

WikiChip, A64

Trivial parallelism: map

Element-wise multiplication a = b*c

    #define N (1<<25)
    float a[N], b[N], c[N];
    int main(void) {
        for (unsigned int i=0; i<N; ++i) {
            a[i] = b[i]*c[i];
        }
    }

What strategy would we follow to parallelize this with SIMD?
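
One possible answer, sketched under the assumption of 8 float lanes as in AVX2: strip-mine the loop into blocks of 8 elements (each block maps onto one vector operation) plus a scalar remainder loop. `LANES` and `mul_map_stripmined` are hypothetical names, and the inner loop is kept in scalar C only to show the shape of the transformation.

```c
#include <assert.h>
#include <stddef.h>

static_assert(sizeof(float) == 4, "8 lanes per 256-bit vector assumes 32-bit floats");

enum { LANES = 8 }; /* assumption: 256-bit vectors of 32-bit floats */

void mul_map_stripmined(float *a, const float *b, const float *c, size_t n) {
    size_t i = 0;
    /* Main loop: whole blocks of LANES elements.
     * Each block is what a single vector load/mul/store would cover. */
    for (; i + LANES <= n; i += LANES) {
        for (size_t j = 0; j < LANES; ++j)
            a[i + j] = b[i + j] * c[i + j];
    }
    /* Remainder loop: the last n % LANES elements, one at a time. */
    for (; i < n; ++i)
        a[i] = b[i] * c[i];
}
```

When `n` is a multiple of 8, as with `N = 1<<25`, the remainder loop never runs.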

Without vectorization

We compile and run on jupiterace

    $ gcc-10 -O1 multmap.c -o novect && perf stat -e cycles,instructions,cache-references,cache-misses -r 16 ./novect

     Performance counter stats for './novect' (16 runs):

           131,732,516      cycles                                                        ( +-  0.41% )
           206,679,151      instructions              #    1.57  insn per cycle           ( +-  0.04% )
               636,709      cache-references                                              ( +-  5.12% )
                99,751      cache-misses              #   15.667 % of all cache refs      ( +-  0.50% )

              0.061594 +- 0.000407 seconds time elapsed  ( +-  0.66% )

Turning on autovectorization

    $ gcc-10 -O1 -march=haswell -ftree-vectorize -fopt-info-vec -fopt-info-vec-missed multmap.c -o vect && perf stat -e cycles,instructions,cache-references,cache-misses -r 16 ./vect
    multmap.c:8:2: optimized: loop vectorized using 32 byte vectors

     Performance counter stats for './vect' (16 runs):

            97,294,559      cycles                                                        ( +-  0.31% )
            30,274,808      instructions              #    0.31  insn per cycle           ( +-  0.26% )
             3,113,949      cache-references                                              ( +-  2.38% )
               103,556      cache-misses              #    3.326 % of all cache refs      ( +-  1.02% )

              0.047125 +- 0.000207 seconds time elapsed  ( +-  0.44% )

We use -fopt-info-vec to see what it managed to vectorize.
We use -fopt-info-vec-missed to see what it could NOT vectorize.
-O3 includes vectorization, but also many other transformations that make the assembly harder to read.
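
To try these flags on something other than `multmap.c`, here is a hedged pair of loops: one that is a natural candidate for the vectorizer and one with a loop-carried dependency that it is expected to report as missed. The function names are made up for this sketch, and what a given compiler version actually reports should be checked with the flags above.

```c
#include <assert.h>
#include <stddef.h>

static_assert(sizeof(float) == 4, "the example assumes 32-bit floats");

/* Independent iterations and no aliasing (restrict):
 * a good candidate for the autovectorizer. */
void mul_map(float *restrict a, const float *restrict b,
             const float *restrict c, size_t n) {
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] * c[i];
}

/* Loop-carried dependency: iteration i needs the result of iteration i-1,
 * so the element-wise strategy does not apply directly. */
void running_product(float *a, size_t n) {
    for (size_t i = 1; i < n; ++i)
        a[i] = a[i - 1] * a[i];
}
```

The second loop is still correct scalar code; it just does not fit the map pattern.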

In `clang` aka `LLVM`

See the diagnostics section of Vectorizers.

The options are different:

-Rpass=loop-vectorize
-Rpass-missed=loop-vectorize
-Rpass-analysis=loop-vectorize

Comparison

          N  SIMD  cycles  instr  walltime
    (1<<25)    no    370M   292M     0.14s
    (1<<25)   yes    355M   142M     0.13s
    (1<<26)    no    740M   583M     0.28s
    (1<<26)   yes    695M   280M     0.26s
    (1<<27)    no   1463M  1171M     0.56s
    (1<<27)   yes   1383M   566M     0.52s

The code is completely memory-bound, with an arithmetic intensity of 1 FLOP per 8 bytes loaded (12 bytes if the store is counted).

Even so, reading wide helps a tiny bit.
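
The arithmetic-intensity figure can be reproduced with a few lines of C. This is only a worked calculation; `flops_per_byte` is a hypothetical helper, and whether the store is counted is a modelling choice.

```c
#include <assert.h>

static_assert(sizeof(float) == 4, "byte counts below assume 32-bit floats");

/* FLOPs per byte of memory traffic for one loop iteration. */
double flops_per_byte(unsigned flops, unsigned floats_loaded,
                      unsigned floats_stored) {
    double bytes = (double)sizeof(float) * (floats_loaded + floats_stored);
    return (double)flops / bytes;
}
```

For `a[i] = b[i]*c[i]`, `flops_per_byte(1, 2, 0)` gives 1/8 FLOP per byte (loads only) and `flops_per_byte(1, 2, 1)` gives 1/12 once the store to `a[i]` is counted.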

Mostafa Hagog, Looking for 4x speedups? SSE™ to the rescue!, Intel, 2006.

Under the hood

gcc-10 -S -O1 -march=haswell multmap.c

    .L2:
        vmovss  (%rcx,%rax), %xmm0
        vmulss  (%rdx,%rax), %xmm0, %xmm0
        vmovss  %xmm0, (%rsi,%rax)
        addq    $4, %rax
        cmpq    $134217728, %rax
        jne .L2

gcc-10 -S -O1 -march=haswell -ftree-vectorize multmap.c

    .L2:
        vmovaps (%rcx,%rax), %ymm1
        vmulps  (%rdx,%rax), %ymm1, %ymm0
        vmovaps %ymm0, (%rsi,%rax)
        addq    $32, %rax
        cmpq    $134217728, %rax
        jne .L2

Note:

%rax advances 8 times faster (the loop runs 8x fewer iterations).
It uses instructions with the ps suffix: packed single.
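
A quick back-of-the-envelope check of the two listings, using only numbers that appear in them (the loop bound 134217728, the steps 4 and 32, and 6 instructions per iteration). The helper names are hypothetical.

```c
#include <assert.h>

static_assert(sizeof(float) == 4, "134217728 bytes = (1<<25) floats of 4 bytes");

enum {
    TOTAL_BYTES = 134217728,  /* bound in cmpq: (1<<25) floats * 4 bytes */
    SCALAR_STEP = 4,          /* addq $4:  one float per iteration       */
    VECTOR_STEP = 32,         /* addq $32: eight floats per iteration    */
    INSTR_PER_ITERATION = 6   /* instructions in each .L2 loop body      */
};

unsigned long loop_iterations(unsigned long total_bytes,
                              unsigned long step_bytes) {
    return total_bytes / step_bytes;
}

unsigned long loop_instructions(unsigned long iterations) {
    return iterations * INSTR_PER_ITERATION;
}
```

This gives 33554432 scalar iterations against 4194304 vector iterations (a factor of 8), that is, about 201M against 25M instructions in the loop. Those figures are roughly in line with the 206,679,151 and 30,274,808 instructions reported by `perf stat` earlier.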

Divergent lanes

    for (unsigned int i=0; i<N; ++i) {
        if (b[i]<c[i])
            a[i] = b[i]*c[i];
    }

Coding Game, Masking and Conditional Load.
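
The conditional loop above could be written by hand with intrinsics roughly as follows. This is a sketch, not the code from `mapcond.c`: the function name `mul_map_cond` is invented, unaligned loads are assumed, and the vector path is only compiled when the compiler defines `__AVX__` (for instance with `-march=haswell`); otherwise the scalar loop does all the work.

```c
#include <assert.h>
#include <stddef.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

static_assert(sizeof(float) == 4, "8 lanes per YMM assumes 32-bit floats");

/* a[i] = b[i]*c[i] only where b[i] < c[i]; other a[i] are left untouched. */
void mul_map_cond(float *a, const float *b, const float *c, size_t n) {
    size_t i = 0;
#ifdef __AVX__
    for (; i + 8 <= n; i += 8) {
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_loadu_ps(c + i);
        /* Per-lane compare: all ones where b < c, all zeros elsewhere. */
        __m256 lt = _mm256_cmp_ps(vb, vc, _CMP_LT_OQ);
        /* Every lane computes the product, even the ones that lose. */
        __m256 prod = _mm256_mul_ps(vb, vc);
        /* Masked store: only lanes whose mask sign bit is set are written. */
        _mm256_maskstore_ps(a + i, _mm256_castps_si256(lt), prod);
    }
#endif
    /* Scalar remainder (or the whole loop when AVX is not enabled). */
    for (; i < n; ++i)
        if (b[i] < c[i])
            a[i] = b[i] * c[i];
}
```

The divergence is handled by doing the work in all lanes and discarding the unwanted results through the mask, instead of branching per element.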

VMASKMOVPS - 256-bit load

DEST[31:0]←IF (SRC1[31]) Load_32(mem) ELSE 0
DEST[63:32]←IF (SRC1[63]) Load_32(mem + 4) ELSE 0
DEST[95:64]←IF (SRC1[95]) Load_32(mem + 8) ELSE 0
DEST[127:96]←IF (SRC1[127]) Load_32(mem + 12) ELSE 0
DEST[159:128]←IF (SRC1[159]) Load_32(mem + 16) ELSE 0
DEST[191:160]←IF (SRC1[191]) Load_32(mem + 20) ELSE 0
DEST[223:192]←IF (SRC1[223]) Load_32(mem + 24) ELSE 0
DEST[255:224]←IF (SRC1[255]) Load_32(mem + 28) ELSE 0

It is a vector cmov (conditional move).

The Intel Intrinsics Guide presents it like this

    FOR j := 0 to 7
        i := j*32
        IF mask[i+31]
            dst[i+31:i] := MEM[mem_addr+i+31:mem_addr+i]
        ELSE
            dst[i+31:i] := 0
        FI
    ENDFOR
    dst[MAX:256] := 0
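
The same semantics can be emulated in plain scalar C, which may help when reading the pseudocode. `maskload8_emulated` is a hypothetical name; the real instruction does all eight lanes at once.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(float) == 4, "each lane is a 32-bit float");

/* Scalar emulation of a 256-bit masked load of 8 floats.
 * Lane j is loaded from mem[j] when bit 31 (the sign bit) of mask[j]
 * is set; otherwise the lane is zeroed and mem[j] is not read. */
void maskload8_emulated(float dst[8], const float *mem, const int32_t mask[8]) {
    for (int j = 0; j < 8; ++j) {
        /* For a two's complement int32_t, "negative" means bit 31 is set. */
        dst[j] = (mask[j] < 0) ? mem[j] : 0.0f;
    }
}
```

Only the top bit of each mask element matters: a mask value of 1 or 0x7fffffff leaves the lane at zero, while -1 or INT32_MIN loads it.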

Intel Intrinsics Guide: LOTS of info

We review `mapcond.c`, `mapcond_omp.c`

Compile with gcc
Compile with clang
Compile with -march={penryn, haswell, knl}

From AVX to AVX-512

x86/x64 SIMD Instruction List (SSE to AVX512)

Architecture + compiler

Only from Sandy Bridge onward is clang-9 able to generate the vector multiplication mulps.

    $ for arch in core2 penryn nehalem sandybridge ivybridge haswell broadwell skylake skx cascadelake icelake-server knl knm; do clang-9 -O1 -ftree-vectorize -march=$arch mapcond.c -S; echo $arch; grep mulps mapcond.s; done
    core2
    penryn
    nehalem
    sandybridge
        vmulps  %ymm1, %ymm0, %ymm0
    ivybridge
        vmulps  %ymm1, %ymm0, %ymm0
    haswell
        vmulps  %ymm1, %ymm0, %ymm0
    broadwell
        vmulps  %ymm1, %ymm0, %ymm0
    skylake
        vmulps  %ymm1, %ymm0, %ymm0
    skx
        vmulps  %zmm1, %zmm0, %zmm0
    cascadelake
        vmulps  %zmm1, %zmm0, %zmm0
    icelake-server
        vmulps  %zmm1, %zmm0, %zmm0
    knl
        vmulps  %zmm1, %zmm0, %zmm0
    knm
        vmulps  %zmm1, %zmm0, %zmm0

Note: gcc-10 is not able to vectorize it correctly!
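
The jump from nothing to ymm to zmm in the listing follows the instruction sets that each `-march` value enables. A small sketch that reports the widest float vector extension enabled at compile time, using the macros `__SSE__`, `__AVX__` and `__AVX512F__` that GCC and Clang predefine; `compiled_vector_bits` is a hypothetical name, and the compiler remains free to use narrower vectors than this.

```c
#include <assert.h>

static_assert(sizeof(float) == 4, "widths below are counted in 32-bit float lanes");

/* Widest x86 float vector width (in bits) enabled by the compile flags,
 * or 0 when none of the tested extensions is enabled. */
unsigned compiled_vector_bits(void) {
#if defined(__AVX512F__)
    return 512;   /* zmm registers, e.g. -march=skx or -march=knl  */
#elif defined(__AVX__)
    return 256;   /* ymm registers, e.g. -march=sandybridge        */
#elif defined(__SSE__)
    return 128;   /* xmm registers, e.g. -march=core2              */
#else
    return 0;     /* no x86 SIMD extension enabled (or not x86)    */
#endif
}
```

Dividing the result by 32 gives the number of float lanes the loop could process per instruction on that target.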

Bibliography

Patterson, Hennessy, Computer Organization and Design, Fifth Edition, Morgan Kaufmann, 2012.
Intel, Intel Intrinsics Guide, 2020.
Jan Finis, x86 Intrinsics Cheat Sheet, v2.2.
Daytime, x86/x64 SIMD Instruction List (SSE to AVX512), 2020.
