SIMD, Implementación

Presenter Notes

Resumen:

Nicolás Wolovick 20200420

Presenter Notes

SIMD units

Presenter Notes

AMD Bulldozer (2011)

Presenter Notes

AMD Bulldozer (2011)

Por cada dos cores de enteros:

  • Una unidad de 256b FMAC (multiply and add)
  • Una unidad de 256b de ADD.

Unidad FP compartida:

  • Puede ejecutar AVX.
  • Desde Bulldozer GEN4 (Excavator) soporta AVX2.
  • También se comporta como 2 unidades de SSE4.x

¡Hay veces que en esta plataforma conviene forzar AVX de 128 bits!

AMD, Compiler Options Quick Reference Guide, 2011.

This afternoon i added c++OPT = -O3 -mprefer-avx128 -ftree-vectorize -ffast-math (same for cOpt), I got 20 to 25 better performance on speed , it was with gcc 4.6.

Presenter Notes

AMD Bulldozer (2011), pico teórico

Siempre fp64 para ser compatible con Top500.

freq * 2 units * 2 ops/cycle * 2 fp64 (128b) * (cores/2)

Caso típico: TUPAC, AMD Opteron 6276 "Interlagos"

2.3 * 2 * 2 * 2 * (16/2) = 147 GFLOPS

Presenter Notes

Intel Ivy Bridge (2012)

Die shrink a 22n de Sandy Bridge + RDRANDs

Presenter Notes

Intel Ivy Bridge (2012)

  • Una unidad 256b por core.
  • AVX, no FMA.
  • 2.8 GHz, estable ;)

Pico teórico

freq * 2 units * 1 ops/cycle * 4 fp64 (256b) * cores

Caso típico: Mendieta fase 2, Intel E5-2680v2

2.8 * 2 * 1 * 4 * 10 = 224 GFLOPS

Presenter Notes

High performance ARMs (2012)

  • ARM Cortex A9, BD-SL-i.MX6, Freescale i.mx6, 1.0 GHz, 4 cores.
  • ARM Cortex A15, Google Nexus 10, Samsung Exynos 5250, 1.7 GHz, 2 cores.
  • Qualcomm Scorpion, Samsung Galaxy SIIX, Qualcomm APQ8060, 1.5 GHz, 2 cores.
  • Qualcomm Krait 200, Blackberry Z10, Qualcomm MSM8960, 1.5 GHz, 2 cores.
  • Qualcomm Krait 300, HTC One, Qualcomm Snapdragon 600, 1.7 GHz, 4 cores.

Presenter Notes

Implementaciones

Rahul Garg, Exploring the Floating Point Performance of Modern ARM Processors, 2013.

ARM especifica arquitectura, muchas implementaciones posibles.

Presenter Notes

Optimizaciones ARM NEON

jpegtran works by rearranging the compressed data (DCT coefficients), without ever fully decoding the image. Therefore, its transformations are lossless: there is no image degradation at all,...

Vlad Krasnov, NEON is the new black: fast JPEG optimization on ARM server, Cloudflare, 13 Apr 2018.

Presenter Notes

Arquitecturas multi-ancho

Intel ISA para vectores de:

  • 128b (SSE)
  • 256b (AVX)
  • 512b (AVX-512)

¿Cuál uso?

Lo único importante TTS Time To Solution.

(luego será importante ETS, Energy to Solution)

"¿Che Fabián, vale la pena AVX?".

Fabian Giesen @rygorous, SSE/AVX matrix multiply, 2012

1 union Mat44 {
2     float m[4][4];
3     __m128 row[4];
4 };
  • Mirar Sandy Bridge, IvyBrige, Haswell, Zen1 ...

(reemplazar #include <intrim.h> por #include <x86intrin.h>)

Presenter Notes

Heat

Presenter Notes

Dark silicon y DVFS

DVFS, dynamic voltage and frequency scaling, desde Haswell&Broadwell empieza:

Just as in previous archs, “Broadwell” CPUs include the Turbo Boost feature which allows each processor core to operate well above the “base” clock speed during most operations. The precise clock speed increase depends upon the number & intensity of tasks running on each CPU. However, Turbo Boost speed increases also depend upon the types of instructions (AVX vs. Non-AVX)

Microway, Detailed Specifications of the Intel Xeon E5-2600v4 “Broadwell-EP” Processors, 2016.

Presenter Notes

Caso de estudio: Huayra Muyu SMN

Xeon Scalable 6142, Skylake

freq * 2 units * 2 ops/cycle * 8 fp64 (512b) * cores

2.6 * 2 * 2 * 8 * 16 = 1331.2 GFLOPS

Pero lo que hay que tomar es Base AVX-512 Core Frequency.
1.6 * 2 * 2 * 8 * 16 = 819.2 GFLOPS por pastilla, un total de 209.7 TFLOPS para todo el sistema de 128x2 pastillas.

Presenter Notes

Time to change V and F

Presenter Notes

AVX-512 throttling (down!)

The observed behavior is a sad side effect. There are many libraries that use AVX and AVX2 instructions out there, they will probably be updated to AVX-512 at some point, and users are not likely to be aware of the implementation details. If you do not require AVX-512 for some specific high performance tasks, I suggest you disable AVX-512 execution on your server or desktop, to avoid accidental AVX-512 throttling.

Vlad Krasnov, On the dangers of Intel's frequency scaling, Cloudflare, 10 Nov 2017.

Presenter Notes

Transiciones de dVFs AVX-512

El tiempo para volver a ejecutar instrucciones Non-AVX puede ser tan grande como 680µs, del orden de 1 millón de ciclos de procesador.

Daniel Lemire, The dangers of AVX-512 throttling: myth or reality?, 2018.
Travis Downs, Gathering Intel on Intel AVX-512 Transitions, 2020.

Presenter Notes

Firmware update para CPUs

Los procesadores se actualizan, los algoritmos de DVFS tb!

1 $ sudo dmesg | grep -i microco

Ejemplo: Update Zen2 de AMD

The AMD 3rd Gen Ryzen Deep Dive Review: 3700X and 3900X Raising The Bar.

(7/8): We've noticed large frequency boost behaviour changes with new motherboard firmware that was released on launch day (7/7). We are currently re-running all our test suite numbers and updating the article with the new data soon as applicable. For further details please read here.

AMD Releases New Chipset Drivers For Ryzen 3000: More Relaxed CPPC2 Upscaling.

Presenter Notes

Resumen

John McCalpin on Base AVX-512 Core Frequency

The actual frequency when running compute-intensive AVX512 workloads depends on the unique characteristics of the specific piece of silicon (particularly leakage current), as well as the characteristics of the cooling system (ambient temperature, heat sink thermal conductivity, air flow rate, etc).

We have 3472 Xeon Platinum 8160 (24-core0 processors in 1736 two-socket nodes. The Base AVX-512 Core Frequency for these processors is 1.4 GHz and the maximum 24-core AVX-512 frequency is 2.0 GHz. When running Intel's optimized LINPACK benchmark, we see that the average frequency of these processors varies between about 1.52 GHz and about 1.73 GHz, with sustained (LINPACK) performance varying by the same proportions.

Presenter Notes

Bibliografía

Presenter Notes

Presenter Notes