Nicolás Wolovick 20200420
For every two integer cores:
Shared FP unit:
On this platform it sometimes pays to force 128-bit AVX!
AMD, Compiler Options Quick Reference Guide, 2011.
This afternoon i added
c++OPT = -O3 -mprefer-avx128 -ftree-vectorize -ffast-math
(same for cOpt), I got 20 to 25 better performance on speed , it was with gcc 4.6.
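The options in the quote above only help on loops the compiler can actually vectorize. A minimal sketch of such a loop follows; the function name `daxpy` and the file name `daxpy.c` are illustrative assumptions, not taken from the quoted post.

```c
#include <stddef.h>

/* y[i] += a * x[i] for i in [0, n).
   `restrict` promises the compiler that x and y do not overlap,
   which removes the aliasing check that often blocks auto-vectorization. */
void daxpy(size_t n, double a, const double *restrict x, double *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

Something like `gcc -O3 -mavx -mprefer-avx128 -ftree-vectorize -ffast-math -c daxpy.c` would apply the quoted options. The `-mavx` flag is an addition assumed here, since preferring 128-bit vectors only matters once AVX code generation is enabled. Newer GCC releases appear to spell the option `-mprefer-vector-width=128`.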
Always fp64,
to be comparable with Top500.
freq * 2 units * 2 ops/cycle * 2 fp64 (128b) * (cores/2)
Typical case: TUPAC, AMD Opteron 6276 "Interlagos"
2.3 * 2 * 2 * 2 * (16/2) = 147 GFLOPS
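The peak formula above is a plain product of five factors, sketched below as a small C helper. The name `peak_gflops` and the parameter names are hypothetical, and the meaning given to each factor in the comments is a reading of the slide rather than something it states.

```c
/* Theoretical fp64 peak in GFLOPS, following the slide's formula:
   freq * units * ops/cycle * fp64 lanes * FP units on the chip.
   freq_ghz      clock frequency in GHz
   units         FP pipes per FP unit (assumed meaning)
   ops_per_cycle 2 for fused multiply-add, 1 otherwise (assumed meaning)
   lanes         fp64 elements per vector register
   fp_units      number of FP units on the chip (cores/2 for a shared FPU) */
double peak_gflops(double freq_ghz, int units, int ops_per_cycle,
                   int lanes, int fp_units)
{
    return freq_ghz * units * ops_per_cycle * lanes * fp_units;
}
```

For the Opteron 6276 the call would be `peak_gflops(2.3, 2, 2, 2, 16 / 2)`, where the last argument reflects one shared FP unit per pair of integer cores.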
Die shrink of Sandy Bridge to 22 nm, plus RDRAND
freq * 2 units * 1 ops/cycle * 4 fp64 (256b) * cores
Typical case: Mendieta phase 2, Intel E5-2680v2
2.8 * 2 * 1 * 4 * 10 = 224 GFLOPS
Rahul Garg, Exploring the Floating Point Performance of Modern ARM Processors, 2013.
ARM specifies the architecture; many implementations are possible.
jpegtran works by rearranging the compressed data (DCT coefficients), without ever fully decoding the image. Therefore, its transformations are lossless: there is no image degradation at all,...
Vlad Krasnov, NEON is the new black: fast JPEG optimization on ARM server, Cloudflare, 13 Apr 2018.
Intel ISA for vectors of:
Which one should I use?
The only thing that matters is TTS, Time To Solution.
(later ETS, Energy to Solution, will matter too)
Fabian Giesen @rygorous, SSE/AVX matrix multiply, 2012
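// 4x4 float matrix, viewed either as scalars or as four SSE rows
union Mat44 {
    float m[4][4];   // scalar access: m[row][col]
    __m128 row[4];   // vector access: one 128-bit register per row
};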
(replace #include <intrin.h> with #include <x86intrin.h>)
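To show how the union above is used, here is a minimal sketch of a 4x4 single-precision multiply with SSE, written in the spirit of the cited post but not copied from it. The function names `lincomb_sse` and `matmult_sse` and the `typedef` are choices made for this sketch, and it assumes an x86 target with GCC or Clang.

```c
#include <x86intrin.h>

typedef union Mat44 {
    float  m[4][4];
    __m128 row[4];
} Mat44;

/* Returns a[0]*B.row[0] + a[1]*B.row[1] + a[2]*B.row[2] + a[3]*B.row[3].
   Each _mm_shuffle_ps broadcasts one element of a to all four lanes. */
static inline __m128 lincomb_sse(__m128 a, const Mat44 *B)
{
    __m128 r;
    r = _mm_mul_ps(_mm_shuffle_ps(a, a, 0x00), B->row[0]);
    r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(a, a, 0x55), B->row[1]));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(a, a, 0xaa), B->row[2]));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(a, a, 0xff), B->row[3]));
    return r;
}

/* out = A * B for row-major matrices: row i of the product is the
   linear combination of the rows of B weighted by row i of A.
   All rows are computed before storing, so out may alias A or B. */
void matmult_sse(Mat44 *out, const Mat44 *A, const Mat44 *B)
{
    __m128 r0 = lincomb_sse(A->row[0], B);
    __m128 r1 = lincomb_sse(A->row[1], B);
    __m128 r2 = lincomb_sse(A->row[2], B);
    __m128 r3 = lincomb_sse(A->row[3], B);
    out->row[0] = r0;
    out->row[1] = r1;
    out->row[2] = r2;
    out->row[3] = r3;
}
```

The row-as-linear-combination formulation avoids any transpose or horizontal add: every instruction works on a full 128-bit register, which is why this layout maps well onto SSE.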
DVFS, dynamic voltage and frequency scaling. Starting with Haswell and Broadwell:
Just as in previous archs, “Broadwell” CPUs include the Turbo Boost feature which allows each processor core to operate well above the “base” clock speed during most operations. The precise clock speed increase depends upon the number & intensity of tasks running on each CPU. However, Turbo Boost speed increases also depend upon the types of instructions (AVX vs. Non-AVX)
Microway, Detailed Specifications of the Intel Xeon E5-2600v4 “Broadwell-EP” Processors, 2016.
Intel® Xeon® Processor Scalable Family Specification Update, March 2020.
Second Generation Intel® Xeon® Scalable Processors Specification Update, April 2020.
Xeon Scalable 6142, Skylake
freq * 2 units * 2 ops/cycle * 8 fp64 (512b) * cores
2.6 * 2 * 2 * 8 * 16 = 1331.2 GFLOPS
But the figure to use is the Base AVX-512 Core Frequency.
1.6 * 2 * 2 * 8 * 16 = 819.2 GFLOPS per chip, for a total of 209.7 TFLOPS across the whole system of 128x2 chips.
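The gap between the nominal and the AVX-512 base figures, and the whole-system total, can be reproduced with a short C sketch. The struct `chip_t`, its field names, and the two functions are hypothetical; the numbers in the usage note are the ones quoted above for the Xeon 6142.

```c
/* Per-chip parameters, as used in the slide's formula (names are assumptions). */
typedef struct {
    double nominal_ghz;      /* advertised base frequency         */
    double avx512_base_ghz;  /* Base AVX-512 Core Frequency       */
    int    fma_units;        /* vector FMA units per core         */
    int    fp64_lanes;       /* fp64 elements per 512-bit vector  */
    int    cores;            /* cores per chip                    */
} chip_t;

/* Peak fp64 GFLOPS of one chip at the given frequency;
   the factor 2 counts a fused multiply-add as two operations. */
double chip_peak_gflops(const chip_t *c, double ghz)
{
    return ghz * c->fma_units * 2 * c->fp64_lanes * c->cores;
}

/* Peak fp64 TFLOPS of a system built from `chips` identical chips. */
double system_peak_tflops(const chip_t *c, double ghz, int chips)
{
    return chip_peak_gflops(c, ghz) * chips / 1000.0;
}
```

With `chip_t xeon = {2.6, 1.6, 2, 8, 16}`, `chip_peak_gflops(&xeon, xeon.nominal_ghz)` gives the 1331.2 GFLOPS figure, `chip_peak_gflops(&xeon, xeon.avx512_base_ghz)` gives 819.2 GFLOPS, and `system_peak_tflops(&xeon, xeon.avx512_base_ghz, 128 * 2)` gives about 209.7 TFLOPS.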
The observed behavior is a sad side effect. There are many libraries that use AVX and AVX2 instructions out there, they will probably be updated to AVX-512 at some point, and users are not likely to be aware of the implementation details. If you do not require AVX-512 for some specific high performance tasks, I suggest you disable AVX-512 execution on your server or desktop, to avoid accidental AVX-512 throttling.
Vlad Krasnov, On the dangers of Intel's frequency scaling, Cloudflare, 10 Nov 2017.
Switching back to executing non-AVX instructions can take as long as 680 µs, on the order of 1 million processor cycles.
Daniel Lemire, The dangers of AVX-512 throttling: myth or reality?, 2018.
Travis Downs, Gathering Intel on Intel AVX-512 Transitions, 2020.
Processors get updated, and so do the DVFS algorithms!
$ sudo dmesg | grep -i microco
The AMD 3rd Gen Ryzen Deep Dive Review: 3700X and 3900X Raising The Bar.
(7/8): We've noticed large frequency boost behaviour changes with new motherboard firmware that was released on launch day (7/7). We are currently re-running all our test suite numbers and updating the article with the new data soon as applicable. For further details please read here.
AMD Releases New Chipset Drivers For Ryzen 3000: More Relaxed CPPC2 Upscaling.
John McCalpin on Base AVX-512 Core Frequency
The actual frequency when running compute-intensive AVX512 workloads depends on the unique characteristics of the specific piece of silicon (particularly leakage current), as well as the characteristics of the cooling system (ambient temperature, heat sink thermal conductivity, air flow rate, etc).
We have 3472 Xeon Platinum 8160 (24-core) processors in 1736 two-socket nodes. The Base AVX-512 Core Frequency for these processors is 1.4 GHz and the maximum 24-core AVX-512 frequency is 2.0 GHz. When running Intel's optimized LINPACK benchmark, we see that the average frequency of these processors varies between about 1.52 GHz and about 1.73 GHz, with sustained (LINPACK) performance varying by the same proportions.