Other things that do make sense to do:
Nicolás Wolovick 20180405
libc
The problem with "C": pointer aliasing.
void updatePtrs(int *ptrA, int *ptrB, int *val) { // can be optimized
    *ptrA += *val;
    *ptrB += *val;
}

struct node {
    struct node *next, *prev;
};

void foo(struct node *n) { // cannot be optimized
    n->next->prev->next = n;
    n->next->next->prev = n;
}
We use the restrict keyword:
void updatePtrs(int * restrict ptrA, int * restrict ptrB,
                int * restrict val) {
    *ptrA += *val;
    *ptrB += *val;
}
And the compiler can do what we expect.
We can ask the compiler to lay things out aligned (GNU extension).
float a[L][L] __attribute__((aligned(64))),
      b[L]    __attribute__((aligned(64))),
      c[L]    __attribute__((aligned(64)));
Use aligned_alloc() for dynamic memory.
Example, thinking in terms of 64-byte cache lines.
#include <stdlib.h>  /* aligned_alloc() is declared here in C11 */
double *a = aligned_alloc(64, N*N*sizeof(double));
This doesn't seem to have any effect either.
Keep in mind:
struct point {
    float dx, dy, dz, dw;
};
struct point p[N];

for (int i = 0; i < N; ++i) {
    dist = sqrtf(p[i].dx*p[i].dx + p[i].dy*p[i].dy +
                 p[i].dz*p[i].dz + p[i].dw*p[i].dw);
}
struct p {
    float dx[N], dy[N], dz[N], dw[N];
};

for (int i = 0; i < N; ++i) {
    dist = sqrtf(p.dx[i]*p.dx[i] + p.dy[i]*p.dy[i] +
                 p.dz[i]*p.dz[i] + p.dw[i]*p.dw[i]);
}
Arithmetic intensity: q = flops/bytes = 0 / (2L²) = 0.
Memory-bound 😓
#define L (1<<11)

float a[L][L], b[L][L];

int main(void) {
    unsigned int i=0, j=0;
    for (i=0; i<L; ++i)
        for (j=0; j<L; ++j)
            b[j][i] = a[i][j];
    return (int)b[(int)b[0][2]][(int)b[2][0]];
}
$ gcc-8 -O3 mtxtransp1.c && perf stat -r 16 -e instructions,cycles,cache-references,cache-misses ./a.out

Performance counter stats for './a.out' (16 runs):

     24,518,383  instructions      # 0.04 insn per cycle          ( +- 1.46% ) (74.72%)
    589,419,748  cycles                                           ( +- 0.32% ) (74.86%)
      5,605,508  cache-references                                 ( +- 0.32% ) (74.83%)
      4,468,003  cache-misses      # 79.707 % of all cache refs   ( +- 0.28% ) (50.31%)

    0.228426360  seconds time elapsed                             ( +- 1.93% )
perf clearly shows the problem.
Intel Core2 Duo P8800@2.66GHz, gcc-8.0.1, Linux 4.15.11 x86_64
We apply cache blocking, a.k.a. loop tiling, by hand.
#include <assert.h>

#define L (1<<11)

const unsigned int BX=1<<4;
const unsigned int BY=1<<4;

float a[L][L], b[L][L];

int main(void) {
    unsigned int i=0, j=0, k=0, l=0;
    assert(0==L%BX && 0==L%BY);
    for (i=0; i<L; i+=BY)
        for (j=0; j<L; j+=BX)
            for (k=i; k<i+BY; ++k)
                for (l=j; l<j+BX; ++l)
                    b[l][k] = a[k][l];
    return (int)b[(int)b[0][2]][(int)b[2][0]];
}
$ gcc-8 -O3 mtxtransp2.c && perf stat -r 16 -e instructions,cycles,cache-references,cache-misses ./a.out

Performance counter stats for './a.out' (16 runs):

     40,650,922  instructions      # 0.35 insn per cycle          ( +- 0.69% ) (72.24%)
    114,607,354  cycles                                           ( +- 0.80% ) (72.54%)
      4,722,970  cache-references                                 ( +- 1.68% ) (77.67%)
        542,791  cache-misses      # 11.493 % of all cache refs   ( +- 1.97% ) (49.79%)

    0.044948155  seconds time elapsed                             ( +- 1.14% )
Speedup: 0.22s/0.045s = 4.88x
clap, clap, clap ...
Intel Core2 Duo P8800@2.66GHz, gcc-8.0.1, Linux 4.15.11 x86_64
"Any sufficiently advanced (compiler) technology is indistinguishable from magic.", Jorge Luis Borges, circa 1971.
GCC Manual, 3.10 Options That Control Optimization, GNU, 2018.
-ftree-loop-linear
-floop-strip-mine
-floop-block

Perform loop nest optimizations. Same as -floop-nest-optimize. To use this code transformation, GCC has to be configured with --with-isl to enable the Graphite loop transformation infrastructure.
Let's see how it goes ...
$ rm a.out; gcc-8 -O3 -floop-block mtxtransp1.c && perf stat -r 16 -e instructions,cycles,cache-references,cache-misses ./a.out

Performance counter stats for './a.out' (16 runs):

     92,203,819  instructions      # 0.96 insn per cycle          ( +- 0.56% ) (70.52%)
     95,723,528  cycles                                           ( +- 0.63% ) (78.14%)
      4,042,367  cache-references                                 ( +- 0.25% ) (78.24%)
        458,531  cache-misses      # 11.343 % of all cache refs   ( +- 3.31% ) (43.62%)

    0.037237697  seconds time elapsed                             ( +- 0.45% )
All runs with gcc-8, except Eulogia with gcc-7.4.

With gcc-8 and linux-4.15 (plata today):
    -O3 mtxtransp1.c                0.210
    -O3 -floop-block mtxtransp1.c   0.037
    -O3 mtxtransp2.c                0.045
    -O3 -floop-block mtxtransp2.c   0.045

With gcc-8 and linux-4.15 (jupiterace today):
    -O3 mtxtransp1.c                0.053
    -O3 -floop-block mtxtransp1.c   0.033
    -O3 mtxtransp2.c                0.032
    -O3 -floop-block mtxtransp2.c   0.032

With gcc-7 and linux-3.10 (Eulogia today):
    -O3 mtxtransp1.c                0.200
    -O3 -floop-block mtxtransp1.c   0.130
    -O3 mtxtransp2.c                0.141
    -O3 -floop-block mtxtransp2.c   0.142

"Every machine is a world", JLB, circa 1962.

P8800, gcc-4.7.3, Linux 3.8.0 (plata in 2016): from 0.21s we got down to 0.031s, a 6.77x speedup!
dgemm

Arithmetic intensity: q = flops/bytes = O(L).

Memory usage:
    dgemm.c
    dgemm_unroll.c
    dgemm_blocking_CoAD.c
    dgemm_blocking.f
    dgemm_blocking.c
    dgemm_blocking_unrolling_CoAD.c
    dgemm_blas.c
Analysis of:
CS5220, David Bindel, Lecture 2: Tiling matrix-matrix multiply, code tuning
Core2: 10 GFLOPS peak performance using both cores. The vendor library reaches 70% of peak.
Knights Landing support #991, OpenBLAS, 2018.
OpenMP-based parallel implementation of matrix-matrix multiplication on the Intel Knights Landing, 2018.
SDSC, CS260, Matrix Matrix Multiply
sgemm and dgemm from BLAS3.

"We have also shown that cache tiling, on which a large share of research works focus, only accounts for 12% of the total performance improvement ..."
CS5220, David Bindel, Lecture 2: Tiling matrix-matrix multiply, code tuning.
perf, Intel VTune, etc.
OpenBLAS, FFTW, etc.
-O3: aggressive optimization.
-march=native: tune for the specific architecture.
-ftree-vectorize: automatic use of SSE (supposedly).
-funroll-loops: loop unrolling.
-ffast-math: unsafe floating-point optimizations.
graphite, e.g. -floop-block.
-fprofile-generate, -fprofile-correction.
."Therefore my best advice is to avoid loop unrolling. An unrolled loop takes more space in the μop cache, and the advantage of loop unrolling is minimal."
(8.16, Bottlenecks in Sandy Bridge, The microarchitecture of ..., Agner Fog)
"Note that we are not making any absolute predictions on code performance for these implementations, or even relative comparison of their runtimes. Such predictions are very hard to make. However, the above discussion identifies issues that are relevant for a wide range of classical CPUs."
(p.33, Eijkhout, Intro ...)