Nicolás Wolovick, $Date: 2012-04-03 23:21:58 -0300 (Tue, 03 Apr 2012) $, $Revision: 3357 $
Getting the most out of a single core.
Depending on the application, the limits can be:
(usually memory-bound, except for showing off in benchmarks: Top500)
Usually quoted in GFLOPS.
Peak = #FPU × clock × #cores
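A quick worked instance, using the Core2 Duo T5870 figures that appear later in these slides (2 FPUs, 2.00 GHz, counting a single core):

peak = #FPU × clock × #cores = 2 × 2 GHz × 1 = 4 GFLOPS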
Single-precision floating point (f32) or double precision (f64)?
Top500 uses f64.
f64: Core i7 960 with vector operations: 102 GFLOPS.
Taken from: Alpha [Parello'02], Tegra2 [Ramirez'11], Core i7 960 [Hennessy'11], Xeon E5 2680 [guesstimated], GTX 480 [Wikipedia + 1/8 of f32 GFLOPS], Tesla C2075 [NVIDIA].
(How would you compute it?)
Taken from: Core i7 960 [Intel], Core i7 3960X [Anandtech, Intel], GTX 480 [Hennessy'11], Tesla C2075 [NVIDIA].
Increase temporal and spatial locality (cache and TLB); a small illustration follows the reference below.
Transformations that preserve the semantics.
(050 082 Übersetzerbau LU 3.0 - WS 2005/2006)
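As an illustration of spatial locality (a hypothetical micro-example, not from the original slides): C stores a[i][j] row-major, so a row-wise traversal walks memory sequentially, while a column-wise traversal strides through it.

/* Hypothetical illustration: the row-wise sum hits consecutive cache
 * lines (unit stride); the column-wise sum jumps N*sizeof(float)
 * bytes per access and misses far more often. */
#define N 1024
float a[N][N];

float sum_rows(void) {          /* fast: unit-stride accesses */
    float s = 0.0f;
    unsigned int i, j;
    for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
            s += a[i][j];
    return s;
}

float sum_cols(void) {          /* slow: stride-N accesses */
    float s = 0.0f;
    unsigned int i, j;
    for (j = 0; j < N; ++j)
        for (i = 0; i < N; ++i)
            s += a[i][j];
    return s;
}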
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil" - C.A.R. Hoare (citado por Donald Knuth)
Source Code Optimization, Felix von Leitner
"People often write less readable code because they think it will produce faster code. Unfortunately, in most cases, the code will not be faster."
The goal: give an idea of the things that are pointless to optimize by hand.
We take some code snippets from:
Agner Fog, Optimizing software in C++. An optimization guide for Windows, Linux and Mac platforms.
// Example 8.1a
float square (float a) {
    return a * a;
}
float parabola (float x) {
    return square(x) + 1.0f;
}
becomes ...
// Example 8.1b
float parabola (float x) {
    return x * x + 1.0f;
}
// Example 8.6a
int a, b, c;
b = (a+1) * (a+1);
c = (a+1) / 4;
becomes ...
// Example 8.6b
int a, b, c, temp;
temp = a+1;
b = temp * temp;
c = temp / 4;
// Example 8.11a
int SomeFunction (int a, bool b) {
    if (b) {
        a = a * 2;
    } else {
        a = a * 3;
    }
    if (b) {
        return a + 1;
    } else {
        return a - 1;
    }
}
becomes ...
// Example 8.11b
int SomeFunction (int a, bool b) {
    if (b) {
        a = a * 2;
        return a + 1;
    } else {
        a = a * 3;
        return a - 1;
    }
}
// Example 8.12a
int i, a[2];
for (i = 0; i < 2; i++)
    a[i] = i+1;
becomes ...
// Example 8.12b
int a[2];
a[0] = 1; a[1] = 2;
// Example 8.14a
int i, a[100];
for (i = 0; i < 100; i++) {
    a[i] = i * 9 + 3;
}
becomes ...
// Example 8.14b
int i, a[100], temp;
temp = 3;
for (i = 0; i < 100; i++) {
    a[i] = temp;
    temp += 9;
}
// Example 8.15a
struct S1 {double a; double b;};
S1 list[100]; int i;
for (i = 0; i < 100; i++) {
    list[i].a = 1.0;
    list[i].b = 2.0;
}
becomes ...
// Example 8.15b
struct S1 {double a; double b;};
S1 list[100], *temp;
for (temp = &list[0]; temp < &list[100]; temp++) {
    temp->a = 1.0;
    temp->b = 2.0;
}
How many times have we done this, just to become the designated obfuscator of the course/team?
Here you must confess everything you did to optimize (uselessly).
"Any sufficiently advanced (compiler) technology is indistinguishable from magic." -- A.C. Clarke
clang looks reasonable on X86_64, but on armv7 it is much worse than gcc. (nvcc, GF100)
gcc-4.6 added IPO.
icc to show off Bulldozer?
gcc is quite good.
llvm (and its front-end clang) have a lot of future ahead.
icc, gcc with source-code transformations.
Let's start with the last one ...
float a[L][L] __attribute__((aligned(64))),
      b[L] __attribute__((aligned(64))),
      c[L] __attribute__((aligned(64)));
It has no effect on mtxvector.
Originally the arrays were 32-byte aligned.
Use posix_memalign() for dynamic memory; a sketch follows.
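A minimal sketch of posix_memalign() usage (a hypothetical example, not from the original slides; the 64-byte alignment matches the attributes above):

#include <stdlib.h>

#define L (1<<11)

int main(void) {
    float *a = NULL;
    /* alignment must be a power of two and a multiple of sizeof(void *);
       posix_memalign returns 0 on success */
    if (0 != posix_memalign((void **)&a, 64, (size_t)L * L * sizeof(float)))
        return 1;
    /* ... use a[i*L + j] as a 64-byte-aligned LxL matrix ... */
    free(a);
    return 0;
}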
Computational intensity: q = flops/memaccess.
1 #define L (1<<11)
2
3 float a[L][L], b[L][L];
4
5 int main(void) {
6 unsigned int i=0, j=0;
7 for (i=0; i<L; ++i)
8 for (j=0; j<L; ++j)
9 b[j][i] = a[i][j];
10 return (int)b[(int)b[0][2]][(int)b[2][0]];
11 }
gcc -O3
Performance counter stats for './mtxtransp1' (4 runs):

25,288,166 instructions:u # 0.07 insns per cycle ( +- 0.00% )
383,586,253 cycles:u # 0.000 GHz ( +- 0.31% )

0.193846106 seconds time elapsed ( +- 0.41% )
Intel Core2 Duo T5870@2.00GHz, gcc-4.6.3, Linux 3.2.0 x86_64
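These counters were presumably collected with something like `perf stat -r 4 -e instructions:u,cycles:u ./mtxtransp1` (an assumption; the exact invocation is not in the slides).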
We use cache blocking, a.k.a. loop tiling.
1 #define L (1<<11)
2
3 const unsigned int BX=1<<4;
4 const unsigned int BY=1<<4;
5
6 float a[L][L], b[L][L];
7
8 int main(void) {
9 unsigned int i=0, j=0, k=0, l=0;
10 assert(0==L%BX && 0==L%BY);
11 for (i=0; i<L; i+=BY)
12 for (j=0; j<L; j+=BX)
13 for (k=i; k<i+BY; ++k)
14 for (l=j; l<j+BX; ++l)
15 b[l][k] = a[k][l];
16 return (int)b[(int)b[0][2]][(int)b[2][0]];
17 }
gcc -O3
Performance counter stats for './mtxtransp2' (4 runs):

29,497,644 instructions:u # 0.53 insns per cycle ( +- 0.00% )
56,052,902 cycles:u # 0.000 GHz ( +- 0.31% )

0.043328137 seconds time elapsed ( +- 1.19% )
Cycles for different block sizes (exponents of the two block dimensions):

3,3: 61,416,022
4,4: 57,070,272
5,5: 57,405,419
6,6: 60,445,567

4,3: 75,513,590
3,4: 35,257,018
3,5: 39,157,456
3,6: 35,540,453

2,7: 31,574,525
2,8: 30,184,926
From 0.19s we get down to 0.032s.
clang-3.0 -O3
icc-12.1.3 -fast
Performance counter stats for './mtxtransp1' (4 runs):

21,013,686 instructions:u # 0.59 insns per cycle ( +- 0.00% )
35,521,518 cycles:u # 0.000 GHz ( +- 0.02% )

0.034163702 seconds time elapsed ( +- 2.08% )

Performance counter stats for './mtxtransp2' (4 runs):

18,006,352 instructions:u # 0.21 insns per cycle ( +- 0.00% )
84,799,122 cycles:u # 0.000 GHz ( +- 0.06% )

0.057742278 seconds time elapsed
clang-3.0 -O3 achieves results similar to gcc for the naïve version and improves on the optimized one.

c_i = Σ_j a_ij * b_j
Computational intensity: q = flops/memaccess = 2L/(2L+1) ≈ 1 (per row: 2L flops, one multiply and one add per element, against 2L+1 memory accesses).
1 #define L (1<<13)
2
3 float a[L][L], b[L], c[L];
4
5 int main(void) {
6 unsigned int i=0, j=0;
7 for (i=0; i<L; ++i)
8 for (j=0; j<L; ++j)
9 c[i] += a[i][j] * b[j];
10 return (int)c[(int)c[0]];
11 }
gcc -O3
Performance counter stats for './mtxvector1' (4 runs):

402,909,892 instructions:u # 1.58 insns per cycle ( +- 0.00% )
255,084,724 cycles:u # 0.000 GHz ( +- 0.01% )

0.141779046 seconds time elapsed ( +- 0.59% )
We have a superscalar processor, and we're actually using it!
1 #define L (1<<13)
2
3 float a[L][L], b[L], c[L];
4
5 int main(void) {
6 unsigned int i=0, j=0;
7 for (i=0; i<L; ++i) {
8 float s = 0.0f;
9 for (j=0; j<L; ++j)
10 s += a[i][j] * b[j];
11 c[i] = s;
12 }
13 return (int)c[(int)c[0]];
14 }
1 #define L (1<<13)
2
3 float a[L][L], b[L], c[L];
4
5 int main(void) {
6 unsigned int i=0, j=0;
7 for (j=0; j<L; ++j)
8 for (i=0; i<L; ++i)
9 c[i] += a[i][j] * b[j];
10 return (int)c[(int)c[0]];
11 }
A little loop unrolling.
1 #define L (1<<13)
2
3 float a[L][L], b[L], c[L];
4
5 int main(void) {
6 unsigned int i=0, j=0;
7 assert(0==L%2);
8 for (i=0; i<L; i+=2) {
9 float s1=0.0f, s2=0.0f;
10 for (j=0; j<L; ++j) {
11 s1 += a[i][j] * b[j];
12 s2 += a[i+1][j] * b[j];
13 }
14 c[i] = s1; c[i+1] = s2;
15 }
16 return (int)c[(int)c[0]];
17 }
Performance counter stats for './mtxvector3' (4 runs):

302,221,984 instructions:u # 1.93 insns per cycle ( +- 0.00% )
156,708,757 cycles:u # 0.000 GHz ( +- 0.03% )

0.096735539 seconds time elapsed ( +- 0.79% )
I'm using both FPUs of my Core2!
1 #define L (1<<13)
2
3 const unsigned int BX=1<<4;
4
5 float a[L][L], b[L], c[L];
6
7 int main(void) {
8 unsigned int i=0, j=0, ii=0, jj=0;
9 assert(0==L%BX);
10 for (jj=0; jj<L; jj+=BX)
11 for (i=0; i<L; ++i)
12 for (j=jj; j<jj+BX; ++j)
13 c[i] += a[i][j] * b[j];
14 return (int)c[(int)c[0]];
15 }
Performance counter stats for './mtxvector4' (4 runs):

230,874,805 instructions:u # 0.92 insns per cycle ( +- 0.00% )
251,133,584 cycles:u # 0.000 GHz ( +- 0.16% )

0.140699997 seconds time elapsed ( +- 0.87% )
1 #define L (1<<13)
2
3 const unsigned int BX=1<<4;
4 const unsigned int BY=1<<4;
5
6 float a[L][L], b[L], c[L];
7
8 int main(void) {
9 unsigned int i=0, j=0, ii=0, jj=0;
10 assert(0==L%BX && 0==L%BY);
11 for (ii=0; ii<L; ii+=BY)
12 for (jj=0; jj<L; jj+=BX)
13 for (i=ii; i<ii+BY; ++i)
14 for (j=jj; j<jj+BX; ++j)
15 c[i] += a[i][j] * b[j];
16 return (int)c[(int)c[0]];
17 }
Performance counter stats for './mtxvector5' (4 runs):

356,179,132 instructions:u # 1.50 insns per cycle ( +- 0.00% )
238,165,690 cycles:u # 0.000 GHz ( +- 0.01% )

0.134618410 seconds time elapsed ( +- 1.07% )
Comparing the two blocking versions.
mtxvector4 (1D blocking):
230,874,805 instructions:u
251,133,584 cycles:u
0.92 insns per cycle

mtxvector5 (2D blocking):
356,179,132 instructions:u
238,165,690 cycles:u
1.50 insns per cycle
Note that we have 54% more instructions and an almost identical cycle count. The extra instructions are probably integer ones, and the out-of-order engine processes them in parallel with the floating-point ones.
Pitfall: Evaluating dynamic or static scheduling on the basis of unoptimized code.
"Unoptimized code is much easier to schedule than 'tight' optimized code."
"Of course the optimized program is much faster since it has fewer instructions"
icc -O3
We take the first version.
Performance counter stats for './mtxvector1' (4 runs):

84,321,106 instructions:u # 0.97 insns per cycle ( +- 0.00% )
87,227,621 cycles:u # 0.000 GHz ( +- 0.02% )

0.062100494 seconds time elapsed ( +- 0.42% )
It beat me. Let's see what it does.
..B1.3: # Preds ..B1.3 ..B1.2
movaps a(%rsi,%rdi,4), %xmm2 #12.22
addl $8, %ecx #11.3
movaps 16+a(%rsi,%rdi,4), %xmm3 #12.22
cmpl $8192, %ecx #11.3
mulps b(,%rdi,4), %xmm2 #12.22
mulps 16+b(,%rdi,4), %xmm3 #12.22
addps %xmm2, %xmm1 #12.4
addps %xmm3, %xmm0 #12.4
movl %ecx, %edi #11.3
jb ..B1.3 # Prob 99% #11.3
# LOE rdx rbx rsi rdi r12 r13 r14 r15 eax ecx xmm0 xmm1
Vector operations! movaps/mulps/addps handle 4 floats at a time, and with the two streams each iteration processes 8 elements.
gcc -O3 -ftree-vectorize -ftree-vectorizer-verbose=4
mtxvector1.c:10: note: accesses have the same alignment.
mtxvector1.c:10: note: inner step doesn't divide the vector-size.
mtxvector1.c:10: note: Unknown alignment for access: a
mtxvector1.c:10: note: inner step doesn't divide the vector-size.
mtxvector1.c:10: note: Unknown alignment for access: b
mtxvector1.c:10: note: strided access in outer loop.
mtxvector1.c:10: note: not vectorized: complicated access pattern.
mtxvector1.c:11: note: not vectorized: unsupported use in stmt.
mtxvector1.c:8: note: vectorized 0 loops in function.
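A hedged sketch of a version the vectorizer is happier with (my assumption, not verified on gcc-4.6): accumulate into a scalar as in mtxvector2, align the arrays, and allow FP-reduction reordering with something like gcc -O3 -ftree-vectorize -ffast-math (reordering a floating-point reduction changes rounding, so gcc refuses it without -ffast-math or -fassociative-math).

#define L (1<<13)

/* 16-byte alignment so aligned (movaps-style) loads are possible */
float a[L][L] __attribute__((aligned(16)));
float b[L]    __attribute__((aligned(16)));
float c[L]    __attribute__((aligned(16)));

int main(void) {
    unsigned int i=0, j=0;
    for (i=0; i<L; ++i) {
        float s = 0.0f;          /* reduction into a register */
        for (j=0; j<L; ++j)
            s += a[i][j] * b[j]; /* unit stride over a[i] and b */
        c[i] = s;
    }
    return (int)c[(int)c[0]];
}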
Computational intensity: q = flops/memaccess = O(L) (2L³ flops over O(L²) data, given enough reuse).
1 #define L (1<<10)
2
3 float a[L][L], b[L][L], c[L][L];
4
5 int main(void) {
6 unsigned int i=0, j=0, k=0;
7 for (i=0; i<L; ++i)
8 for (j=0; j<L; ++j)
9 for (k=0; k<L; ++k)
10 c[i][j] += a[i][k] * b[k][j];
11 return (int)c[(int)c[0][2]][(int)c[2][0]];
12 }
ijk
Performance counter stats for './mtxmtx1' (4 runs):
2,687,106,265 instructions:u # 0.88 insns per cycle ( +- 0.00% )
3,049,920,233 cycles:u # 0.000 GHz ( +- 0.03% )
1.428424749 seconds time elapsed ( +- 0.39% )
ikj
Performance counter stats for './mtxmtx2' (4 runs):
1,887,568,663 instructions:u # 2.43 insns per cycle ( +- 0.00% )
775,561,712 cycles:u # 0.000 GHz ( +- 0.01% )
0.367354745 seconds time elapsed ( +- 0.56% )
1 #define L (1<<10)
2 #define T (1<<3)
3
4 float a[L][L], b[L][L], c[L][L];
5
6 int main(void) {
7 unsigned int i=0, j=0, k=0, ii=0, jj=00, kk=00;
8 assert(0==L%T);
9 for (ii=0; ii<L; ii+=T)
10 for (jj=0; jj<L; jj+=T)
11 for (kk=0; kk<L; kk+=T)
12 for (i=ii; i<ii+T; ++i)
13 for(j=jj; j<jj+T; ++j)
14 for (k=kk; k<kk+T; ++k)
15 c[i][j] += a[i][k] * b[k][j];
16 return (int)c[(int)c[0][2]][(int)c[2][0]];
17 }
gcc -O3
Performance counter stats for './mtxmtx4' (4 runs):

4,499,104,571 instructions:u # 1.88 insns per cycle ( +- 0.00% )
2,393,754,604 cycles:u # 0.000 GHz ( +- 0.07% )
1.118082160 seconds time elapsed ( +- 0.29% )
Something is wrong ... it should do much better. (We discussed this in class and found good explanations: 1. the ikj order lets gcc vectorize; 2. as [Parello'02] says, good blocking alone only buys you ~17% improvement.)
icc -O3
Performance counter stats for './mtxmtx1' (4 runs):

1,784,956,060 instructions:u # 2.57 insns per cycle ( +- 0.00% )
693,543,630 cycles:u # 0.000 GHz ( +- 0.01% )

0.331259339 seconds time elapsed ( +- 0.17% )
icc -fast
Performance counter stats for './mtxmtx1' (4 runs):

947,974,778 instructions:u # 2.26 insns per cycle ( +- 0.00% )
419,740,629 cycles:u # 0.000 GHz ( +- 0.00% )

0.203769402 seconds time elapsed ( +- 0.97% )
* Peak performance: 2 GHz, 2 FP units: **4 GFLOPS**.
* Best with `gcc`: (2*(1024**3) / 0.36) * 1e-9 = **5.96 GFLOPS**.
* Best with `icc`: (2*(1024**3) / 0.20) * 1e-9 = **10.73 GFLOPS**.
* Both exceed the scalar peak because SSE operates on 4 floats per instruction.
(CS5220, David Bindel, Lecture 2: Tiling matrix-matrix multiply, code tuning)
Core2, 10 GFLOPS peak performance. We are at 20% of peak.
(CS5220, David Bindel, Lecture 2: Tiling matrix-matrix multiply, code tuning)
Core2, 10 GFLOPS peak performance using both cores. The vendor library is at 70% of peak.
sgemm and dgemm from BLAS3.
"We have also shown that cache tiling, on which a large share of research works focus, only accounts for 12% of the total performance improvement ..."
(SDSC, CS260, Matrix Matrix Multiply)
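For comparison with the hand-tiled versions above, a minimal sketch of deferring to a tuned BLAS instead (an assumption: it presumes a CBLAS implementation such as ATLAS or MKL is installed and linked, e.g. with -lcblas):

/* C = 1.0*A*B + 0.0*C with single-precision sgemm from BLAS3 */
#include <cblas.h>

#define L (1<<10)

float a[L][L], b[L][L], c[L][L];

int main(void) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                L, L, L,            /* M, N, K */
                1.0f, &a[0][0], L,  /* alpha, A, lda */
                &b[0][0], L,        /* B, ldb */
                0.0f, &c[0][0], L); /* beta, C, ldc */
    return (int)c[(int)c[0][2]][(int)c[2][0]];
}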
(Fig 1.15, Eijkhout, Intro ...)
(Fig 1.16, Eijkhout, Intro ...)
(CS5220, David Bindel, Lecture 2: Tiling matrix-matrix multiply, code tuning)
"Therefore my best advice is to avoid loop unrolling. An unrolled loop takes more space in the μop cache, and the advantage of loop unrolling is minimal."
(8.16, Bottlenecks in Sandy Bridge, The microarchitecture of ..., Agner Fog)
"Note that we are not making any absolute predictions on code performance for these implementations, or even relative comparison of their runtimes. Such predictions are very hard to make. However, the above discussion identifies issues that are relevant for a wide range of classical CPUs."
(p.33, Eijkhout, Intro ...)
Vector operations.
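A hypothetical teaser of what that looks like with SSE intrinsics (a sketch, not part of these slides):

#include <xmmintrin.h>

/* multiply two float arrays 4 elements at a time; assumes n%4 == 0
   and 16-byte-aligned pointers */
void vmul(float *c, const float *a, const float *b, unsigned int n) {
    unsigned int i;
    for (i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);   /* aligned load of 4 floats */
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(c + i, _mm_mul_ps(va, vb));
    }
}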