Nicolás Wolovick, $Date: 2012-04-03 23:21:58 -0300 (Tue, 03 Apr 2012) $, $Revision: 3357 $
Getting the most out of a single core.
Depending on the application, the limits can be:
(usually memory-bound, except for showing off in benchmarks: Top500)
Usually quoted in GFLOPS.
Peak = #FPU × clock × #cores
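A quick worked instance, using the Core2 Duo T5870 figures that appear later in these slides (2 FPUs, 2.00 GHz, counting a single core):

peak = #FPU × clock × #cores = 2 × 2 GHz × 1 = 4 GFLOPS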
Single-precision floating point (f32) or double precision (f64)?
Top500 uses f64.
f64: Core i7 960 with vector operations: 102 GFLOPS.
Taken from: Alpha [Parello'02], Tegra2 [Ramirez'11], Core i7 960 [Hennessy'11], Xeon E5 2680 [guesstimated], GTX 480 [Wikipedia + 1/8 of f32 GFLOPS], Tesla C2075 [NVIDIA].
(How would you compute it?)
Taken from: Core i7 960 [Intel], Core i7 3960X [Anandtech, Intel], GTX 480 [Hennessy'11], Tesla C2075 [NVIDIA].
Increase temporal and spatial locality (cache and TLB); a small illustration follows the reference below.
Transformations that preserve the semantics.
(050 082 Übersetzerbau LU 3.0 - WS 2005/2006)
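As an illustration of spatial locality (a hypothetical micro-example, not from the original slides): C stores a[i][j] row-major, so a row-wise traversal walks memory sequentially, while a column-wise traversal strides through it.

/* Hypothetical illustration: the row-wise sum hits consecutive cache
 * lines (unit stride); the column-wise sum jumps N*sizeof(float)
 * bytes per access and misses far more often. */
#define N 1024
float a[N][N];

float sum_rows(void) {          /* fast: unit-stride accesses */
    float s = 0.0f;
    unsigned int i, j;
    for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
            s += a[i][j];
    return s;
}

float sum_cols(void) {          /* slow: stride-N accesses */
    float s = 0.0f;
    unsigned int i, j;
    for (j = 0; j < N; ++j)
        for (i = 0; i < N; ++i)
            s += a[i][j];
    return s;
}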
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil" - C.A.R. Hoare (citado por Donald Knuth)
Source Code Optimization, Felix von Leitner
"People often write less readable code because they think it will produce faster code. Unfortunately, in most cases, the code will not be faster."
The goal: give an idea of the things that are pointless to optimize by hand.
We take some code snippets from:
Agner Fog, Optimizing software in C++. An optimization guide for Windows, Linux and Mac platforms.
// Example 8.1a
float square (float a) {
    return a * a;
}
float parabola (float x) {
    return square(x) + 1.0f;
}
becomes ...
// Example 8.1b
float parabola (float x) {
    return x * x + 1.0f;
}
// Example 8.6a
int a, b, c;
b = (a+1) * (a+1);
c = (a+1) / 4;
becomes ...
// Example 8.6b
int a, b, c, temp;
temp = a+1;
b = temp * temp;
c = temp / 4;
// Example 8.11a
int SomeFunction (int a, bool b) {
    if (b) {
        a = a * 2;
    } else {
        a = a * 3;
    }
    if (b) {
        return a + 1;
    } else {
        return a - 1;
    }
}
becomes ...
// Example 8.11b
int SomeFunction (int a, bool b) {
    if (b) {
        a = a * 2;
        return a + 1;
    } else {
        a = a * 3;
        return a - 1;
    }
}
// Example 8.12a
int i, a[2];
for (i = 0; i < 2; i++)
    a[i] = i+1;
becomes ...
// Example 8.12b
int a[2];
a[0] = 1; a[1] = 2;
// Example 8.14a
int i, a[100];
for (i = 0; i < 100; i++) {
    a[i] = i * 9 + 3;
}
becomes ...
// Example 8.14b
int i, a[100], temp;
temp = 3;
for (i = 0; i < 100; i++) {
    a[i] = temp;
    temp += 9;
}
// Example 8.15a
struct S1 {double a; double b;};
S1 list[100]; int i;
for (i = 0; i < 100; i++) {
    list[i].a = 1.0;
    list[i].b = 2.0;
}
becomes ...
// Example 8.15b
struct S1 {double a; double b;};
S1 list[100], *temp;
for (temp = &list[0]; temp < &list[100]; temp++) {
    temp->a = 1.0;
    temp->b = 2.0;
}
How many times have we done this, just to become the designated obfuscator of the course/team?
Here you must confess everything you did to optimize (uselessly).
"Any sufficiently advanced (compiler) technology is indistinguishable from magic." -- A.C. Clarke
clang looks reasonable on X86_64, but on armv7 it is much worse than gcc. (nvcc, GF100)
gcc-4.6 added IPO.
icc to show off Bulldozer?
gcc is quite good.
llvm (and its front-end clang) have a lot of future ahead.
icc, gcc with source-code transformations.
Let's start with the last one ...
float a[L][L] __attribute__((aligned(64))),
      b[L] __attribute__((aligned(64))),
      c[L] __attribute__((aligned(64)));
It has no effect on mtxvector.
Originally the arrays were 32-byte aligned.
Use posix_memalign() for dynamic memory; a sketch follows.
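A minimal sketch of posix_memalign() usage (a hypothetical example, not from the original slides; the 64-byte alignment matches the attributes above):

#include <stdlib.h>

#define L (1<<11)

int main(void) {
    float *a = NULL;
    /* alignment must be a power of two and a multiple of sizeof(void *);
       posix_memalign returns 0 on success */
    if (0 != posix_memalign((void **)&a, 64, (size_t)L * L * sizeof(float)))
        return 1;
    /* ... use a[i*L + j] as a 64-byte-aligned LxL matrix ... */
    free(a);
    return 0;
}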
Computational intensity: q = flops/memaccess.
1 #define L (1<<11)
2
3 float a[L][L], b[L][L];
4
5 int main(void) {
6 unsigned int i=0, j=0;
7 for (i=0; i<L; ++i)
8 for (j=0; j<L; ++j)
9 b[j][i] = a[i][j];
10 return (int)b[(int)b[0][2]][(int)b[2][0]];
11 }
gcc -O3
Performance counter stats for './mtxtransp1' (4 runs):

25,288,166 instructions:u # 0.07 insns per cycle ( +- 0.00% )
383,586,253 cycles:u # 0.000 GHz ( +- 0.31% )

0.193846106 seconds time elapsed ( +- 0.41% )
Intel Core2 Duo T5870@2.00GHz, gcc-4.6.3, Linux 3.2.0 x86_64
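These counters were presumably collected with something like `perf stat -r 4 -e instructions:u,cycles:u ./mtxtransp1` (an assumption; the exact invocation is not in the slides).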
We use cache blocking, a.k.a. loop tiling.
1 #define L (1<<11)
2
3 const unsigned int BX=1<<4;
4 const unsigned int BY=1<<4;
5
6 float a[L][L], b[L][L];
7
8 int main(void) {
9 unsigned int i=0, j=0, k=0, l=0;
10 assert(0==L%BX && 0==L%BY);
11 for (i=0; i<L; i+=BY)
12 for (j=0; j<L; j+=BX)
13 for (k=i; k<i+BY; ++k)
14 for (l=j; l<j+BX; ++l)
15 b[l][k] = a[k][l];
16 return (int)b[(int)b[0][2]][(int)b[2][0]];
17 }
gcc -O3
Performance counter stats for './mtxtransp2' (4 runs):

29,497,644 instructions:u # 0.53 insns per cycle ( +- 0.00% )
56,052,902 cycles:u # 0.000 GHz ( +- 0.31% )

0.043328137 seconds time elapsed ( +- 1.19% )
Cycles for different block sizes (exponents of the two block dimensions):

3,3: 61,416,022
4,4: 57,070,272
5,5: 57,405,419
6,6: 60,445,567

4,3: 75,513,590
3,4: 35,257,018
3,5: 39,157,456
3,6: 35,540,453

2,7: 31,574,525
2,8: 30,184,926
From 0.19s we get down to 0.032s.
clang-3.0 -O3
icc-12.1.3 -fast
Performance counter stats for './mtxtransp1' (4 runs):

21,013,686 instructions:u # 0.59 insns per cycle ( +- 0.00% )
35,521,518 cycles:u # 0.000 GHz ( +- 0.02% )

0.034163702 seconds time elapsed ( +- 2.08% )

Performance counter stats for './mtxtransp2' (4 runs):

18,006,352 instructions:u # 0.21 insns per cycle ( +- 0.00% )
84,799,122 cycles:u # 0.000 GHz ( +- 0.06% )

0.057742278 seconds time elapsed
clang-3.0 -O3 achieves results similar to gcc for the naïve version and improves on the optimized one.

c_i = Σ_j a_ij * b_j
Computational intensity: q = flops/memaccess = 2L/(2L+1) ≈ 1 (per row: 2L flops, one multiply and one add per element, against 2L+1 memory accesses).
1 #define L (1<<13)
2
3 float a[L][L], b[L], c[L];
4
5 int main(void) {
6 unsigned int i=0, j=0;
7 for (i=0; i<L; ++i)
8 for (j=0; j<L; ++j)
9 c[i] += a[i][j] * b[j];
10 return (int)c[(int)c[0]];
11 }
gcc -O3
Performance counter stats for './mtxvector1' (4 runs):

402,909,892 instructions:u # 1.58 insns per cycle ( +- 0.00% )
255,084,724 cycles:u # 0.000 GHz ( +- 0.01% )

0.141779046 seconds time elapsed ( +- 0.59% )
We have a superscalar processor, and we're actually using it!
1 #define L (1<<13)
2
3 float a[L][L], b[L], c[L];
4
5 int main(void) {
6 unsigned int i=0, j=0;
7 for (i=0; i<L; ++i) {
8 float s = 0.0f;
9 for (j=0; j<L; ++j)
10 s += a[i][j] * b[j];
11 c[i] = s;
12 }
13 return (int)c[(int)c[0]];
14 }
1 #define L (1<<13)
2
3 float a[L][L], b[L], c[L];
4
5 int main(void) {
6 unsigned int i=0, j=0;
7 for (j=0; j<L; ++j)
8 for (i=0; i<L; ++i)
9 c[i] += a[i][j] * b[j];
10 return (int)c[(int)c[0]];
11 }
A little loop unrolling.
1 #define L (1<<13)
2
3 float a[L][L], b[L], c[L];
4
5 int main(void) {
6 unsigned int i=0, j=0;
7 assert(0==L%2);
8 for (i=0; i<L; i+=2) {
9 float s1=0.0f, s2=0.0f;
10 for (j=0; j<L; ++j) {
11 s1 += a[i][j] * b[j];
12 s2 += a[i+1][j] * b[j];
13 }
14 c[i] = s1; c[i+1] = s2;
15 }
16 return (int)c[(int)c[0]];
17 }
Performance counter stats for './mtxvector3' (4 runs):

302,221,984 instructions:u # 1.93 insns per cycle ( +- 0.00% )
156,708,757 cycles:u # 0.000 GHz ( +- 0.03% )

0.096735539 seconds time elapsed ( +- 0.79% )
I'm using both FPUs of my Core2!
1 #define L (1<<13)
2
3 const unsigned int BX=1<<4;
4
5 float a[L][L], b[L], c[L];
6
7 int main(void) {
8 unsigned int i=0, j=0, ii=0, jj=0;
9 assert(0==L%BX);
10 for (jj=0; jj<L; jj+=BX)
11 for (i=0; i<L; ++i)
12 for (j=jj; j<jj+BX; ++j)
13 c[i] += a[i][j] * b[j];
14 return (int)c[(int)c[0]];
15 }
Performance counter stats for './mtxvector4' (4 runs):

230,874,805 instructions:u # 0.92 insns per cycle ( +- 0.00% )
251,133,584 cycles:u # 0.000 GHz ( +- 0.16% )

0.140699997 seconds time elapsed ( +- 0.87% )
1 #define L (1<<13)
2
3 const unsigned int BX=1<<4;
4 const unsigned int BY=1<<4;
5
6 float a[L][L], b[L], c[L];
7
8 int main(void) {
9 unsigned int i=0, j=0, ii=0, jj=0;
10 assert(0==L%BX && 0==L%BY);
11 for (ii=0; ii<L; ii+=BY)
12 for (jj=0; jj<L; jj+=BX)
13 for (i=ii; i<ii+BY; ++i)
14 for (j=jj; j<jj+BX; ++j)
15 c[i] += a[i][j] * b[j];
16 return (int)c[(int)c[0]];
17 }
Performance counter stats for './mtxvector5' (4 runs):

356,179,132 instructions:u # 1.50 insns per cycle ( +- 0.00% )
238,165,690 cycles:u # 0.000 GHz ( +- 0.01% )

0.134618410 seconds time elapsed ( +- 1.07% )
Comparing the two blocking versions.
mtxvector4 (1D blocking):
230,874,805 instructions:u
251,133,584 cycles:u
0.92 insns per cycle

mtxvector5 (2D blocking):
356,179,132 instructions:u
238,165,690 cycles:u
1.50 insns per cycle
Note that we have 54% more instructions and an almost identical cycle count. The extra instructions are probably integer ones, and the out-of-order engine processes them in parallel with the floating-point ones.
Pitfall: Evaluating dynamic or static scheduling on the basis of unoptimized code.
"Unoptimized code is much easier to schedule than 'tight' optimized code."
"Of course the optimized program is much faster since it has fewer instructions"
icc -O3
We take the first version.
Performance counter stats for './mtxvector1' (4 runs):

84,321,106 instructions:u # 0.97 insns per cycle ( +- 0.00% )
87,227,621 cycles:u # 0.000 GHz ( +- 0.02% )

0.062100494 seconds time elapsed ( +- 0.42% )
It beat me. Let's see what it does.
..B1.3: # Preds ..B1.3 ..B1.2
movaps a(%rsi,%rdi,4), %xmm2 #12.22
addl $8, %ecx #11.3
movaps 16+a(%rsi,%rdi,4), %xmm3 #12.22
cmpl $8192, %ecx #11.3
mulps b(,%rdi,4), %xmm2 #12.22
mulps 16+b(,%rdi,4), %xmm3 #12.22
addps %xmm2, %xmm1 #12.4
addps %xmm3, %xmm0 #12.4
movl %ecx, %edi #11.3
jb ..B1.3 # Prob 99% #11.3
# LOE rdx rbx rsi rdi r12 r13 r14 r15 eax ecx xmm0 xmm1
Vector operations! movaps/mulps/addps handle 4 floats at a time, and with the two streams each iteration processes 8 elements.
gcc -O3 -ftree-vectorize -ftree-vectorizer-verbose=4
mtxvector1.c:10: note: accesses have the same alignment.
mtxvector1.c:10: note: inner step doesn't divide the vector-size.
mtxvector1.c:10: note: Unknown alignment for access: a
mtxvector1.c:10: note: inner step doesn't divide the vector-size.
mtxvector1.c:10: note: Unknown alignment for access: b
mtxvector1.c:10: note: strided access in outer loop.
mtxvector1.c:10: note: not vectorized: complicated access pattern.
mtxvector1.c:11: note: not vectorized: unsupported use in stmt.
mtxvector1.c:8: note: vectorized 0 loops in function.
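A hedged sketch of a version the vectorizer is happier with (my assumption, not verified on gcc-4.6): accumulate into a scalar as in mtxvector2, align the arrays, and allow FP-reduction reordering with something like gcc -O3 -ftree-vectorize -ffast-math (reordering a floating-point reduction changes rounding, so gcc refuses it without -ffast-math or -fassociative-math).

#define L (1<<13)

/* 16-byte alignment so aligned (movaps-style) loads are possible */
float a[L][L] __attribute__((aligned(16)));
float b[L]    __attribute__((aligned(16)));
float c[L]    __attribute__((aligned(16)));

int main(void) {
    unsigned int i=0, j=0;
    for (i=0; i<L; ++i) {
        float s = 0.0f;          /* reduction into a register */
        for (j=0; j<L; ++j)
            s += a[i][j] * b[j]; /* unit stride over a[i] and b */
        c[i] = s;
    }
    return (int)c[(int)c[0]];
}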
Computational intensity: q = flops/memaccess = O(L) (2L³ flops over O(L²) data, given enough reuse).
1 #define L (1<<10)
2
3 float a[L][L], b[L][L], c[L][L];
4
5 int main(void) {
6 unsigned int i=0, j=0, k=0;
7 for (i=0; i<L; ++i)
8 for (j=0; j<L; ++j)
9 for (k=0; k<L; ++k)
10 c[i][j] += a[i][k] * b[k][j];
11 return (int)c[(int)c[0][2]][(int)c[2][0]];
12 }
ijk
Performance counter stats for './mtxmtx1' (4 runs):
2,687,106,265 instructions:u # 0.88 insns per cycle ( +- 0.00% )
3,049,920,233 cycles:u # 0.000 GHz ( +- 0.03% )
1.428424749 seconds time elapsed ( +- 0.39% )
ikj
Performance counter stats for './mtxmtx2' (4 runs):
1,887,568,663 instructions:u # 2.43 insns per cycle ( +- 0.00% )
775,561,712 cycles:u # 0.000 GHz ( +- 0.01% )
0.367354745 seconds time elapsed ( +- 0.56% )
1 #define L (1<<10)
2 #define T (1<<3)
3
4 float a[L][L], b[L][L], c[L][L];
5
6 int main(void) {
7 unsigned int i=0, j=0, k=0, ii=0, jj=00, kk=00;
8 assert(0==L%T);
9 for (ii=0; ii<L; ii+=T)
10 for (jj=0; jj<L; jj+=T)
11 for (kk=0; kk<L; kk+=T)
12 for (i=ii; i<ii+T; ++i)
13 for(j=jj; j<jj+T; ++j)
14 for (k=kk; k<kk+T; ++k)
15 c[i][j] += a[i][k] * b[k][j];
16 return (int)c[(int)c[0][2]][(int)c[2][0]];
17 }
gcc -O3
Performance counter stats for './mtxmtx4' (4 runs):

4,499,104,571 instructions:u # 1.88 insns per cycle ( +- 0.00% )
2,393,754,604 cycles:u # 0.000 GHz ( +- 0.07% )
1.118082160 seconds time elapsed ( +- 0.29% )
Something is wrong ... it should do much better. (We discussed this in class and found good explanations: 1. the ikj order lets gcc vectorize; 2. as [Parello'02] says, good blocking alone only buys you ~17% improvement.)
icc -O3
Performance counter stats for './mtxmtx1' (4 runs):

1,784,956,060 instructions:u # 2.57 insns per cycle ( +- 0.00% )
693,543,630 cycles:u # 0.000 GHz ( +- 0.01% )

0.331259339 seconds time elapsed ( +- 0.17% )
icc -fast
Performance counter stats for './mtxmtx1' (4 runs):

947,974,778 instructions:u # 2.26 insns per cycle ( +- 0.00% )
419,740,629 cycles:u # 0.000 GHz ( +- 0.00% )

0.203769402 seconds time elapsed ( +- 0.97% )
* Peak performance: 2 GHz, 2 FP units: **4 GFLOPS**.
* Best with `gcc`: (2*(1024**3) / 0.36) * 1e-9 = **5.96 GFLOPS**.
* Best with `icc`: (2*(1024**3) / 0.20) * 1e-9 = **10.73 GFLOPS**.
* Both exceed the scalar peak because SSE operates on 4 floats per instruction.
(CS5220, David Bindel, Lecture 2: Tiling matrix-matrix multiply, code tuning)
Core2, 10 GFLOPS peak performance. We are at 20% of peak.
(CS5220, David Bindel, Lecture 2: Tiling matrix-matrix multiply, code tuning)
Core2, 10 GFLOPS peak performance using both cores. The vendor library is at 70% of peak.
sgemm and dgemm from BLAS3.
"We have also shown that cache tiling, on which a large share of research works focus, only accounts for 12% of the total performance improvement ..."
(SDSC, CS260, Matrix Matrix Multiply)
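For comparison with the hand-tiled versions above, a minimal sketch of deferring to a tuned BLAS instead (an assumption: it presumes a CBLAS implementation such as ATLAS or MKL is installed and linked, e.g. with -lcblas):

/* C = 1.0*A*B + 0.0*C with single-precision sgemm from BLAS3 */
#include <cblas.h>

#define L (1<<10)

float a[L][L], b[L][L], c[L][L];

int main(void) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                L, L, L,            /* M, N, K */
                1.0f, &a[0][0], L,  /* alpha, A, lda */
                &b[0][0], L,        /* B, ldb */
                0.0f, &c[0][0], L); /* beta, C, ldc */
    return (int)c[(int)c[0][2]][(int)c[2][0]];
}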
(Fig 1.15, Eijkhout, Intro ...)
(Fig 1.16, Eijkhout, Intro ...)
(CS5220, David Bindel, Lecture 2: Tiling matrix-matrix multiply, code tuning)
"Therefore my best advice is to avoid loop unrolling. An unrolled loop takes more space in the μop cache, and the advantage of loop unrolling is minimal."
(8.16, Bottlenecks in Sandy Bridge, The microarchitecture of ..., Agner Fog)
"Note that we are not making any absolute predictions on code performance for these implementations, or even relative comparison of their runtimes. Such predictions are very hard to make. However, the above discussion identifies issues that are relevant for a wide range of classical CPUs."
(p.33, Eijkhout, Intro ...)
Vector operations.
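A hypothetical teaser of what that looks like with SSE intrinsics (a sketch, not part of these slides):

#include <xmmintrin.h>

/* multiply two float arrays 4 elements at a time; assumes n%4 == 0
   and 16-byte-aligned pointers */
void vmul(float *c, const float *a, const float *b, unsigned int n) {
    unsigned int i;
    for (i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);   /* aligned load of 4 floats */
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(c + i, _mm_mul_ps(va, vb));
    }
}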