Nicolás Wolovick, 2014-03-25
Getting the most out of a single core.
Depending on the application, the limits can be:
(usually memory-bound, except benchmarks made to show off: Top500)
Usually reported in GFLOPS.
#FPU x Clock x #Core
Single-precision (f32) or double-precision (f64) floating point?
The Top500 uses f64.
f64
Taken from: Alpha [Parello'02], Tegra2 [Ramirez'11], Core i7 960 [Hennessy'11], Xeon E5 2680 [guesstimated], GTX 480 [Wikipedia + 1/8 of f32 GFLOPS], Tesla C2075 [NVIDIA].
(How would one compute it?)
Taken from: Core i7 960 [Intel], Core i7 3960X [Anandtech, Intel], GTX 480 [Hennessy'11], Tesla C2075 [NVIDIA].
Arithmetic intensity: q = flops/bytes
The problem with C: pointer aliasing.
void updatePtrs(int *ptrA, int *ptrB, int *val) { // can be optimized
    *ptrA += *val;
    *ptrB += *val;
}


struct node {
    struct node *next, *prev;
};

void foo(struct node *n) { // cannot be optimized
    n->next->prev->next = n;
    n->next->next->prev = n;
}
We use the restrict keyword:
void updatePtrs(int *restrict ptrA, int *restrict ptrB,
                int *restrict val) {
And the compiler can do what we expect.
We can ask the compiler to align things (GNU extension):
float a[L][L] __attribute__((aligned(64))),
      b[L] __attribute__((aligned(64))),
      c[L] __attribute__((aligned(64)));
No effect on global variables: they were originally 32-byte aligned.
Use memalign() for dynamic memory.
Example, with 64-byte cache lines in mind.
#include <malloc.h>
double *a = memalign(64, N*N*sizeof(double));
This doesn't seem to have an effect either.
Keep in mind:
struct point {
    float dx, dy, dz, dw;
};
struct point p[N];

for (int i = 0; i < N; ++i) {
    dist = sqrtf(p[i].dx*p[i].dx + p[i].dy*p[i].dy + p[i].dz*p[i].dz + p[i].dw*p[i].dw);
}
struct p {
    float dx[N], dy[N], dz[N], dw[N];
} p;

for (int i = 0; i < N; ++i) {
    dist = sqrtf(p.dx[i]*p.dx[i] + p.dy[i]*p.dy[i] + p.dz[i]*p.dz[i] + p.dw[i]*p.dw[i]);
}
q = 0 / (2 ∗ L^2) = 0
Memory-bound ʘ_ʘ
#define L (1<<11)

float a[L][L], b[L][L];

int main(void) {
    unsigned int i=0, j=0;
    for (i=0; i<L; ++i)
        for (j=0; j<L; ++j)
            b[j][i] = a[i][j];
    return (int)b[(int)b[0][2]][(int)b[2][0]];
}
gcc -O3
perf stat -r 4 -e instructions,cycles,cache-references,cache-misses ./mtxtransp1
Performance counter stats for './mtxtransp1' (4 runs):

     39.047.948 instructions       #  0,07 insns per cycle          ( +- 0,08% )
    574.531.905 cycles             #  0,000 GHz                     ( +- 0,80% )
      4.958.588 cache-references                                    ( +- 0,10% )
      4.246.447 cache-misses       # 85,638 % of all cache refs     ( +- 0,02% )

    0,217418170 seconds time elapsed                                ( +- 0,94% )
Intel Core2 Duo P8800@2.66GHz, gcc-4.7.3, Linux 3.8.0 x86_64
We use cache blocking, a.k.a. loop tiling.
#include <assert.h>

#define L (1<<11)

const unsigned int BX=1<<4;
const unsigned int BY=1<<4;

float a[L][L], b[L][L];

int main(void) {
    unsigned int i=0, j=0, k=0, l=0;
    assert(0==L%BX && 0==L%BY);
    for (i=0; i<L; i+=BY)
        for (j=0; j<L; j+=BX)
            for (k=i; k<i+BY; ++k)
                for (l=j; l<j+BX; ++l)
                    b[l][k] = a[k][l];
    return (int)b[(int)b[0][2]][(int)b[2][0]];
}
gcc -O3
perf stat -r 4 -e instructions,cycles,cache-references,cache-misses ./mtxtransp2
Performance counter stats for './a.out' (4 runs):

     43.536.821 instructions       #  0,35 insns per cycle          ( +- 1,30% )
    124.940.642 cycles             #  0,000 GHz                     ( +- 1,02% )
      4.238.619 cache-references                                    ( +- 0,10% )
        566.880 cache-misses       # 13,374 % of all cache refs     ( +- 0,16% )

    0,050019460 seconds time elapsed                                ( +- 5,06% )
Speedup: 0.217/0.050 ≈ 4.3x
clap, clap, clap ...
Starting from 0.21 s, we get down to 0.031 s.
dgemm
dgemm arithmetic intensity: q = flops/bytes = O(L)
dgemm memory usage
dgemm.c
dgemm_unroll.c
dgemm_blocking_CoAD.c
dgemm_blocking.f
dgemm_blocking.c
dgemm_blocking_unrolling_CoAD.c
dgemm_blas.c
Analysis from:
(CS5220, David Bindel, Lecture 2: Tiling matrix-matrix multiply, code tuning)
Core2, 10 GFLOPS peak performance using both cores. The vendor library is at 70% of peak performance.
(SDSC, CS260, Matrix Matrix Multiply)
sgemm and dgemm from BLAS3.
"We have also shown that cache tiling, on which a large share of research works focus, only accounts for 12% of the total performance improvement ..."
(CS5220, David Bindel, Lecture 2: Tiling matrix-matrix multiply, code tuning)
"Therefore my best advice is to avoid loop unrolling. An unrolled loop takes more space in the μop cache, and the advantage of loop unrolling is minimal."
(8.16, Bottlenecks in Sandy Bridge, The microarchitecture of ..., Agner Fog)
"Note that we are not making any absolute predictions on code performance for these implementations, or even relative comparison of their runtimes. Such predictions are very hard to make. However, the above discussion identifies issues that are relevant for a wide range of classical CPUs."
(p.33, Eijkhout, Intro ...)