Data: [about the Borg Queen] She brought me closer to humanity than I ever thought possible. And for a time, I was tempted by her offer.
Picard: How long a time?
Data: 0.68 seconds, sir. For an android, that is nearly an eternity.
It's like I'm reading a book… and it's a book I deeply love. But I'm reading it slowly now. So the words are really far apart and the spaces between the words are almost infinite. I can still feel you… and the words of our story… but it's in this endless space between the words that I'm finding myself now. It's a place that's not of the physical world. It's where everything else is that I didn't even know existed. I love you so much. But this is where I am now. And this is who I am now. And I need you to let me go. As much as I want to, I can't live in your book any more.
Wulf, McKee, Hitting the Memory Wall: Implications of the Obvious, 1995.
if the microprocessor/memory performance gap continues to grow at a similar rate, in 10-15 years each memory access will cost, on average, tens or even hundreds of processor cycles. Under each scenario, system speed is dominated by memory performance.
(Memory Wall -- HPBD 070809 HIGH PERFORMANCE COMPUTING - WIKI, Dell Inc.)
E. Bolotin et al., Designing Efficient Heterogeneous Memory Architectures, IEEE Micro 35(4), June 2015.
Computation Structures, MIT 6.004.
Hennessy, Patterson, Computer Architecture ...
Typically 8- or 16-way; higher associativity does not pay off.
Much more than a single word (4 or 8 bytes) is brought in.
Typically 64 bytes.
This favors spatial locality.
unsigned int i = 0, s = 0;
for (i=s=0; i<size; ++i) {
    s += a[i];
}
Hennessy, Patterson, Computer Architecture ...
n-way comparison in parallel.
Least-recently-used (LRU) replacement strategy, or some approximation of it.
This favors temporal locality.
unsigned int i = 0, s = 0;
for (i=s=0; i<size; ++i) {
    s += a[i];
}
Two strategies for writes: write-through and write-back.
Two strategies on a write miss: write-allocate and no-write-allocate.
Note: current processors also use write-combining buffers.
Computation Structures, MIT 6.004.
Amir Kleen et al., Optimizing for instruction caches, EE Times, 2007.
What if I told you that iterating through a linked list is actually O(N√N) and hash lookups are O(√N)?
E. Ernerfeldt, The Myth of RAM, part I, 2014.
How do we fit everything into the cache?
A. Fredriksson, Cold, Hard Cache: Insomniac's Cache Simulator, GDC, 2017.
cachegrind
$ valgrind --tool=cachegrind ./a.out
$ valgrind --tool=lackey --trace-mem=yes pi 100
Parallelism is going to change everything!
A. Fredriksson, Cold, Hard Cache: Insomniac's Cache Simulator, GDC, 2017.
Every memory access consults the page table.
Hennessy, Patterson, Computer Architecture ...
A fully associative cache.
"Opteron TLB", Hennessy, Patterson, Computer Architecture ...
There are now multi-level TLBs: one at the L1 level and another at L2.
nicolasw@zx81:~$ grep Huge /proc/meminfo
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
nicolasw@zx81:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
nicolasw@zx81:~$ sudo echo "always" > /sys/kernel/mm/transparent_hugepage/enabled
nicolasw@zx81:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
nicolasw@zx81:~$ grep Huge /proc/meminfo
AnonHugePages:      4096 kB
Hugepagesize:       2048 kB
DGEMM
root@zx81:~# grep Huge /proc/meminfo
AnonHugePages:    184320 kB
Hugepagesize:       2048 kB
madvise()
NAME
       madvise - give advice about use of memory

SYNOPSIS
       #include <sys/mman.h>
       int madvise(void *addr, size_t length, int advice);
       ...
       MADV_HUGEPAGE (since Linux 2.6.38)
              Enable Transparent Huge Pages (THP) for pages in the range
              specified by addr and length.
ia32e
Four levels of indirection: 4 memory accesses for every 1 real access.
They are one of the few things that are actually worth doing:
Nicolás Wolovick 20200401
When we face a scenario where memory accesses are prohibitively expensive, the best strategy is to access memory as little as possible.
Is double needed, or does float suffice?

The "C" problem.
void updatePtrs(int *ptrA, int *ptrB, int *val) { // can be optimized
    *ptrA += *val;
    *ptrB += *val;
}


struct node {
    struct node *next, *prev;
};

void foo(struct node *n) { // cannot be optimized
    n->next->prev->next = n;
    n->next->next->prev = n;
}
We use the restrict keyword:
void updatePtrs(int * restrict ptrA, int * restrict ptrB,
                int * restrict val) {
And now the compiler can do what one would expect.
We can ask the compiler to lay things out aligned (GNU extension).
float a[L][L] __attribute__((aligned(64))),
      b[L]    __attribute__((aligned(64))),
      c[L]    __attribute__((aligned(64)));
Use aligned_alloc() for dynamic memory.
An example, thinking in terms of 64-byte cache lines.
#include <stdlib.h>
double *a = aligned_alloc(64, N*N*sizeof(double));
It does not seem to have any effect either.
In general, with modern compilers and memory-management libraries, everything is already aligned to the cache line size (CLS).
Do not rule it out, though; always confirm that the alignment is OK.
Keep in mind:
struct point {
    float dx, dy, dz, dw;
};
struct point p[N];

for (int i=0; i<N; ++i) {
    dist = sqrtf(p[i].dx*p[i].dx + p[i].dy*p[i].dy + p[i].dz*p[i].dz + p[i].dw*p[i].dw);
}
struct p {
    float dx[N], dy[N], dz[N], dw[N];
};
struct p p;

for (int i=0; i<N; ++i) {
    dist = sqrtf(p.dx[i]*p.dx[i] + p.dy[i]*p.dy[i] + p.dz[i]*p.dz[i] + p.dw[i]*p.dw[i]);
}
Think about the dynamics of the execution and which layout favors spatial locality.
Sequential vs. parallel dynamics (more on this later).
q = flops / memory accesses = 0 / (2L²) = 0
Memory-bound 😓
#define L (1<<11)

float a[L][L], b[L][L];

int main(void) {
    unsigned int i=0, j=0;
    for (i=0; i<L; ++i)
        for (j=0; j<L; ++j)
            b[j][i] = a[i][j];
    return (int)b[(int)b[0][2]][(int)b[2][0]];
}
$ gcc-8 -O3 mtxtransp1.c && perf stat -r 16 -e instructions,cycles,cache-references,cache-misses ./a.out

Performance counter stats for './a.out' (16 runs):

      24,518,383 instructions      # 0.04 insn per cycle            ( +- 1.46% ) (74.72%)
     589,419,748 cycles                                             ( +- 0.32% ) (74.86%)
       5,605,508 cache-references                                   ( +- 0.32% ) (74.83%)
       4,468,003 cache-misses      # 79.707 % of all cache refs     ( +- 0.28% ) (50.31%)

     0.228426360 seconds time elapsed                               ( +- 1.93% )
perf clearly shows the problem.

Intel Core2 Duo P8800@2.66GHz, gcc-8.0.1, Linux 4.15.11 x86_64
We apply cache blocking, a.k.a. loop tiling, by hand.
#include <assert.h>

#define L (1<<11)

const unsigned int BX=1<<4;
const unsigned int BY=1<<4;

float a[L][L], b[L][L];

int main(void) {
    unsigned int i=0, j=0, k=0, l=0;
    assert(0==L%BX && 0==L%BY);
    for (i=0; i<L; i+=BY)
        for (j=0; j<L; j+=BX)
            for (k=i; k<i+BY; ++k)
                for (l=j; l<j+BX; ++l)
                    b[l][k] = a[k][l];
    return (int)b[(int)b[0][2]][(int)b[2][0]];
}
$ gcc-8 -O3 mtxtransp2.c && perf stat -r 16 -e instructions,cycles,cache-references,cache-misses ./a.out

Performance counter stats for './a.out' (16 runs):

      40,650,922 instructions      # 0.35 insn per cycle            ( +- 0.69% ) (72.24%)
     114,607,354 cycles                                             ( +- 0.80% ) (72.54%)
       4,722,970 cache-references                                   ( +- 1.68% ) (77.67%)
         542,791 cache-misses      # 11.493 % of all cache refs     ( +- 1.97% ) (49.79%)

     0.044948155 seconds time elapsed                               ( +- 1.14% )
Speedup: 0.22s/0.045s = 4.88x
clap, clap, clap ...
Intel Core2 Duo P8800@2.66GHz, gcc-8.0.1, Linux 4.15.11 x86_64
"Any sufficiently advanced (compiler) technology is indistinguishable from magic.", Jorge Luis Borges, circa 1971.
GCC Manual, 3.10 Options That Control Optimization, GNU, 2018.
-ftree-loop-linear
-floop-strip-mine
-floop-block

Perform loop nest optimizations. Same as -floop-nest-optimize. To use this code transformation, GCC has to be configured with --with-isl to enable the Graphite loop transformation infrastructure.
Let's see how it goes ...
$ rm a.out; gcc-8 -O3 -floop-block mtxtransp1.c && perf stat -r 16 -e instructions,cycles,cache-references,cache-misses ./a.out

Performance counter stats for './a.out' (16 runs):

      92,203,819 instructions      # 0.96 insn per cycle            ( +- 0.56% ) (70.52%)
      95,723,528 cycles                                             ( +- 0.63% ) (78.14%)
       4,042,367 cache-references                                   ( +- 0.25% ) (78.24%)
         458,531 cache-misses      # 11.343 % of all cache refs     ( +- 3.31% ) (43.62%)

     0.037237697 seconds time elapsed                               ( +- 0.45% )
All runs with gcc-8, except Eulogia with gcc-7.4.
-O3 mtxtransp1.c                0.210
-O3 -floop-block mtxtransp1.c   0.037
-O3 mtxtransp2.c                0.045
-O3 -floop-block mtxtransp2.c   0.045
-O3 mtxtransp1.c                0.053
-O3 -floop-block mtxtransp1.c   0.033
-O3 mtxtransp2.c                0.032
-O3 -floop-block mtxtransp2.c   0.032
-O3 mtxtransp1.c                0.200
-O3 -floop-block mtxtransp1.c   0.130
-O3 mtxtransp2.c                0.141
-O3 -floop-block mtxtransp2.c   0.142
"Each machine is a world of its own.", JLB, circa 1962.
P8800, gcc-4.7.3, Linux 3.8.0 (plata in 2016)
From 0.21 s we got down to 0.031 s: a 6.77x speedup!
with gcc-8 and linux-4.15 (plata today)
with gcc-8 and linux-4.15 (jupiterace today)
with gcc-7 and linux-3.10 (Eulogia today)