Data: [about the Borg Queen] She brought me closer to humanity than I ever thought possible. And for a time, I was tempted by her offer.
Picard: How long a time?
Data: 0.68 seconds, sir. For an android, that is nearly an eternity.
It's like I'm reading a book… and it's a book I deeply love. But I'm reading it slowly now. So the words are really far apart and the spaces between the words are almost infinite. I can still feel you… and the words of our story… but it's in this endless space between the words that I'm finding myself now. It's a place that's not of the physical world. It's where everything else is that I didn't even know existed. I love you so much. But this is where I am now. And this is who I am now. And I need you to let me go. As much as I want to, I can't live in your book any more.
Wulf, McKee, Hitting the Memory Wall: Implications of the Obvious, 1995.
if the microprocessor/memory performance gap continues to grow at a similar rate, in 10-15 years each memory access will cost, on average, tens or even hundreds of processor cycles. Under each scenario, system speed is dominated by memory performance.
(Memory Wall -- HPBD 070809 HIGH PERFORMANCE COMPUTING - WIKI, Dell Inc.)
E. Bolotin et al., Designing Efficient Heterogeneous Memory Architectures, IEEE Micro 35(4), June 2015.
Computation Structures, MIT 6.004.
Hennessy, Patterson, Computer Architecture ...
Typically 8- or 16-way; higher associativity does not pay off.
Much more than a single word (4 or 8 bytes) is brought in.
Typically 64 bytes.
This favors spatial locality.
unsigned int i = 0, s = 0;
for (i=s=0; i<size; ++i) {
    s += a[i];
}
Hennessy, Patterson, Computer Architecture ...
n-way comparison in parallel.
Least-recently-used (LRU) replacement strategy, or some approximation of it.
This favors temporal locality.
unsigned int i = 0, s = 0;
for (i=s=0; i<size; ++i) {
    s += a[i];
}
Two strategies for writes: write-through and write-back.
Two strategies on a write miss: write-allocate and no-write-allocate.
Note: current processors also use write-combining buffers.
Computation Structures, MIT 6.004.
Amir Kleen et al., Optimizing for instruction caches, EE Times, 2007.
What if I told you that iterating through a linked list is actually O(N√N) and hash lookups are O(√N)?
E. Ernerfeldt, The Myth of RAM, part I, 2014.
How do we fit everything into the cache?
A. Fredriksson, Cold, Hard Cache: Insomniac's Cache Simulator, GDC, 2017.
cachegrind
$ valgrind --tool=cachegrind ./a.out
$ valgrind --tool=lackey --trace-mem=yes pi 100
Parallelism is going to change everything!
A. Fredriksson, Cold, Hard Cache: Insomniac's Cache Simulator, GDC, 2017.
Every memory access consults the page table.
Hennessy, Patterson, Computer Architecture ...
A fully associative cache.
"Opteron TLB", Hennessy, Patterson, Computer Architecture ...
There are now multi-level TLBs: one at the L1 level and another at L2.
nicolasw@zx81:~$ grep Huge /proc/meminfo
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
nicolasw@zx81:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
nicolasw@zx81:~$ sudo echo "always" > /sys/kernel/mm/transparent_hugepage/enabled
nicolasw@zx81:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
nicolasw@zx81:~$ grep Huge /proc/meminfo
AnonHugePages:      4096 kB
Hugepagesize:       2048 kB
DGEMM
root@zx81:~# grep Huge /proc/meminfo
AnonHugePages:    184320 kB
Hugepagesize:       2048 kB
madvise()
NAME
       madvise - give advice about use of memory

SYNOPSIS
       #include <sys/mman.h>
       int madvise(void *addr, size_t length, int advice);
       ...
       MADV_HUGEPAGE (since Linux 2.6.38)
              Enable Transparent Huge Pages (THP) for pages in the range
              specified by addr and length.
ia32e
Four levels of indirection: 4 memory accesses for every 1 real access.
They are one of the few things that are actually worth doing:
Nicolás Wolovick 20200401
When we face a scenario where memory accesses are prohibitively expensive, the best strategy is to access memory as little as possible.
Is double needed, or does float suffice?

The "C" problem.
void updatePtrs(int *ptrA, int *ptrB, int *val) { // can be optimized
    *ptrA += *val;
    *ptrB += *val;
}


struct node {
    struct node *next, *prev;
};

void foo(struct node *n) { // cannot be optimized
    n->next->prev->next = n;
    n->next->next->prev = n;
}
We use the restrict keyword:
void updatePtrs(int * restrict ptrA, int * restrict ptrB,
                int * restrict val) {
And now the compiler can do what one would expect.
We can ask the compiler to lay things out aligned (GNU extension).
float a[L][L] __attribute__((aligned(64))),
      b[L]    __attribute__((aligned(64))),
      c[L]    __attribute__((aligned(64)));
Use aligned_alloc() for dynamic memory.
An example, thinking in terms of 64-byte cache lines.
#include <stdlib.h>
double *a = aligned_alloc(64, N*N*sizeof(double));
It does not seem to have any effect either.
In general, with modern compilers and memory-management libraries, everything is already aligned to the cache line size (CLS).
Do not rule it out, though; always confirm that the alignment is OK.
Keep in mind:
struct point {
    float dx, dy, dz, dw;
};
struct point p[N];

for (int i=0; i<N; ++i) {
    dist = sqrtf(p[i].dx*p[i].dx + p[i].dy*p[i].dy + p[i].dz*p[i].dz + p[i].dw*p[i].dw);
}
struct p {
    float dx[N], dy[N], dz[N], dw[N];
};
struct p p;

for (int i=0; i<N; ++i) {
    dist = sqrtf(p.dx[i]*p.dx[i] + p.dy[i]*p.dy[i] + p.dz[i]*p.dz[i] + p.dw[i]*p.dw[i]);
}
Think about the dynamics of the execution and which layout favors spatial locality.
Sequential vs. parallel dynamics (more on this later).
q = flops / memory accesses = 0 / (2L²) = 0
Memory-bound 😓
#define L (1<<11)

float a[L][L], b[L][L];

int main(void) {
    unsigned int i=0, j=0;
    for (i=0; i<L; ++i)
        for (j=0; j<L; ++j)
            b[j][i] = a[i][j];
    return (int)b[(int)b[0][2]][(int)b[2][0]];
}
$ gcc-8 -O3 mtxtransp1.c && perf stat -r 16 -e instructions,cycles,cache-references,cache-misses ./a.out

Performance counter stats for './a.out' (16 runs):

      24,518,383 instructions      # 0.04 insn per cycle            ( +- 1.46% ) (74.72%)
     589,419,748 cycles                                             ( +- 0.32% ) (74.86%)
       5,605,508 cache-references                                   ( +- 0.32% ) (74.83%)
       4,468,003 cache-misses      # 79.707 % of all cache refs     ( +- 0.28% ) (50.31%)

     0.228426360 seconds time elapsed                               ( +- 1.93% )
perf clearly shows the problem.

Intel Core2 Duo P8800@2.66GHz, gcc-8.0.1, Linux 4.15.11 x86_64
We apply cache blocking, a.k.a. loop tiling, by hand.
#include <assert.h>

#define L (1<<11)

const unsigned int BX=1<<4;
const unsigned int BY=1<<4;

float a[L][L], b[L][L];

int main(void) {
    unsigned int i=0, j=0, k=0, l=0;
    assert(0==L%BX && 0==L%BY);
    for (i=0; i<L; i+=BY)
        for (j=0; j<L; j+=BX)
            for (k=i; k<i+BY; ++k)
                for (l=j; l<j+BX; ++l)
                    b[l][k] = a[k][l];
    return (int)b[(int)b[0][2]][(int)b[2][0]];
}
$ gcc-8 -O3 mtxtransp2.c && perf stat -r 16 -e instructions,cycles,cache-references,cache-misses ./a.out

Performance counter stats for './a.out' (16 runs):

      40,650,922 instructions      # 0.35 insn per cycle            ( +- 0.69% ) (72.24%)
     114,607,354 cycles                                             ( +- 0.80% ) (72.54%)
       4,722,970 cache-references                                   ( +- 1.68% ) (77.67%)
         542,791 cache-misses      # 11.493 % of all cache refs     ( +- 1.97% ) (49.79%)

     0.044948155 seconds time elapsed                               ( +- 1.14% )
Speedup: 0.22s/0.045s = 4.88x
clap, clap, clap ...
Intel Core2 Duo P8800@2.66GHz, gcc-8.0.1, Linux 4.15.11 x86_64
"Any sufficiently advanced (compiler) technology is indistinguishable from magic.", Jorge Luis Borges, circa 1971.
GCC Manual, 3.10 Options That Control Optimization, GNU, 2018.
-ftree-loop-linear
-floop-strip-mine
-floop-block

Perform loop nest optimizations. Same as -floop-nest-optimize. To use this code transformation, GCC has to be configured with --with-isl to enable the Graphite loop transformation infrastructure.
Let's see how it goes ...
$ rm a.out; gcc-8 -O3 -floop-block mtxtransp1.c && perf stat -r 16 -e instructions,cycles,cache-references,cache-misses ./a.out

Performance counter stats for './a.out' (16 runs):

      92,203,819 instructions      # 0.96 insn per cycle            ( +- 0.56% ) (70.52%)
      95,723,528 cycles                                             ( +- 0.63% ) (78.14%)
       4,042,367 cache-references                                   ( +- 0.25% ) (78.24%)
         458,531 cache-misses      # 11.343 % of all cache refs     ( +- 3.31% ) (43.62%)

     0.037237697 seconds time elapsed                               ( +- 0.45% )
All runs with gcc-8, except Eulogia with gcc-7.4.
-O3 mtxtransp1.c                0.210
-O3 -floop-block mtxtransp1.c   0.037
-O3 mtxtransp2.c                0.045
-O3 -floop-block mtxtransp2.c   0.045
-O3 mtxtransp1.c                0.053
-O3 -floop-block mtxtransp1.c   0.033
-O3 mtxtransp2.c                0.032
-O3 -floop-block mtxtransp2.c   0.032
-O3 mtxtransp1.c                0.200
-O3 -floop-block mtxtransp1.c   0.130
-O3 mtxtransp2.c                0.141
-O3 -floop-block mtxtransp2.c   0.142
"Each machine is a world of its own.", JLB, circa 1962.
P8800, gcc-4.7.3, Linux 3.8.0 (plata in 2016)
From 0.21 s we got down to 0.031 s: a 6.77x speedup!
with gcc-8 and linux-4.15 (plata today)
with gcc-8 and linux-4.15 (jupiterace today)
with gcc-7 and linux-3.10 (Eulogia today)