Scaling on SMP

Summary:

  • General considerations.
  • sgemv scaling.
  • minifmm scaling.
  • Common mistakes.

Nicolás Wolovick 20180517

General considerations

The 3 horsemen of the apocalypse, and off-by-one errors:

  • Synchronization.
  • False sharing (see the sketch below).
  • Memory placement.
  • Memory bandwidth.
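
A minimal sketch of false sharing (the names counter and count are illustrative, not from the course code): each thread updates only its own slot, yet all slots share one cache line.

    #include <omp.h>
    #define T 8

    /* Eight 8-byte counters fit in a single 64-byte cache line, so every
       increment invalidates the other cores' copies of the line.
       Uncommenting the padding puts each slot on its own line. */
    struct slot { long c; /* char pad[56]; */ } counter[T];

    void count(long n)
    {
        #pragma omp parallel num_threads(T)
        {
            int t = omp_get_thread_num();
            for (long i = 0; i < n; ++i)
                counter[t].c++;   /* the cache line ping-pongs between cores */
        }
    }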

Best practices

From Using OpenMP, Chapter 5.

  • Optimize barrier usage: nowait (see the sketch after this list).
  • Avoid the ordered construct.
  • Avoid large critical regions.
  • Maximize parallel regions.
  • Avoid parallel regions in inner loops.
  • Fix load imbalance: dynamic scheduling.
  • master vs. single.
  • Avoid false sharing.
  • Be careful with private memory: unnecessary copies.
    • Also automatic variables declared inside parallel constructs.
  • Pin threads to processors and distribute memory across them: numactl.
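
A minimal sketch of two of the items, nowait and dynamic scheduling (the names a, b, and scale are illustrative; it assumes the two arrays are independent):

    #include <omp.h>
    #define N (1 << 20)
    float a[N], b[N];

    void scale(float fa, float fb)
    {
        #pragma omp parallel
        {
            /* a and b are independent, so threads may move on to the
               second loop without waiting at the first loop's barrier */
            #pragma omp for nowait
            for (int i = 0; i < N; ++i)
                a[i] *= fa;

            /* chunks are handed out on demand: cheap load balancing */
            #pragma omp for schedule(dynamic, 4096)
            for (int i = 0; i < N; ++i)
                b[i] *= fb;
        }
    }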

Saving fork-joins?

One may be worried about the creation of new threads within the inner loop. Worry not: libgomp in GCC is smart enough to create the threads only once. Once the team has done its work, the threads are returned to a "dock", waiting for new work to do. In other words, the number of times the clone system call is executed is exactly the maximum number of concurrent threads. The parallel directive is not the same as a pthread_create/pthread_join pair.

There will be lots of locking/unlocking due to the implied barriers, though. I don't know whether that can reasonably be avoided, or whether it even should be.

So the example in Using OpenMP..., Fig. 5.24 is not as bad as it looks.
Hoisting the parallel region can still improve thread locality, and with it the reuse of the local caches (sketched below).
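
A minimal sketch of the pattern under discussion (assuming, as the quote above suggests, a parallel region inside a sequential outer loop; relax, x, y, and NITER are illustrative):

    #define NITER 100
    #define N (1 << 20)
    float x[N], y[N];

    void relax(void)
    {
        for (int iter = 0; iter < NITER; ++iter) {
            /* re-entered NITER times, but threads are cloned only once */
            #pragma omp parallel for
            for (int i = 0; i < N; ++i)
                x[i] = 0.5f * (x[i] + y[i]);   /* placeholder work */
        }
    }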

Experiment: parallel-parallel.c

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel
        {
            printf("1st parallel, tid %d\n", omp_get_thread_num());
        }

        printf("In the middle, tid %d\n", omp_get_thread_num());

        #pragma omp parallel
        {
            printf("2nd parallel, tid %d\n", omp_get_thread_num());
        }

        return 0;
    }

We count how many clone system calls it makes:

    $ gcc -fopenmp parallel-parallel.c && OMP_NUM_THREADS=4 strace ./a.out 2>&1 | grep clone
    clone(child_stack=0x7f8af92f5f70, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f8af92f69d0, tls=0x7f8af92f6700, child_tidptr=0x7f8af92f69d0) = 30402
    clone(child_stack=0x7f8af8af4f70, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f8af8af59d0, tls=0x7f8af8af5700, child_tidptr=0x7f8af8af59d0) = 30403
    clone(child_stack=0x7f8af82f3f70, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f8af82f49d0, tls=0x7f8af82f4700, child_tidptr=0x7f8af82f49d0) = 30404

Indeed, GOMP_parallel_start() and GOMP_parallel_end() keep the threads "docked" for reuse: with OMP_NUM_THREADS=4 there are only three clone calls, because the master thread is the fourth member of the team.
The remaining cost lies in the barriers and/or in the migration of threads to other CPUs.

Sun Fire 15K, "Starcat"

  • ~2002
  • 72 UltraSPARC IV (64 bit, @1.35 GHz)
  • 576 GiB RAM.
  • "sustained system bandwidth of 43.2 GB per second"
  • >10M USD

The costs: synchronization

Charts from: Reid, Bull, OpenMP Microbenchmarks Version 2.0, 2004. Measured on the Sun Fire 15K.

[figure: Synchronization overhead]

Mutual exclusion overhead

[figure: Mutual exclusion overhead]

Scheduling overhead

[figure: Scheduling overhead]

(8 processors)

Array copying overhead

[figure: Array overhead]

(8 processors)

Array copying overhead (cont.)

[figure: Array overhead 2]

(729-element array)

A more recent study

Joseph Harkness, Extending the EPCC OpenMP Microbenchmarks for OpenMP 3.0, University of Edinburgh, 2010.

sgemv Scalability Study

The goal: superlinear speedup

[figure: Using OpenMP, Fig. 5.34]

(Using OpenMP, p.162)
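
For reference, the standard definitions (not from the book excerpt): with $T_p$ the walltime on $p$ threads,

    S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}, \qquad \text{superlinear: } S(p) > p \iff E(p) > 1

On SMPs this typically happens when the per-thread working set starts to fit in the aggregated caches.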

Parallel sgemv

    #include <stdio.h>
    #include <omp.h>

    #ifndef N
    #define N 1024
    #endif

    float a[N][N], b[N], c[N];

    int main(void)
    {
        unsigned int i = 0, j = 0;
        double start = 0.0;

        start = omp_get_wtime();
        #pragma omp parallel for default(none) shared(start,a,b,c) private(i,j)
        for (i=0; i<N; ++i)
            for (j=0; j<N; ++j)
                c[i] += a[i][j]*b[j];
        /* ~3 floats of traffic per inner iteration; result in GiB/s */
        printf("%f\n", ((long)N*N*3*sizeof(float))/((1<<30)*(omp_get_wtime()-start)));

        return 0;
    }

Compilation and execution

    gcc-8 -O3 -fopenmp $PROG.c -o $PROG -DN="$n"
    OMP_NUM_THREADS=$t taskset 0x0000000F numactl --interleave=all ./$PROG

taskset 0x0000000F restricts the run to logical CPUs 0-3; numactl --interleave=all interleaves page allocation across all NUMA nodes.

On mini: performance

[figure: sgemv-mini-perf]

1 * Intel Core i7-950 @3.07GHz (4 cores, 8 threads), 16GB DDR3 1066MHz.

On mini: efficiency

[figure: sgemv-mini-eff]

On ganesh: performance

[figure: sgemv-ganesh-perf]

4 * AMD Opteron 8212 @2.0GHz (2 cores), 4*8GB.

On ganesh: efficiency

[figure: sgemv-ganesh-eff]

On zx81: performance

[figure: sgemv-zx81-perf]

2 * Intel E5-2620 v3 (6 cores), 4 samples.

On zx81: efficiency

[figure: sgemv-zx81-eff]

On nabucodonosor: performance

[figure: sgemv-nabu-perf]

2 * Intel E5-2680 v2 (10 cores), 4 samples.

On nabucodonosor: efficiency

[figure: sgemv-nabu-eff]

MiniFMM

We saw several things:

  • 2 computation phases; only the first one, which is parallel, is timed.
  • Scaling from 6 to 12 threads on zx81 is far from perfect.
  • There is load imbalance (visible in htop).
  • taskset effectively pins threads to cores.
  • The scheduler splits 6 threads as 3+3 across the NUMA nodes.
  • The omp and omp-task-depend versions show almost no difference in walltime (a sketch of the task-depend style follows this list).
  • See the paper: it shows a task-scaling problem of tasks in GOMP.
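
A minimal sketch of the task-depend style (hypothetical names, not MiniFMM's actual code): depend clauses order the two tasks touching the same element without a global barrier between the phases.

    #include <omp.h>
    #define NODES 1024
    static double up[NODES];

    static void upward(int i)   { up[i] += 1.0; }  /* placeholder work */
    static void downward(int i) { up[i] *= 0.5; }  /* placeholder work */

    void traverse(void)
    {
        #pragma omp parallel
        #pragma omp single
        for (int i = 0; i < NODES; ++i) {
            #pragma omp task depend(out: up[i])
            upward(i);
            /* runs only after the task writing up[i] has finished */
            #pragma omp task depend(in: up[i])
            downward(i);
        }
    }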

OpenMP Performance Analysis

This somewhat surprising result is of course specific to the algorithm, implementation, system, and software used and would not be possible if the code were not so amenable to compiler analysis. The lesson to be learned from this study is that, for important program regions, both experimentation and analysis are needed. We hope that the insights given here are sufficient for a programmer to get started on this process.

(Using OpenMP, p.190)

Common mistakes

Common Mistakes in OpenMP ...

[figure: list of common mistakes]

Other mistakes

  • Typos: #pragma opm parallel for.
    • It fails silently; the code just runs sequentially.
    • At the very least, check that it scales.
  • Forgetting private (see the sketch below).
  • Forgetting the { ... }.
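
A classic instance of the forgotten private (an illustrative sketch reusing the arrays from the sgemv example above): the inner index j defaults to shared, so all threads race on a single copy of it.

    /* BUGGY: j is shared; should be private(i,j) as in the sgemv code */
    unsigned int i, j;
    #pragma omp parallel for
    for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
            c[i] += a[i][j] * b[j];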

Example

    /* Missing braces: only the first statement after the parallel
       directive is inside the region; the ++ runs once, sequentially. */
    #pragma omp parallel
        #pragma omp atomic
        sum += a[omp_get_thread_num()];
        ++a[omp_get_thread_num()];
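
For contrast, a sketch of what was probably intended, with braces:

    #pragma omp parallel
    {
        #pragma omp atomic
        sum += a[omp_get_thread_num()];
        ++a[omp_get_thread_num()];   /* each thread bumps its own slot */
    }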

Use tools.

Bibliography


Next class

  • GPU
