CUDA 6

Presenter Notes

Resumen

Motivación de scan.
Scan.
- Warp scan.
- Block scan.
  - Shared memory bank conflicts.
- Global scan.
Ballot Scan.
Bibliografía.
Próxima clase.

Nicolás Wolovick 20160609

Presenter Notes

Scan

También conocido como prefix sum.

1 values     4   0   5   5   0   5   5   1   3   1   0   3   1   1   3   5
2 inclusive  4   4   9  14  14  19  24  25  28  29  29  32  33  34  37  42
3 exclusive  0   4   4   9  14  14  19  24  25  28  29  29  32  33  34  37 42

Se lo denomina scan por la terminología usada en APL para la suma de prefijos.

1     +/22 93 4.6 10 3.3
2 132.9
3     +\22 93 4.6 10 3.3
4 22 115 119.6 129.6 132.9
5     +/ (^\ 1 1 1 0 1 1)
6 3

K. E. Iverson, A Programming Language. Wiley, 1962.

Idem a la notación de Haskell:

1 Hugs> scanl (+) 0 [4,0,5,5]
2 [0,4,4,9,14]
3 Hugs> scanl1 (+) [4,0,5,5]
4 [4,4,9,14]

Presenter Notes

I Like APL

Presenter Notes

Usos de `scan`

Stream compaction (filter).
Sorting.
Construcción de histogramas.
Operaciones BLAS{2,3} sobre representaciones compactas de matrices ralas.
Etc.

G. E. Blelloch, Prefix Sums and Their Applications, Technical Report CMU-CS-90-190, School of Computer Science, Carnegie Mellon University, 1990.

Presenter Notes

Ejemplo: Stream Compaction

Uno de los usos clásicos es la aplicación de un filtro a los datos según un predicado.

GPU Gems 3, Chp39, Fig.9

Esto es un exclusive scan seguida de un scatter.

GPU Gems 3, Chp39, Fig.10

M. Harris, S. Sengupta, J. D. Owens, Parallel Prefix Sum (Scan) with CUDA, GPU Gems 3, Chapter 39, NVIDIA, 2007.

Presenter Notes

Ejemplo: Radix Sort

GPU Gems 3, Chp39, Fig.14

... y otra vez para el siguiente dígito.

M. Harris, S. Sengupta, J. D. Owens, Parallel Prefix Sum (Scan) with CUDA, GPU Gems 3, Chapter 39, NVIDIA, 2007.

Presenter Notes

Versión secuencial

1 void ScanSeqInc(const unsigned int N, int *values) {
2     uint sum = 0;
3     for(size_t i = 0; i < N; ++i) {
4         // For inclusive scan, increment sum before writing back.
5         sum += values[i];
6         values[i] = sum;
7     }
8 }

Complejidad computacional: O(N).

Difícil de vencer por un algoritmo paralelo.
Es como el algoritmo de Thomas para la resolución de matrices tridiagonales.

Presenter Notes

Paralelización

Trivial con N procesadores

El procesador i acumula el rango [0..i].
O(N×N) operaciones en total.

Para N procesadores, es O(N), ¡Cuac!

Arbol binario

log(N) etapas de N operaciones: O(N×log(N)) operaciones en total.
En la etapa j, con offset = 2^j el procesador i acumula el values[i-offset].

GPU Gems 3, Chp39, Fig.2

(pensar en la synchro que necesito)

Presenter Notes

Ejemplo árbol binario

Values        4   0   5   5
offset = 1    4   4   5  10
offset = 2    4   4   9  14

Con N procesadores, tenemos complejidad O(log(N)).

Notar

¡Para que sea de orden menor que O(N), necesito N procesadores reales!
Necesito doble-buffering para evitar la sobre-escritura de valores.

Presenter Notes

Código paralelo O(log(N))

 1 void ScanParallelPass(const unsigned int N, const unsigned int offset, int* values) {
 2     // Make a copy of the input values.
 3     int *source = NULL;
 4     memdup(source, values, N*sizeof(int));
 5 
 6     // Iterate from offset to count. An i < offset iteration would dereference
 7     // the source array at a negative argument.
 8     for(int tid=0; tid<N; ++tid)
 9         // Add element tid - offset into element tid.
10         if(tid >= offset) 
11             values[tid] = source[tid] + source[tid - offset];
12 }
13 
14 void ScanParallel(const unsigned int N, int *values) {
15     for(int offset=1; offset<N; offset*=2)
16         ScanParallelPass(N, offset, values);

Necesito las líneas 3 y 4 para que la asincronicidad del paralelismo del for de la línea 8 no sobre-escriba valores nuevos antes de haberlos usados: aka read-after-write (RAW) data hazard.

¡Muchas desventajas!

Presenter Notes

Problema del doble-buffering

Una solución es traer a la shmem e ir pasito a pasito con __syncthreads.

(ver Fig. 9.2 de PMPP2, creo que tiene errores de concurrencia si el órden de ejecución de los warps no están ordenados de menor a mayor)

Ó ...

Warp scan

En CUDA tengo 32 procesadores para mi solito.

La solución está en el warp, la unidad de ejecución síncrona de 32 procesadores.

Tenemos realmente 32 procesadores en paralelo.
¡Para N=32 tengo complejidad log(N)!
Las operaciones son síncronas.
No necesito doble-buffering.

Sean Baxter, Modern GPU, Scan, 2011.

Presenter Notes

Warp scan kernel

 1 __global__ void warp_scan1(const int *values, int *inclusive, int *exclusive) {
 2     __shared__ volatile int scan[CUDA_WARP_SIZE];
 3     int tid = threadIdx.x;
 4     int lane = tid & CUDA_WARP_MASK;
 5 
 6     // read from global
 7     int x = values[lane];
 8     scan[lane] = x;
 9 
10     int sum = x;
11     #pragma unroll
12     for (int offset=1; offset<CUDA_WARP_SIZE; offset*=2) {
13         if (lane>=offset)
14             sum += scan[lane-offset];
15         scan[lane] = sum;
16     }
17     inclusive[lane] = sum;
18     exclusive[lane] = sum - x;
19 }

Uso de volatile para hacer que las actualizaciones a memoria bajen inmediatamente.
lane es la denominación tradicional de los procesadores vectoriales.
Esta implementación funciona hasta 32 hilos, luego tid=lane.
Notar el patrón estándar de uso de memoria shared: cargar, operar, escribir.

Presenter Notes

Optimización: eliminación de branches

Padding a izquierda con CUDA_WARP_SIZE/2 valores en 0.

Values        0   0   0   0   4   0   5   5   0   5   5   1
offset = 1    0   0   0   0   4   4   5  10   5   5  10   6
offset = 2    0   0   0   0   4   4   9  14  10  15  15  11
offset = 4    0   0   0   0   4   4   9  14  14  19  24  25

Ahora siempre puedo hacer:

sum += scan[lane-offset];

sin tener que comparar para que no sea negativo.

Sean Baxter, Modern GPU, Scan, 2011.

Presenter Notes

Warp scan kernel, branchless

 1 #define SCAN_STRIDE (CUDA_WARP_SIZE + CUDA_WARP_SIZE/2)
 2 
 3 __global__ void warp_scan2(const int *values, int *inclusive, int *exclusive) {
 4     __shared__ volatile int scan[SCAN_STRIDE];
 5     int tid = threadIdx.x;
 6     int lane = tid & CUDA_WARP_MASK;
 7 
 8     volatile int *s = scan + lane + CUDA_WARP_SIZE/2;
 9     s[-(CUDA_WARP_SIZE/2)] = 0;
10 
11     // Read from global memory.
12     int x = values[lane];
13     s[0] = x;
14 
15     // Run each pass of the scan.
16     int sum = x;
17     for (int offset=1; offset<CUDA_WARP_SIZE; offset*=2) {
18         sum += s[-offset];
19         s[0] = sum;
20     }
21 
22     // Write sum to inclusive and sum - x (the original source element) to
23     // exclusive.
24     inclusive[lane] = sum;
25     exclusive[lane] = sum - x;
26 }

Presenter Notes

Análisis de performance (2012)

Ejecución sobre vector de 1 GB = (1<<28) × 4 bytes.

Versión predicada `warp_scan1`

1 $ make && ./scan-warp 28 && grep scan cuda_profile_0.log 
2 nvcc -O3 -arch=compute_20 -code=sm_20 -I/opt/cudasdk/C/common/inc -o scan-warp.o -c scan-warp.cu
3 nvcc -O3 -arch=compute_20 -code=sm_20 -I/opt/cudasdk/C/common/inc -o scan-warp scan-warp.o 
4 method=[ _Z10warp_scan1PKiPiS1_ ] gputime=[ 28884.385 ] cputime=[ 4.000 ] occupancy=[ 1.000 ]

Versión sin predicados `warp_scan2`

1 $ make && ./scan-warp 28 && grep scan cuda_profile_0.log 
2 nvcc -O3 -arch=compute_20 -code=sm_20 -I/opt/cudasdk/C/common/inc -o scan-warp.o -c scan-warp.cu
3 nvcc -O3 -arch=compute_20 -code=sm_20 -I/opt/cudasdk/C/common/inc -o scan-warp scan-warp.o 
4 method=[ _Z10warp_scan2PKiPiS1_ ] gputime=[ 28664.801 ] cputime=[ 4.000 ] occupancy=[ 1.000 ]

Presenter Notes

¡La optimización no sirve! (2012)

nvcc-4.1 cambió radicalmente con la incorporación de LLVM como backend-compiler.

Lo que antes tenía sentido en nvcc-3.x, ahora no.
Los compiladores para GPUs está ganando la madurez de los compiladores para CPUs.

Sean Baxter me dijo

Yah don't worry about the ISA so much. I've learned it's best to ignore it if you want to actually get stuff done :)

Amen.

Presenter Notes

¿Qué ancho de banda usamos?

Reportado por un test

 1 $ /opt/cudasdk/C/bin/linux/release/bandwidthTest
 2  Device 0: Tesla C2070
 3  Host to Device Bandwidth, 1 Device(s), Paged memory
 4    Transfer Size (Bytes)    Bandwidth(MB/s)
 5    33554432         3757.4
 6  Device to Host Bandwidth, 1 Device(s), Paged memory
 7    Transfer Size (Bytes)    Bandwidth(MB/s)
 8    33554432         3323.7
 9  Device to Device Bandwidth, 1 Device(s)
10    Transfer Size (Bytes)    Bandwidth(MB/s)
11    33554432         112411.3

Revisamos que la ECC no esté activa en la memoria:

1 $ nvidia-smi -q -i 0 | grep -i "Ecc Mode" -C 1
2     Ecc Mode
3         Current                 : Disabled

Lo que medimos

>>> ((3*(1<<28)*4)/0.028664) / (1<<30)
104.66089868825007

Achalay! Performance casi pico!

Las specs reportan 144 GBps.

Presenter Notes

Mediciones en Kepler+`nvcc-6.0` (2014)

Branchfull

1 $ nvprof --print-gpu-summary ./scan-warp 28
2 ==16491== NVPROF is profiling process 16491, command: ./scan-warp 28
3 ==16491== Profiling application: ./scan-warp 28
4 ==16491== Profiling result:
5 Time(%)      Time     Calls       Avg       Min       Max  Name
6   3.47%  19.883ms         1  19.883ms  19.883ms  19.883ms  warp_scan1(int const *, int*, int*)

Branchless

$ nvprof --print-gpu-summary ./scan-warp 28
==16175== NVPROF is profiling process 16175, command: ./scan-warp 28
==16175== Profiling application: ./scan-warp 28
==16175== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
  3.16%  18.063ms         1  18.063ms  18.063ms  18.063ms  warp_scan2(int const *, int*, int*)

Vuelve a tener sentido (ʘ‿ʘ)
10% de mejora.
~35% de incremento en ofuscación de código.

Presenter Notes

Mediciones en Maxwell+`nvcc-7.5` (2016)

Branchfull

1 nicolasw@zx81:~/Scan$ nvprof --print-gpu-trace ./scan-warp 28
2 ==21014== NVPROF is profiling process 21014, command: ./scan-warp 28
3 ==21014== Profiling application: ./scan-warp 28
4 ==21014== Profiling result:
5    Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
6 3.23214s  20.853ms        (1048576 1 1)       (256 1 1)        15        0B        0B         -           -  GeForce GTX TIT         1         7  set(unsigned int, int*) [186]
7 3.25300s  12.388ms        (1048576 1 1)       (256 1 1)        11  1.0000KB        0B         -           -  GeForce GTX TIT         1         7  warp_scan1(int const *, int*, int*) [192]

Branchless

1 nicolasw@zx81:~/Scan$ nvprof --print-gpu-trace ./scan-warp 28
2 ==21947== NVPROF is profiling process 21947, command: ./scan-warp 28
3 ==21947== Profiling application: ./scan-warp 28
4 ==21947== Profiling result:
5    Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
6 3.47419s  20.986ms        (1048576 1 1)       (256 1 1)        15        0B        0B         -           -  GeForce GTX TIT         1         7  set(unsigned int, int*) [186]
7 3.49518s  12.395ms        (1048576 1 1)       (256 1 1)        14  1.5313KB        0B         -           -  GeForce GTX TIT         1         7  warp_scan2(int const *, int*, int*) [192]

Vuelve a NO tener sentido.
Notar que estamos leyendo al palo: 242 GBPs.
In [1]: (( 3*(1<<28)*4)/0.012388) / (1<<30)
Out[1]: 242.16984178237004

Presenter Notes

Block scan: multiscan

Hasta ahora podíamos hacer scan de hasta 32 valores.
Para saltar a un bloque de warps, usamos la técnica de multiscan.

· Scan inclusivo de los warps.
· Obtener el total de cada warp.
· Hacer un scan exclusivo de estos totales.
· Estos son los desplazamientos de cada warp.

ModernGPU Multiscan

Sean Baxter, Modern GPU, Scan, 2011.

Presenter Notes

Block scan

1 int tid = threadIdx.x;
2 int warp = tid / CUDA_WARP_SIZE;
3 int lane = tid & CUDA_WARP_MASK;

Algoritmo de la división: tid = warp*CUDA_WARP_SIZE+lane.

Inclusive scan de cada uno de los warps que componen el bloque.

 1 int x = values[tid];
 2 scan[tid] = x;
 3 
 4 int sum = x;
 5 #pragma unroll
 6 for (int offset=1; offset<CUDA_WARP_SIZE; offset*=2) {
 7     if (lane>=offset)
 8         sum += scan[tid-offset];
 9     scan[tid] = sum;
10 }

Sincronización de todos los warps dentro de un bloque para que estén listos los totales.

1 __syncthreads();

Presenter Notes

Block scan-2

Exclusive scan de los totales de cada warp.

Notar que solo se usan NUM_WARPS hilos de todo el bloque, el resto queda idling.

 1 __shared__ volatile int totals[NUM_WARPS];
 2 if (tid < NUM_WARPS) {
 3     // Grab the block total for the tid'th block. This is the last element
 4     // in the block's scanned sequence.
 5     int total = scan[tid*CUDA_WARP_SIZE + CUDA_WARP_SIZE-1];
 6     totals[tid] = total;
 7 
 8     int totalsSum = total;
 9     #pragma unroll
10     for(int offset=1; offset<NUM_WARPS; offset*=2) {
11         if (tid>=offset)
12             totalsSum += totals[tid-offset];
13         totals[tid] = totalsSum;
14     }
15 
16     // Subtract total from totalsSum for an exclusive scan.
17     totals[tid] = totalsSum - total;
18 }

El patrón es el mismo que el bloque scan anterior, solo cambia el valor que se escribe para que sea exclusivo.

Presenter Notes

Block scan-3

Sincronización para que estén todos los totales.
(¿Porqué no se puede sacar esta sincronización si el warp que calcula los totales va todo síncrono?)

1 // Synchronize to make the block scan available to all warps.
2 __syncthreads();

Como hay una variable (registro) sum por cada hilo, acumulo el desplazamiento de totals en paralelo en todo el bloque.

1 // Add the block scan to the inclusive sum for the block.
2 sum += totals[warp];
3 
4 // Write the inclusive and exclusive scans to global memory.
5 inclusive[gtid] = sum;
6 exclusive[gtid] = sum - x;

Presenter Notes

Shared memory bank conflicts

Para tener tan alto ancho de banda (approx. 1.6 GBps), la shmem de la arquitectura, desde Fermi se dividen en

Bancos de palabras de 32 bits.
Un total de 32 bancos.

Problema

Los accesos a un mismo banco por múltiples hilos dentro de un warp se serializan.

Revisar bien como es la organización de shared memory para:

Tesla (GT2xx, CC 1.y)
Fermi (GF1xx, CC 2.y)
Kepler (GK{1,2}xx, CC 3.y)
Maxwell (GM{1,2}xx, CC 5.y)

Hay muchos cambios de una generación a otra. Revisar:

NVIDIA, NVIDIA CUDA C Programming Guide, Section G.3, G.4, G.5.

Presenter Notes

Ejemplo

1 float data = shared[lane*STRIDE];

Supongamos que lane (número de hilo dentro del warp) está en [0..31].
Veamos a que banco accede según el valor de STRIDE.

STRIDE=1, 0% conflictos en warp, todo en paralelo.

lane  bank
0     0
...
31    31

STRIDE=32, 100% conflictos en warp, todo serializado.

lane  bank
0     0
...
31    0

STRIDE=33, 0% conflictos en warp, todo en paralelo.

lane  bank
0     0
...
31    31

Presenter Notes

Conflictos en `scan-block`

Obtener la suma total de cada warp, en la etapa de reducción (2da etapa) puede producir serialización en un warp.

1 int total = scan[tid*WARP_STRIDE + CUDA_WARP_SIZE-1];

Como tid varía en [0..NUM_WARPS] y solo ejecuta 1 warp dentro del bloque:

Si WARP_STRIDE = CUDA_WARP_SIZE, tenemos 100% de conflictos.
Lo solucionamos haciendo que WARP_STRIDE sea coprimo de CUDA_WARP_SIZE, por ejemplo CUDA_WARP_SIZE+1=33.

Presenter Notes

Mediciones para Fermi+¿`nvcc-5.0`? (2012)

Para N=1<<28:

Block  #Totals  Stride            Time
256    8        CUDA_WARP_SIZE    40.218ms
256    8        CUDA_WARP_SIZE+1  41.189ms

512    16       CUDA_WARP_SIZE    45.418ms
512    16       CUDA_WARP_SIZE+1  43.541ms

1024   32       CUDA_WARP_SIZE    83.142ms
1024   32       CUDA_WARP_SIZE+1  67.737ms

Al aumentar el tamaño del bloque bajan la cantidad de bloques por SM y la performance global cae.
Sin embargo, un bloque grande genera muchos accesos en conflicto a la memoria compartida y se nota la influencia del padding para evitar el bank conflict.
Solo es un 33/32 = 3% más de almacenamiento en la shared.
Bajamos a 74.59 GBPs.
>>> (( 3*(1<<28)*4)/0.040218) / (1<<30)
74.59346561241236

Presenter Notes

Mediciones Maxwell+`nvcc-7.5` (2016)

Para N=1<<28:

Block   #Totals  Stride       Time
256     8       CUDA_WARP_SIZE    12.470ms
256     8       CUDA_WARP_SIZE+1  12.460ms

512     16       CUDA_WARP_SIZE    12.901ms
512     16       CUDA_WARP_SIZE+1  12.739ms

1024    32       CUDA_WARP_SIZE    14.451ms
1024    32       CUDA_WARP_SIZE+1  14.177ms

Tenemos casi lo mismo: 240 GBPs.
In [1]: (( 3*(1<<28)*4)/0.01247) / (1<<30)
Out[1]: 240.57738572574178

Presenter Notes

Beware of Warp Synchronous Programming

As a means of mitigating the cost of repeated block-level synchronizations, particularly in parallel primitives such as reduction and prefix sum, some programmers exploit the knowledge that threads in a warp execute in lock-step with each other to omit __syncthreads() in some places where it is semantically necessary for correctness in the CUDA programming model.
The absence of an explicit synchronization in a program where different threads communicate via memory constitutes a data race condition or synchronization error. Warp-synchronous programs are unsafe and easily broken by evolutionary improvements to the optimization strategies used by the CUDA compiler toolchain, which generally has no visibility into cross-thread interactions of this variety in the absence of barriers, or by changes to the hardware memory subsystem's behavior. Such programs also tend to assume that the warp size is 32 threads, which may not necessarily be the case for all future CUDA-capable architectures.
Therefore, programmers should avoid warp-synchronous programming to ensure future-proof correctness in CUDA applications. When threads in a block must communicate or synchronize with each other, regardless of whether those threads are expected to be in the same warp or not, the appropriate barrier primitives should be used. Note that the Warp Shuffle primitive presents a future-proof, supported mechanism for intra-warp communication that can safely be used as an alternative in many cases.

Presenter Notes

Warp shuffle

¡El 4to scope de comunicación!

1 local
2 warp
3 shared
4 global

Recordar:

All performance is from parallelism (the free launch is over)
Machines are power limited (efficiency IS performance)
Machines are communication limited (locality IS performance)

Bill Dally, Efficiency and Programmability: Enablers for ExaScale, SC13.

Presenter Notes

Ejemplos

General

Up&down

Accelerware, Kepler Shuffle Instruction, 2013.

Presenter Notes

Usos: reducción

Una mezcla para abajo de ancho 8 y reducción en 3 pasos solo usando registros.

(Justin Luitjens, Faster Parallel Reductions in Kepler, ∥∀, 2013.)

Presenter Notes

Usos prefix sum

Presenter Notes

Código

 1 __global__ void warp_scan3(const int *values, int *inclusive, int *exclusive) {
 2     int tid = threadIdx.x;
 3     int lane = tid & CUDA_WARP_MASK;
 4     //int warp = tid / CUDA_WARP_SIZE;
 5     int gtid = blockIdx.x * blockDim.x + tid;
 6 
 7     // read from global
 8     int x = values[gtid];
 9 
10     int sum = x;
11     #pragma unroll
12     for (int i=1; i<CUDA_WARP_SIZE; i*=2) {
13         int n = __shfl_up(sum, i, CUDA_WARP_SIZE);
14         if (lane >= i)
15             sum += n;
16     }
17 
18     inclusive[gtid] = sum;
19     exclusive[gtid] = sum - x;
20 }

Presenter Notes

Medición Kepler+`nvcc-6.0` (2014)

1 $ nvprof ./scan-warp 28
2 ==4332== NVPROF is profiling process 4332, command: ./scan-warp 28
3 ==4332== Profiling application: ./scan-warp 28
4 ==4332== Profiling result:
5 Time(%)      Time     Calls       Avg       Min       Max  Name
6  98.01%  2.75685s         3  918.95ms  912.59ms  923.26ms  [CUDA memcpy DtoH]
7   1.45%  40.853ms         1  40.853ms  40.853ms  40.853ms  set(unsigned int, int*)
8   0.53%  14.979ms         1  14.979ms  14.979ms  14.979ms  warp_scan3(int const *, int*, int*)

Correcto.
Performante.
Independiente de nuevos cambios de arquitectura.
shuffle FTW!

Presenter Notes

Medición Maxwell+`nvcc-7.5` (2016)

1 $ nvprof ./scan-warp 28
2 ==5546== NVPROF is profiling process 5546, command: ./scan-warp 28
3 ==5546== Profiling application: ./scan-warp 28
4 ==5546== Profiling result:
5 Time(%)      Time     Calls       Avg       Min       Max  Name
6  89.01%  269.94ms         3  89.979ms  89.775ms  90.171ms  [CUDA memcpy DtoH]
7   6.91%  20.957ms         1  20.957ms  20.957ms  20.957ms  set(unsigned int, int*)
8   4.08%  12.369ms         1  12.369ms  12.369ms  12.369ms  warp_scan3(int const *, int*, int*)

Presenter Notes

Global scan

Presenter Notes

Global scan

La misma idea warp-block ahora en block-global.

GPU Gems 3, Chp39, Fig.6

Dos niveles con la misma idea.

M. Harris, S. Sengupta, J. D. Owens, Parallel Prefix Sum (Scan) with CUDA, GPU Gems 3, Chapter 39, NVIDIA, 2007.

Presenter Notes

Ballot scan

Presenter Notes

Stream Compaction

Uno de los usos clásicos es la aplicación de un filtro a los datos según un predicado.

GPU Gems 3, Chp39, Fig.9

Esto es simplemente una etapa de exclusive scan seguida de un scatter.

GPU Gems 3, Chp39, Fig.10

M. Harris, S. Sengupta, J. D. Owens, Parallel Prefix Sum (Scan) with CUDA, GPU Gems 3, Chapter 39, NVIDIA, 2007.

Presenter Notes

Ballot

Existen instrucciones de la Fermi ISA que hacen el exclusive scan de un arreglo binario de un warp en tan solo 3 operaciones, manejando toda la comunicación intra-warp de manera automática.

`uint __ballot(uint pred)`

Retorna en todos los hilos del warp un entero de 32 bits, donde cada bit está seteado a 1 sii pred es válido en ese hilo.

Notar que está haciendo un mapeo + broadcast en todo el warp.
Similar a MPI_Allreduce para dar un ejemplo en otro paradigma de paralelismo.

(Sección 8.7.12.5 de Parallel Thread Execution ISA)

Presenter Notes

Ballot scan en un warp

La magia de __ballot mas un par de operaciones bitwise hacen todo el trabajo.

1 uint flag = -1.0f != val;
2 
3 uint bits = __ballot(flag);
4 
5 uint mask = bfi(0, 0xffffffff, 0, tid);
6 uint exc = __popc(mask & bits);
7 uint total = __popc(bits);

Otras operaciones bitwise

__popc(uint a): cuenta la cantidad de bits prendidos en a.
uint bfi(uint a, uint b, uchar pos, uchar len): inserta a en la posición pos de b, len bits.

Entonces en cada hilo exc tiene la suma prefija y total cuantos hilos cumplen con el predicado.

Sean Baxter, Modern GPU, Scan, 2011.

Presenter Notes

Ballot, el código

Se necesita inline assembler, porque bfi no tiene un intrinsic asociado.

1 __device__ __forceinline__ uint bfi(uint x, uint y, uint bit, uint numBits) {
2     uint ret;
3     asm("bfi.b32 %0, %1, %2, %3, %4;" :
4         "=r"(ret) : "r"(y), "r"(x), "r"(bit), "r"(numBits));
5     return ret;
6 }

Inline PTX Assembly in CUDA v7.5, NVIDIA Inc., 2016.

Presenter Notes

Ballot, el código-2

 1 __global__ void BallotScanWarp(const float *dataIn_global,
 2                                float *dataOut_global, uint *countOut_global) {
 3     uint tid = threadIdx.x;
 4     float val = dataIn_global[tid];
 5 
 6     uint flag = -1.0f != val;
 7 
 8     uint bits = __ballot(flag);
 9 
10     uint mask = bfi(0, 0xffffffff, 0, tid);
11     uint exc = __popc(mask & bits);
12     uint total = __popc(bits);
13 
14     __shared__ volatile float shared[CUDA_WARP_SIZE];
15     if (flag)
16         shared[exc] = val;
17     val = shared[tid];
18 
19     if (tid < total)
20         dataOut_global[tid] = val;
21 
22     if (!tid)
23         *countOut_global = total;
24 }

Presenter Notes

Medición Maxwell+`nvcc-7.5` (2016)

 1 $ nvprof ./ballot-scan-warp-simple 28
 2 ==23398== NVPROF is profiling process 23398, command: ./ballot-scan-warp-simple 28
 3 13
 4  128   69  -53 -126  -83   36  122   96  -18 -116 -107    0  108  116   17  -97 -122  -35   84  126   52  -70 -127  -68   54  126   82  -37 -123  -95   19  117 
 5  128   69  122   96  108  116   84  126   52   54  126   82  117 
 6 ==23398== Profiling application: ./ballot-scan-warp-simple 28
 7 ==23398== Profiling result:
 8 Time(%)      Time     Calls       Avg       Min       Max  Name
 9  89.76%  247.66ms         3  82.552ms  1.8240us  247.65ms  [CUDA memcpy DtoH]
10   7.60%  20.967ms         1  20.967ms  20.967ms  20.967ms  set(unsigned int, int*)
11   2.65%  7.2994ms         1  7.2994ms  7.2994ms  7.2994ms  BallotScanWarp(int const *, int*, unsigned int*)

Mejoramos un poquito los 240 GiB/s.
In [2]: (( 2*(1<<28)*4)/0.00729) / (1<<30)
Out[2]: 274.34842249657066
Este es solo un stream compaction por warp. Falta hacerlo por bloque y luego global.
Para hacerlo por bloque hay que hacer una exclusive prefix sum para sacar los offsets del scatter.

Presenter Notes

Bibliografía

NVIDIA, NVIDIA CUDA C Programming Guide v7.5, 2016.
M. Harris, S. Sengupta, J. D. Owens, Parallel Prefix Sum (Scan) with CUDA, GPU Gems 3, Chapter 39, NVIDIA, 2007.
Sean Baxter, Modern GPU, Scan, 2011.
G. E. Blelloch, Prefix Sums and Their Applications, Technical Report CMU-CS-90-190, School of Computer Science, Carnegie Mellon University, 1990.
NVIDIA, Parallel Thread Execution ISA v7.5, 2016.
NVIDIA, Inline PTX Assembly in CUDA v7.5, 2016.

Presenter Notes

La clase que viene

GPUs, the hype ...

Table of Contents	t
Exposé	ESC
Full screen slides	e
Presenter View	p
Source Files	s
Slide Numbers	n
Toggle screen blanking	b
Show/hide slide context	c
Notes	2
Help	h

Trivial con N procesadores

Arbol binario

Notar

Warp scan

Versión predicada warp_scan1

Versión sin predicados warp_scan2

Sean Baxter me dijo

Reportado por un test

Lo que medimos

Branchfull

Branchless

Branchfull

Branchless

Problema

General

Up&down

uint __ballot(uint pred)

Otras operaciones bitwise

Table of Contents

Help

Versión predicada `warp_scan1`

Versión sin predicados `warp_scan2`

`uint __ballot(uint pred)`