CUDA 3ra parte

Presenter Notes

Resumen

Motivación de scan.
Scan.
- Warp scan.
- Block scan.
  - Shared memory bank conflicts.
- Global scan.
Ballot Scan.
Bibliografía.
Próxima clase.

Nicolás Wolovick, $Date: 2012-06-14 18:44:32 -0300 (Thu, 14 Jun 2012) $, $Revision: 3609 $

Presenter Notes

Scan

También conocido como prefix sum.

1 values     4   0   5   5   0   5   5   1   3   1   0   3   1   1   3   5
2 inclusive  4   4   9  14  14  19  24  25  28  29  29  32  33  34  37  42
3 exclusive  0   4   4   9  14  14  19  24  25  28  29  29  32  33  34  37

Se lo denomina scan por la terminología usada en APL para la suma de prefijos.

K. E. Iverson, A Programming Language. Wiley, 1962.

Idem a la notación de Haskell:

1 Hugs> scanl (+) 0 [4,0,5,5,0,5,5,1,3,1,0,3,1,1,3,5]
2 [0,4,4,9,14,14,19,24,25,28,29,29,32,33,34,37,42]
3 Hugs> scanl1 (+) [4,0,5,5,0,5,5,1,3,1,0,3,1,1,3,5]
4 [4,4,9,14,14,19,24,25,28,29,29,32,33,34,37,42]

Presenter Notes

Usos de `scan`

Stream compaction (filter).
Sorting.
Construcción de histogramas.
Operaciones BLAS sobre representaciones compactas de matrices ralas.
Etc.

Bibliografía

G. E. Blelloch, Prefix Sums and Their Applications, Technical Report CMU-CS-90-190, School of Computer Science, Carnegie Mellon University, 1990.

Presenter Notes

Versión secuencial

1 void ScanSeqInc(const unsigned int N; int *values) {
2     uint sum = 0;
3     for(int i=0; i<N; ++i) {
4         // For inclusive scan, increment sum before writing back.
5         sum += values[i];
6         values[i] = sum;
7     }
8 }

Presenter Notes

Paralelización

Varias opciones

Trivial, hace O(N×N) operaciones
- Cada procesador i acumula el rango [0..i].
Binary-tree-based. log(N) etapas de N operaciones, O(N×log(N)).
- En la etapa offset = (1<<j) el procesador i acumula el lugar i-offset.

GPU Gems 3, Chp39, Fig.2

(M. Harris, S. Sengupta, J. D. Owens, Parallel Prefix Sum (Scan) with CUDA, GPU Gems 3, Chapter 39, NVIDIA, 2007)

Presenter Notes

Ejemplo

Values        4   0   5   5   0   5   5   1   3   1   0   3   1   1   3   5
offset = 1    4   4   5  10   5   5  10   6   4   4   1   3   4   2   4   8
offset = 2    4   4   9  14  10  15  15  11  14  10   5   7   5   5   8  10
offset = 4    4   4   9  14  14  19  24  25  24  25  20  18  19  15  13  17
offset = 8    4   4   9  14  14  19  24  25  28  29  29  32  33  34  37  42

La implementación secuencial es O(N)!

Problemas

Para que sea de orden menor que O(N), necesito N procesadores reales.
Necesito doble-buffering para evitar la sobre escritura de valores.

Presenter Notes

Código paralelo N×log(N)

Notar que es in-place.

 1 void ScanParallelPass(const unsigned int N, const unsigned int offset, int* values) {
 2     // Make a copy of the input values.
 3     int *source = NULL;
 4     memdup(source, values, N*sizeof(int));
 5 
 6     // Iterate from offset to count. An i < offset iteration would dereference
 7     // the source array at a negative argument.
 8     for(int tid=0; tid<N; ++tid)
 9         // Add element tid - offset into element tid.
10         if(tid >= offset) 
11             values[tid] = source[tid] + source[tid - offset];
12 }
13 
14 void ScanParallel(const unsigned int N, int *values) {
15     for(int offset=1; offset<N; offset*=2)
16         ScanParallelPass(N, offset, values);

Necesito las líneas 3 y 4 para que la asincronicidad del paralelismo del for de la línea 8 no sobre-escriba valores nuevos antes de haberlos usados: aka read-after-write data hazard.

Muchas desventajas!

Presenter Notes

Warp scan

La solución está en el warp, la unidad de ejecución síncrona de 32 procesadores.

Tenemos realmente 32 procesadores en paralelo.
Para N=32 tengo complejidad log(N)!
Las operaciones son síncronas.
No necesito doble-buffering.

(Sean Baxter, Modern GPU, Scan, 2011.)

Presenter Notes

Warp scan kernel

 1 __global__ void warp_scan1(const int *values, int *inclusive, int *exclusive) {
 2     __shared__ volatile int scan[WARP_SIZE];
 3     int tid = threadIdx.x;
 4     int lane = (WARP_SIZE-1) & tid;
 5 
 6     // read from global
 7     int x = values[lane];
 8     scan[lane] = x;
 9 
10     int sum = x;
11     #pragma unroll
12     for (int offset=1; offset<WARP_SIZE; offset*=2) {
13         if (lane>=offset) sum += scan[lane-offset];
14         scan[lane] = sum;
15     }
16     inclusive[lane] = sum;
17     exclusive[lane] = sum - x;
18 }

Uso de volatile para hacer que los updates a memoria bajen inmediatamente.
lane es la denominación tradicional de los procesadores vectoriales.
Esta implementación funciona hasta 32 hilos, luego tid=lane.
Notar el patrón estándar de uso de memoria shared: cargar, operar, escribir.

Presenter Notes

Fermi ISA

Veamos a que compila realmente: Fermi ISA, no es el PTX de nivel intermedio.

1 $ make scan-warp-simple.isa
2 nvcc -O3 -arch=compute_20 -code=sm_20 -I/opt/cudasdk/C/common/inc --cubin -o scan-warp-simple.cubin scan-warp-simple.cu
3 cuobjdump -sass scan-warp-simple.cubin > scan-warp-simple.isa
4 rm scan-warp-simple.cubin

Resultado

 1 code for sm_20
 2         Function : _Z9warp_scanPKiPiS1_
 3 /*0000*/     /*0x00005de428004404*/     MOV R1, c [0x1] [0x100];
 4 /*0008*/     /*0x84001c042c000000*/     S2R R0, SR_Tid_X;
 5 /*0010*/     /*0x10021de218000000*/     MOV32I R8, 0x4;
 6 /*0018*/     /*0x7c01dc036800c000*/     LOP.AND R7, R0, 0x1f;
 7 /*0020*/     /*0x08701e036000c000*/     SHL.W R0, R7, 0x2;
 8 /*0028*/     /*0x10709c435000c000*/     IMUL.U32.U32.HI R2, R7, 0x4;
 9 /*0030*/     /*0xfc71dc23190e0000*/     ISETP.EQ.AND P0, pt, R7, RZ, pt;
10 /*0038*/     /*0x80011c0348014000*/     IADD R4.CC, R0, c [0x0] [0x20];
11 /*0040*/     /*0xfc025c0348000000*/     IADD R9, R0, RZ;
12 /*0048*/     /*0x78719c035800c000*/     SHR.U32 R6, R7, 0x1e;
13 /*0050*/     /*0x90215c4348004000*/     IADD.X R5, R2, c [0x0] [0x24];
14 /*0058*/     /*0x00429c8584000000*/     LD.E R10, [R4];

· Instrucciones de ancho fijo.
· Arquitectura load/store.
· Instrucciones de 3 operandos (a veces hasta 4).

Presenter Notes

Fermi ISA-2

 1 code for sm_20
 2 /*0060*/     /*0x00029c85c9000000*/     STS [R0], R10;
 3 /*0068*/     /*0xf090a085c103ffff*/     @!P0 LDS R2, [R9+-0x4];
 4 /*0070*/     /*0x280101e428000000*/     @P0 MOV R4, R10;
 5 /*0078*/     /*0x2821200348000000*/     @!P0 IADD R4, R2, R10;
 6 /*0080*/     /*0x0871dc03188ec000*/     ISETP.LT.U32.AND P0, pt, R7, 0x2, pt;
 7 /*0088*/     /*0x00011c85c9000000*/     STS [R0], R4;
 8 /*0090*/     /*0xe090a085c103ffff*/     @!P0 LDS R2, [R9+-0x8];
 9 /*0098*/     /*0x1021200348000000*/     @!P0 IADD R4, R2, R4;
10 /*00a0*/     /*0x1071dc03188ec000*/     ISETP.LT.U32.AND P0, pt, R7, 0x4, pt;
11 /*00a8*/     /*0x00011c85c9000000*/     STS [R0], R4;
12 /*00b0*/     /*0xc090a085c103ffff*/     @!P0 LDS R2, [R9+-0x10];
13 /*00b8*/     /*0x1021200348000000*/     @!P0 IADD R4, R2, R4;
14 /*00c0*/     /*0x2071dc03188ec000*/     ISETP.LT.U32.AND P0, pt, R7, 0x8, pt;
15 /*00c8*/     /*0x00011c85c9000000*/     STS [R0], R4;
16 /*00d0*/     /*0x8090a085c103ffff*/     @!P0 LDS R2, [R9+-0x20];
17 /*00d8*/     /*0x1021200348000000*/     @!P0 IADD R4, R2, R4;
18 /*00e0*/     /*0x4071dc03188ec000*/     ISETP.LT.U32.AND P0, pt, R7, 0x10, pt;
19 /*00e8*/     /*0xa0709c0320118000*/     IMAD.U32.U32 R2.CC, R7, R8, c [0x0] [0x28];
20 /*00f0*/     /*0x00011c85c9000000*/     STS [R0], R4;
21 /*00f8*/     /*0x00916085c103ffff*/     @!P0 LDS R5, [R9+-0x40];
22 /*0100*/     /*0xb060dc4348004000*/     IADD.X R3, R6, c [0x0] [0x2c];
23 /*0108*/     /*0xc0721c0320118000*/     IMAD.U32.U32 R8.CC, R7, R8, c [0x0] [0x30];
24 /*0110*/     /*0xd0625c4348004000*/     IADD.X R9, R6, c [0x0] [0x34];
25 /*0118*/     /*0x1051200348000000*/     @!P0 IADD R4, R5, R4;
26 /*0120*/     /*0x28415d0348000000*/     IADD R5, R4, -R10;
27 /*0128*/     /*0x00211c8594000000*/     ST.E [R2], R4;
28 /*0130*/     /*0x00011c85c9000000*/     STS [R0], R4;
29 /*0138*/     /*0x00815c8594000000*/     ST.E [R8], R5;
30 /*0140*/     /*0x00001de780000000*/     EXIT;

Presenter Notes

Fermi ISA-3

El compilador hizo el unrolling.
Uso de branch predication.

Presenter Notes

Optimización: eliminación de branches

Padding a izquierda con WARP_SIZE/2 valores en 0.

Values        0   0   0   0   4   0   5   5   0   5   5   1
offset = 1    0   0   0   0   4   4   5  10   5   5  10   6
offset = 2    0   0   0   0   4   4   9  14  10  15  15  11
offset = 4    0   0   0   0   4   4   9  14  14  19  24  25

Ahora siempre puedo hacer:

sum += scan[lane-offset];

sin tener que comparar para que no sea negativo.

(Sean Baxter, Modern GPU, Scan, 2011.)

Presenter Notes

Warp scan kernel, branchless

 1 #define SCAN_STRIDE (WARP_SIZE + WARP_SIZE/2)
 2 
 3 __global__ void warp_scan2(const int *values, int *inclusive, int *exclusive) {
 4     __shared__ volatile int scan[SCAN_STRIDE];
 5     int tid = threadIdx.x;
 6     int lane = (WARP_SIZE-1) & tid;
 7 
 8     volatile int *s = scan + lane + WARP_SIZE/2;
 9     s[-(WARP_SIZE/2)] = 0;
10 
11     // Read from global memory.
12     int x = values[lane];
13     s[0] = x;
14 
15     // Run each pass of the scan.
16     int sum = x;
17     for (int offset=1; offset<WARP_SIZE; offset*=2) {
18         sum += s[-offset];
19         s[0] = sum;
20     }
21 
22     // Write sum to inclusive and sum - x (the original source element) to
23     // exclusive.
24     inclusive[lane] = sum;
25     exclusive[lane] = sum - x;
26 }

Presenter Notes

Fermi ISA

 1 code for sm_20
 2         Function : _Z10warp_scan2PKiPiS1_
 3 /*0000*/     /*0x00005de428004404*/     MOV R1, c [0x1] [0x100];
 4 /*0008*/     /*0x84001c042c000000*/     S2R R0, SR_Tid_X;
 5 /*0010*/     /*0x7c009c036800c000*/     LOP.AND R2, R0, 0x1f;
 6 /*0018*/     /*0x08201e036000c000*/     SHL.W R0, R2, 0x2;
 7 /*0020*/     /*0x10219c435000c000*/     IMUL.U32.U32.HI R6, R2, 0x4;
 8 /*0028*/     /*0x80009c0348014000*/     IADD R2.CC, R0, c [0x0] [0x20];
 9 /*0030*/     /*0x000fdc85c9000000*/     STS [R0], RZ;
10 /*0038*/     /*0x9060dc4348004000*/     IADD.X R3, R6, c [0x0] [0x24];
11 /*0040*/     /*0x00215c8584000000*/     LD.E R5, [R2];
12 /*0048*/     /*0x00015c85c9000001*/     STS [R0+0x40], R5;
13 /*0050*/     /*0xf0009c85c1000000*/     LDS R2, [R0+0x3c];
14 /*0058*/     /*0x1420dc0348000000*/     IADD R3, R2, R5;
15 /*0060*/     /*0x0000dc85c9000001*/     STS [R0+0x40], R3;
16 /*0068*/     /*0xe0009c85c1000000*/     LDS R2, [R0+0x38];
17 /*0070*/     /*0x0c20dc0348000000*/     IADD R3, R2, R3;
18 /*0078*/     /*0x0000dc85c9000001*/     STS [R0+0x40], R3;
19 /*0080*/     /*0xc0009c85c1000000*/     LDS R2, [R0+0x30];
20 /*0088*/     /*0x0c20dc0348000000*/     IADD R3, R2, R3;
21 /*0090*/     /*0x0000dc85c9000001*/     STS [R0+0x40], R3;
22 /*0098*/     /*0x80009c85c1000000*/     LDS R2, [R0+0x20];
23 /*00a0*/     /*0x0c21dc0348000000*/     IADD R7, R2, R3;
24 /*00a8*/     /*0xa0009c0348014000*/     IADD R2.CC, R0, c [0x0] [0x28];
25 /*00b0*/     /*0x0001dc85c9000001*/     STS [R0+0x40], R7;
26 /*00b8*/     /*0x00011c85c1000000*/     LDS R4, [R0];

Presenter Notes

Fermi ISA-2

1 /*00c0*/     /*0xb060dc4348004000*/     IADD.X R3, R6, c [0x0] [0x2c];
2 /*00c8*/     /*0xc0021c0348014000*/     IADD R8.CC, R0, c [0x0] [0x30];
3 /*00d0*/     /*0xd0625c4348004000*/     IADD.X R9, R6, c [0x0] [0x34];
4 /*00d8*/     /*0x1c41dc0348000000*/     IADD R7, R4, R7;
5 /*00e0*/     /*0x14711d0348000000*/     IADD R4, R7, -R5;
6 /*00e8*/     /*0x0021dc8594000000*/     ST.E [R2], R7;
7 /*00f0*/     /*0x0001dc85c9000001*/     STS [R0+0x40], R7;
8 /*00f8*/     /*0x00811c8594000000*/     ST.E [R8], R4;
9 /*0100*/     /*0x00001de780000000*/     EXIT;

Kernel mucho más pequeño.
No hay comparaciones ni instrucciones predicadas.

Presenter Notes

Análisis de performance

Ejecución sobre vector de 1 GB = (1<<28) x 4 bytes.

Versión predicada `warp_scan1`

1 $ make && ./scan-warp 28 && grep scan cuda_profile_0.log 
2 nvcc -O3 -arch=compute_20 -code=sm_20 -I/opt/cudasdk/C/common/inc -o scan-warp.o -c scan-warp.cu
3 nvcc -O3 -arch=compute_20 -code=sm_20 -I/opt/cudasdk/C/common/inc -o scan-warp scan-warp.o 
4 method=[ _Z10warp_scan1PKiPiS1_ ] gputime=[ 28884.385 ] cputime=[ 4.000 ] occupancy=[ 1.000 ]

Versión sin predicados `warp_scan2`

1 $ make && ./scan-warp 28 && grep scan cuda_profile_0.log 
2 nvcc -O3 -arch=compute_20 -code=sm_20 -I/opt/cudasdk/C/common/inc -o scan-warp.o -c scan-warp.cu
3 nvcc -O3 -arch=compute_20 -code=sm_20 -I/opt/cudasdk/C/common/inc -o scan-warp scan-warp.o 
4 method=[ _Z10warp_scan2PKiPiS1_ ] gputime=[ 28664.801 ] cputime=[ 4.000 ] occupancy=[ 1.000 ]

Presenter Notes

La optimización no sirve!

nvcc-4.1 cambió radicalmente con la incorporación de LLVM como backend-compiler.

Lo que antes tenía sentido en nvcc-3.x, ahora no.
Los compiladores para GPUs está ganando la madurez de los compiladores para CPUs.

Un experto me dijo

Yah don't worry about the ISA so much. I've learned it's best to ignore it if you want to actually get stuff done :)

Amen.

Presenter Notes

¿Qué ancho de banda usamos?

Reportado por un test

 1 $ /opt/cudasdk/C/bin/linux/release/bandwidthTest
 2 ...
 3  Device 0: Tesla C2070
 4  Quick Mode
 5 
 6  Host to Device Bandwidth, 1 Device(s), Paged memory
 7    Transfer Size (Bytes)    Bandwidth(MB/s)
 8    33554432         3757.4
 9  Device to Host Bandwidth, 1 Device(s), Paged memory
10    Transfer Size (Bytes)    Bandwidth(MB/s)
11    33554432         3323.7
12  Device to Device Bandwidth, 1 Device(s)
13    Transfer Size (Bytes)    Bandwidth(MB/s)
14    33554432         112411.3

Revisamos que la ECC no esté activa en la memoria:

1 $ nvidia-smi -q -i 0 | grep -i "Ecc Mode" -C 1
2     Ecc Mode
3         Current                 : Disabled

Lo que medimos

>>> (( 3*(1<<28)*4)/0.028664) / (1<<30)
104.66089868825007

Achalay! Performance casi pico!

(Las specs reportan 144GBps)

Presenter Notes

Block scan: multiscan

Hasta ahora podíamos hacer scan de hasta 32 valores.
Para saltar a un bloque de warps, usamos la técnica de multiscan.

· Scan inclusivo de los warps.
· Obtener el total de cada warp.
· Hacer un scan exclusivo de estos totales.
· Estos son los desplazamientos de cada warp.

ModernGPU Multiscan

(Sean Baxter, Modern GPU, Scan, 2011.)

Presenter Notes

Block scan

Obtención de coordenadas: tid = warp*WARP_SIZE+lane.

1 int tid = threadIdx.x;
2 int warp = tid/WARP_SIZE;
3 int lane = (WARP_SIZE-1) & tid;

Inclusive scan de cada uno de los warps que componen el bloque.

1 int x = values[tid];
2 scan[tid] = x;
3 
4 int sum = x;
5 #pragma unroll
6 for (int offset=1; offset<WARP_SIZE; offset*=2) {
7     if (lane>=offset) sum += scan[tid-offset];
8     scan[tid] = sum;
9 }

Sincronización de todos los warps dentro de un bloque para que estén listos los totales.

1 __syncthreads();

Presenter Notes

Block scan-2

Exclusive scan de los totales de cada warp.

Notar que solo se usan NUM_WARPS hilos de todo el bloque, el resto queda idling.

 1 __shared__ volatile int totals[NUM_WARPS];
 2 if (tid < NUM_WARPS) {
 3     // Grab the block total for the tid'th block. This is the last element
 4     // in the block's scanned sequence.
 5     int total = scan[tid*WARP_SIZE + WARP_SIZE-1];
 6     totals[tid] = total;
 7 
 8     int totalsSum = total;
 9     #pragma unroll
10     for(int offset=1; offset<NUM_WARPS; offset*=2) {
11         if (tid>=offset) totalsSum += totals[tid-offset];
12         totals[tid] = totalsSum;
13     }
14 
15     // Subtract total from totalsSum for an exclusive scan.
16     totals[tid] = totalsSum - total;
17 }

El patrón es el mismo que el bloque scan anterior, solo cambia el valor que se escribe para que sea exclusivo.

Presenter Notes

Block scan-3

Sincronización para que estén todos los totales.
(¿Porqué no se puede sacar esta sincronización si el warp que calcula los totales va todo síncrono?)

1 // Synchronize to make the block scan available to all warps.
2 __syncthreads();

Como hay una variable (registro) sum por cada hilo, acumulo el desplazamiento de totals en paralelo en todo el bloque.

1 // Add the block scan to the inclusive sum for the block.
2 sum += totals[warp];
3 
4 // Write the inclusive and exclusive scans to global memory.
5 inclusive[i] = sum;
6 exclusive[i] = sum - x;

Presenter Notes

Shared memory bank conflicts

Para tener tan alto ancho de banda (approx. 1.6GBps), la shared memory de la arquitectura Fermi se divide en

Bancos de palabras de 32 bits.
Un total de 32 bancos.

Problema

Los accesos a un mismo banco por múltiples hilos dentro de un warp se serializan.

Revisar bien como es la organización de shared memory para:

Tesla (GT2xx, CC 1.y)
Fermi (GF1xx, CC 2.y)
Kepler (GK1xx, CC 3.y)

Hay muchos cambios de una generación a otra. Revisar:

NVIDIA, NVIDIA CUDA C Programming Guide Version 4.2, Section F.3, F.4, F.5.

Presenter Notes

Ejemplo

1 float data = shared[lane*STRIDE];

Supongamos que lane (número de hilo dentro del warp) varía en [0..31].
Veamos a que banco accede según el valor de STRIDE.

STRIDE=1, 0% conflictos en warp, todo en paralelo.

lane  bank
0     0
1     1
...
31    31

STRIDE=32, 100% conflictos en warp, todo serializado.

lane  bank
0     0
1     0
...
31    0

STRIDE=33, 0% conflictos en warp, todo en paralelo.

lane  bank
0     0
1     1
...
31    31

Presenter Notes

Conflictos en `scan-block`

Obtener la suma total de cada warp, en la etapa de reducción (2da etapa) puede producir serialización en un warp.

1 int total = scan[tid*WARP_STRIDE + WARP_SIZE-1];

Como tid varía en [0..NUM_WARPS] y solo ejecuta 1 warp dentro del bloque:

Si WARP_STRIDE = WARP_SIZE, tenemos 100% de conflictos.
Lo solucionamos haciendo que WARP_STRIDE sea coprimo de WARP_SIZE, por ejemplo WARP_SIZE+1=33.

Presenter Notes

Mediciones

Para N=(1<<28), tiempos expresados en ns.

Block  #BlocksPerSM  #Totals  Stride       Time
256    6             8        WARP_SIZE    40218.816
256    6             8        WARP_SIZE+1  41189.566

512    3             16       WARP_SIZE    45418.016
512    3             16       WARP_SIZE+1  43541.055

1024   1             32       WARP_SIZE    83142.914
1024   1             32       WARP_SIZE+1  67737.953

Al aumentar el tamaño del bloque bajan la cantidad de bloques por SM y la performance global cae.
Sin embargo, un bloque grande genera muchos accesos en conflicto a la memoria compartida y se nota la influencia del padding para evitar el bank conflict.
Solo es un 33/32 = 3% más de almacenamiento.
Bajamos a 74.59 GBPs.
>>> (( 3*(1<<28)*4)/0.040218) / (1<<30)
74.59346561241236

Presenter Notes

Global scan

Presenter Notes

Global scan

(ToDo)

GPU Gems 3, Chp39, Fig.6

(M. Harris, S. Sengupta, J. D. Owens, Parallel Prefix Sum (Scan) with CUDA, GPU Gems 3, Chapter 39, NVIDIA, 2007)

Presenter Notes

Ballot scan

Presenter Notes

Stream Compaction

Uno de los usos clásicos es la aplicación de un filtro a los datos según un predicado.

GPU Gems 3, Chp39, Fig.9

Esto es simplemente una etapa de exclusive scan seguida de un scatter.

GPU Gems 3, Chp39, Fig.10

(M. Harris, S. Sengupta, J. D. Owens, Parallel Prefix Sum (Scan) with CUDA, GPU Gems 3, Chapter 39, NVIDIA, 2007)

Presenter Notes

Ballot

Existen instrucciones de la Fermi ISA que hacen el exclusive scan de un arreglo binario de un warp en tan solo 3 operaciones, manejando toda la comunicación intra-warp de manera automática.

`uint __ballot(uint pred)`

Retorna en todos los hilos del warp un entero de 32 bits, donde cada bit está seteado a 1 sii pred es válido en ese hilo.

Notar que está haciendo un mapeo + broadcast en todo el warp.
Similar a MPI_Allreduce para dar un ejemplo en otro paradigma de paralelismo.

(Table 43, PARALLEL THREAD EXECUTION ISA VERSION 3.0, NVIDIA Inc., 2012.)

Presenter Notes

Ballot scan en un warp

La magia de __ballot mas un par de operaciones bitwise hacen todo el trabajo.

1 uint flag = -1.0f != val;
2 
3 uint bits = __ballot(flag);
4 
5 uint mask = bfi(0, 0xffffffff, 0, tid);
6 uint exc = __popc(mask & bits);
7 uint total = __popc(bits);

Otras operaciones bitwise

__popc(uint a): cuenta la cantidad de bits prendidos en a.
uint bfi(uint a, uint b, uchar pos, uchar len): inserta a en la posición pos de b, len bits.

Entonces en cada hilo exc tiene la suma prefija y total cuantos hilos cumplen con el predicado.

(Sean Baxter, Modern GPU, Scan, 2011.)

Presenter Notes

Ballot, el código

Se necesita inline assembler, porque bfi no tiene un intrinsic asociado.

1 __device__ __forceinline__ uint bfi(uint x, uint y, uint bit, uint numBits) {
2     uint ret;
3     asm("bfi.b32 %0, %1, %2, %3, %4;" :
4         "=r"(ret) : "r"(y), "r"(x), "r"(bit), "r"(numBits));
5     return ret;
6 }

(USING INLINE PTX ASSEMBLY IN CUDA, NVIDIA Inc., 2011.)

Presenter Notes

Ballot, el código-2

 1 __global__ void BallotScanWarp(const float *dataIn_global,
 2                                float *dataOut_global, uint *countOut_global) {
 3     uint tid = threadIdx.x;
 4     float val = dataIn_global[tid];
 5 
 6     uint flag = -1.0f != val;
 7 
 8     uint bits = __ballot(flag);
 9 
10     uint mask = bfi(0, 0xffffffff, 0, tid);
11     uint exc = __popc(mask & bits);
12     uint total = __popc(bits);
13 
14     __shared__ volatile float shared[WARP_SIZE];
15     if (flag)
16         shared[exc] = val;
17     val = shared[tid];
18 
19     if (tid < total)
20         dataOut_global[tid] = val;
21 
22     if (!tid) *countOut_global = total;
23 }

Presenter Notes

Performance

(ToDo)

Presenter Notes

Bibliografía

NVIDIA, NVIDIA CUDA C Programming Guide Version 4.2, 2012.
M. Harris, S. Sengupta, J. D. Owens, Parallel Prefix Sum (Scan) with CUDA, GPU Gems 3, Chapter 39, NVIDIA, 2007.
Sean Baxter, Modern GPU, Scan, 2011.
G. E. Blelloch, Prefix Sums and Their Applications, Technical Report CMU-CS-90-190, School of Computer Science, Carnegie Mellon University, 1990.
NVIDIA Inc., PARALLEL THREAD EXECUTION ISA VERSION 3.0, 2012.
NVIDIA Inc., USING INLINE PTX ASSEMBLY IN CUDA, 2011.

Presenter Notes

La clase que viene

Algo.

Table of Contents	t
Exposé	ESC
Full screen slides	e
Presenter View	p
Source Files	s
Slide Numbers	n
Toggle screen blanking	b
Show/hide slide context	c
Notes	2
Help	h

Bibliografía

Problemas

Versión predicada warp_scan1

Versión sin predicados warp_scan2

Un experto me dijo

Reportado por un test

Lo que medimos

Problema

uint __ballot(uint pred)

Otras operaciones bitwise

Table of Contents

Help

Versión predicada `warp_scan1`

Versión sin predicados `warp_scan2`

`uint __ballot(uint pred)`