cuda - How to generate random number inside pyCUDA kernel?


I am using PyCUDA for CUDA programming. I need to use random numbers inside a kernel function. The cuRAND library doesn't work inside it (PyCUDA). Since there is a lot of work to be done on the GPU, generating the random numbers on the CPU and then transferring them to the GPU won't work; it would defeat the purpose of using the GPU.

Supplementary questions:

  1. Is there a way to allocate memory on the GPU using 1 block and 1 thread?
  2. I am using more than one kernel. Do I need to use multiple SourceModule blocks?

Despite what you assert in your question, PyCUDA has pretty comprehensive support for cuRAND. The GPUArray-based pycuda.curandom module has a direct interface to fill device memory using the host-side API (noting that the random generators run on the GPU in this case).
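As a minimal sketch of that host-side route, the snippet below uses `XORWOWRandomNumberGenerator` from `pycuda.curandom` to produce a GPUArray of uniform samples without writing any kernel code. The helper name `gpu_uniform` is hypothetical, and the imports are deferred into the function only so the file can be loaded on a machine without a GPU.

```python
import numpy as np


def gpu_uniform(n):
    """Return a GPUArray holding n float32 uniform samples generated on the device.

    Hypothetical helper for illustration; it is not part of PyCUDA.
    """
    # Deferred imports: a CUDA context is only created when the function is called.
    import pycuda.autoinit  # noqa: F401  (creates the CUDA context)
    from pycuda.curandom import XORWOWRandomNumberGenerator

    # The generator keeps its cuRAND states in device memory; the samples
    # are produced on the GPU and never pass through host memory.
    rng = XORWOWRandomNumberGenerator()
    return rng.gen_uniform((n,), np.float32)
```

Calling `gpu_uniform(10240)` leaves the data on the device; append `.get()` only when a host copy is actually needed.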

It is also perfectly possible to use the device-side API from cuRAND in PyCUDA kernel code. In your use case the trickiest part is allocating memory for the thread generator states. There are three choices: statically in code, dynamically using host-side memory allocation, and dynamically using device-side memory allocation. The following (very lightly tested) example illustrates the latter, seeing as you asked about it in your question:

    import numpy as np
    import pycuda.autoinit  # creates a CUDA context on import
    from pycuda.compiler import SourceModule
    from pycuda import gpuarray

    code = """
        #include <curand_kernel.h>

        const int nstates = %(NGENERATORS)s;
        // One generator state pointer per thread, filled in by initkernel.
        __device__ curandState_t* states[nstates];

        __global__ void initkernel(int seed)
        {
            int tidx = threadIdx.x + blockIdx.x * blockDim.x;

            if (tidx < nstates) {
                // Device-side allocation from the malloc heap.
                curandState_t* s = new curandState_t;
                if (s != 0) {
                    curand_init(seed, tidx, 0, s);
                }

                states[tidx] = s;
            }
        }

        __global__ void randfillkernel(float *values, int N)
        {
            int tidx = threadIdx.x + blockIdx.x * blockDim.x;

            if (tidx < nstates) {
                // Work on a local copy of the state, then store it back.
                curandState_t s = *states[tidx];
                for(int i=tidx; i < N; i += blockDim.x * gridDim.x) {
                    values[i] = curand_uniform(&s);
                }
                *states[tidx] = s;
            }
        }
    """

    N = 1024
    # no_extern_c=True compiles the kernels as C++, so they are looked up
    # by their mangled names below. Adjust arch to match your GPU.
    mod = SourceModule(code % { "NGENERATORS" : N }, no_extern_c=True, arch="sm_52")
    init_func = mod.get_function("_Z10initkerneli")
    fill_func = mod.get_function("_Z14randfillkernelPfi")

    seed = np.int32(123456789)
    nvalues = 10 * N
    # Run once: allocate and seed one generator state per thread.
    init_func(seed, block=(N,1,1), grid=(1,1,1))
    gdata = gpuarray.zeros(nvalues, dtype=np.float32)
    # Fill gdata with uniform random numbers using the stored states.
    fill_func(gdata, np.int32(nvalues), block=(N,1,1), grid=(1,1,1))

Here there is an initialization kernel that needs to be run once to allocate memory for the generator states and initialize them with the seed, and then a kernel that uses those states. You will need to be mindful of malloc heap size limits if you want to run a lot of threads, but those can be manipulated via the PyCUDA driver API interface.
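A rough sketch of adjusting that limit is shown below, using `Context.get_limit` and `Context.set_limit` with `limit.MALLOC_HEAP_SIZE` from `pycuda.driver`. The helper names, the 48-byte size assumed for `curandState_t`, and the 4x overhead factor are all assumptions made for illustration; check the actual state size for your CUDA version. CUDA's default device heap is 8 MB, and the limit has to be raised before the first kernel that uses device-side `new` or `malloc` is launched.

```python
# Assumption: sizeof(curandState_t) is 48 bytes (XORWOW state); verify on your setup.
ASSUMED_STATE_BYTES = 48


def required_heap_bytes(n_states, state_bytes=ASSUMED_STATE_BYTES, overhead=4):
    """Estimate the device heap needed for n_states generator states.

    The overhead factor is a hypothetical safety margin for per-allocation
    bookkeeping and alignment in the device allocator.
    """
    return n_states * state_bytes * overhead


def ensure_malloc_heap(n_states):
    """Raise the device malloc heap limit if the estimate exceeds the current one."""
    # Deferred imports: a CUDA context is only created when the function is called.
    import pycuda.autoinit  # noqa: F401  (creates the CUDA context)
    import pycuda.driver as drv

    needed = required_heap_bytes(n_states)
    current = drv.Context.get_limit(drv.limit.MALLOC_HEAP_SIZE)
    if needed > current:
        drv.Context.set_limit(drv.limit.MALLOC_HEAP_SIZE, needed)
    # Report the limit actually in effect after the call.
    return drv.Context.get_limit(drv.limit.MALLOC_HEAP_SIZE)
```

With these assumptions, the 1024 states in the example need well under the default heap, whereas a million states would not fit without raising the limit.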

