cuda - Using local/shared memory as a cache for global memory


I have an image processing kernel that uses a buffer of flags that is too large to fit into local memory. The flags are accessed in a predictable raster pattern (upper left to lower right).

My idea is to store the flags in global memory and use local memory as a cache for global. So, as I progress along the raster pattern, I want to read flags from global into local, do some processing, then write the flags back to global. But I want to hide the latency involved.

So, suppose I access my image as a series of locations: a1, a2, a3, ... I want to do the following:

  1. fetch the a1 flags
  2. fetch the a2 flags
  3. while the a2 flags are being fetched, process the a1 location and store it back to global memory
  4. fetch the a3 flags
  5. while the a3 flags are being fetched, process the a2 location and store it back to global memory
  6. etc.
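
Written by hand, the schedule above could look roughly like the following double-buffered CUDA sketch. It is only an illustration under several assumptions: the tile width `TILE`, the kernel name `process_flags_pipelined`, the launcher `launch_process_flags`, and the per-flag rule in `process_flag` (which looks at the left neighbour purely to give shared memory a purpose) are all hypothetical. Each block is assumed to own a contiguous run of tiles, and the block size is assumed to equal `TILE`.

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 256;  // assumed tile width; must equal blockDim.x

// Hypothetical processing rule: mark a flag if its left neighbour in the tile is set.
__device__ unsigned char process_flag(unsigned char own, unsigned char left)
{
    return left ? (own | 0x80) : own;
}

// Each block walks its own contiguous run of tiles in raster order.
__global__ void process_flags_pipelined(unsigned char* flags, int tiles_per_block)
{
    if (tiles_per_block <= 0) return;  // uniform across the block

    __shared__ unsigned char buf[2][TILE];  // two tiles: current and next
    const int tid = threadIdx.x;
    const size_t base = (size_t)blockIdx.x * tiles_per_block * TILE;

    // Step 1: fetch the first tile (a1).
    buf[0][tid] = flags[base + tid];
    __syncthreads();

    for (int t = 0; t < tiles_per_block; ++t) {
        const int cur = t & 1;
        const bool has_next = (t + 1 < tiles_per_block);

        // Issue the load for the next tile first; its value is not needed
        // until the end of the iteration, so the work below can overlap it.
        unsigned char next = 0;
        if (has_next)
            next = flags[base + (size_t)(t + 1) * TILE + tid];

        // Process the current tile from shared memory and write it back.
        const unsigned char left = (tid > 0) ? buf[cur][tid - 1] : 0;
        flags[base + (size_t)t * TILE + tid] = process_flag(buf[cur][tid], left);

        // Park the prefetched tile in the other buffer for the next iteration.
        if (has_next)
            buf[cur ^ 1][tid] = next;
        __syncthreads();
    }
}

// Hypothetical host launcher: one block per run of tiles.
cudaError_t launch_process_flags(unsigned char* d_flags, int num_blocks, int tiles_per_block)
{
    process_flags_pipelined<<<num_blocks, TILE>>>(d_flags, tiles_per_block);
    return cudaDeviceSynchronize();
}
```

Consecutive threads read consecutive bytes, so each tile load is contiguous. Whether the explicit prefetch actually helps, compared with letting the hardware switch between warps, depends on the kernel and should be measured.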

How should I structure my code to ensure that the latency is hidden? Do I need to use vload/vstore for this? Or will the GPU hardware do the latency hiding automatically?

The key is to make sure that your reads are coalesced - that's the only way to get peak memory bandwidth. Then, keep the kernel complexity low enough that occupancy is high enough to ensure that all of the compute is hidden behind memory access. Then you will be running as fast as possible.
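
As a rough illustration of that advice, the sketch below uses a grid-stride loop in which neighbouring threads touch neighbouring flags, so the accesses of a warp fall in one contiguous range, and it leaves latency hiding to the hardware scheduler. The names `update_flag`, `update_flags_coalesced`, `launch_update_flags`, and `resident_blocks_per_sm` are hypothetical, and the flag update itself is a placeholder. The occupancy query uses the CUDA runtime call `cudaOccupancyMaxActiveBlocksPerMultiprocessor`.

```cuda
#include <cuda_runtime.h>

// Placeholder per-flag update; stands in for the real image processing step.
__device__ unsigned char update_flag(unsigned char f) { return f ^ 0x01; }

// Grid-stride loop: thread i handles flags i, i + stride, i + 2*stride, ...
// Adjacent threads access adjacent addresses, so loads and stores coalesce.
__global__ void update_flags_coalesced(unsigned char* flags, size_t n)
{
    const size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        flags[i] = update_flag(flags[i]);
}

// Ask the runtime how many blocks of this kernel can be resident per SM.
// More resident warps give the scheduler more work to run while loads are in flight.
int resident_blocks_per_sm(int block_size)
{
    int blocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks, update_flags_coalesced, block_size, 0 /* dynamic shared memory */);
    return blocks;
}

// Hypothetical host launcher.
cudaError_t launch_update_flags(unsigned char* d_flags, size_t n, int num_blocks, int block_size)
{
    update_flags_coalesced<<<num_blocks, block_size>>>(d_flags, n);
    return cudaDeviceSynchronize();
}
```

With enough resident warps, the scheduler can run one warp's arithmetic while another warp waits on memory, which is the automatic latency hiding the question asks about.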

