cuda - Using local/shared memory as a cache for global

March 15, 2011

I have an image processing kernel that uses a buffer of flags that is too large to fit in local memory. The flags are accessed in a predictable raster pattern (upper left to lower right-hand side).

My idea is to store the flags in global memory and use local memory as a cache for global. So, as I progress along the raster pattern, I want to read the flags from global to local, do some processing, then write the flags back to global. But I want to hide the latency involved.

So, suppose I access my image as a series of locations: a1, a2, a3, ... I want to do the following:

fetch the a1 flags
fetch the a2 flags
while the a2 flags are being fetched, process the a1 location and store it to global memory
fetch the a3 flags
while the a3 flags are being fetched, process the a2 location and store it to global memory
etc.

How should I structure my code to ensure that the latency is hidden? Do I need to use vload/vstore for this? Or will the GPU hardware do the latency hiding automatically?

The key is to make sure your reads are coalesced - that is how you get peak memory bandwidth. Then keep the kernel complexity low enough, and the occupancy high enough, to ensure that the compute is hidden behind the memory access. At that point you are running as fast as possible.
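As a rough illustration of that advice, here is a minimal CUDA sketch, not the original poster's kernel. It assumes one byte per flag and a flat, raster-ordered buffer. The names `process_flag`, `process_flags`, `run_on_device` and the tile size `TILE` are hypothetical, chosen only for this example. Each block stages one tile of flags in shared memory (what OpenCL calls local memory) with coalesced reads, processes it, and writes it back with coalesced stores. There is no hand-written prefetch: the sketch relies on many warps being resident, so the scheduler can run one warp's compute while another warp waits on memory.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 256;  // threads per block; an assumed value, tune per device

// Hypothetical stand-in for the real per-flag image processing step.
__device__ unsigned char process_flag(unsigned char f)
{
    return static_cast<unsigned char>(f ^ 1);
}

__global__ void process_flags(unsigned char* flags, int n)
{
    __shared__ unsigned char tile[TILE];  // per-block cache of one tile of flags

    // Grid-stride loop over tiles in raster order. The loop bounds depend only
    // on blockIdx, so every thread of a block reaches each __syncthreads().
    for (int base = blockIdx.x * TILE; base < n; base += gridDim.x * TILE) {
        int i = base + threadIdx.x;

        // Coalesced read: consecutive threads read consecutive addresses.
        if (i < n) tile[threadIdx.x] = flags[i];
        __syncthreads();

        // Work on the cached copy. A real kernel could also read
        // neighbouring entries of the tile at this point.
        unsigned char out = 0;
        if (i < n) out = process_flag(tile[threadIdx.x]);
        __syncthreads();  // the tile is overwritten in the next iteration

        // Coalesced write back to global memory.
        if (i < n) flags[i] = out;
    }
}

// Host helper (hypothetical): copies the flags to the device, runs the kernel
// in place, and copies the result back. Returns false on any CUDA error.
bool run_on_device(std::vector<unsigned char>& h)
{
    if (h.empty()) return true;
    unsigned char* d = nullptr;
    size_t bytes = h.size();
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);

    // Launching many blocks gives the scheduler other warps to run while
    // one warp is stalled on a global memory access.
    int blocks = static_cast<int>((h.size() + TILE - 1) / TILE);
    process_flags<<<blocks, TILE>>>(d, static_cast<int>(h.size()));

    cudaError_t err = cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err == cudaSuccess;
}
```

Note that `vload`/`vstore` are OpenCL built-ins and have no role in this CUDA sketch. Whether the latency is actually hidden depends on occupancy, which is limited by the shared memory and registers each block uses, so it is worth profiling the kernel before adding any explicit double buffering.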
