cuda - Using local/shared memory as a cache for global
I have an image processing kernel that uses a buffer of flags that is too large to fit into local memory. The flags are accessed in a predictable raster pattern (upper left to lower right).
My idea is to store the flags in global memory and use local memory as a cache for global. So, as I progress along the raster pattern, I want to read flags from global into local, do some processing, then write the flags back to global. But I want to hide the latency involved.
So, suppose I access my image as a series of locations: a1, a2, a3, ... I want to do the following:
- fetch a1 flags
- fetch a2 flags
- while a2 flags are being fetched, process the a1 location and store back to global memory
- fetch a3 flags
- while a3 flags are being fetched, process the a2 location and store back to global memory
- etc.
How should I structure my code to ensure that the latency is hidden? Do I need to use vload/vstore to do this? Or will the GPU hardware do the latency hiding automatically?
The key is to make sure your reads are coalesced - that's the only way to get peak memory bandwidth. Then, keep the kernel complexity low enough and the occupancy high enough to ensure that all compute is hidden behind memory access. At that point you will be running as fast as possible.
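As a rough illustration, here is a minimal CUDA sketch of that advice. It is a sketch under assumptions, not a definitive implementation: the names `process_flag`, `update_flags` and `run_demo` are hypothetical, the flags are assumed to be one byte each, and the per-location work is replaced by a trivial bit flip. Consecutive threads read consecutive flags, so the accesses coalesce, and each thread issues the load for its next location before it processes the current one, which mirrors the fetch/process schedule in the question. Because each flag is used by only one thread here, the sketch stages values in registers rather than in shared memory.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-location update; stands in for the real processing step.
__device__ unsigned char process_flag(unsigned char f) { return f ^ 0x1; }

// Grid-stride loop over a raster-ordered flag buffer.
// Thread t of the grid touches elements t, t + stride, t + 2*stride, ...
// so neighbouring threads always access neighbouring bytes (coalesced).
__global__ void update_flags(unsigned char* flags, int n)
{
    const int stride = blockDim.x * gridDim.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    unsigned char cur = flags[i];             // fetch the first location
    for (; i < n; i += stride) {
        const int next = i + stride;
        unsigned char nxt = 0;
        if (next < n) nxt = flags[next];      // issue the fetch for the next location
        flags[i] = process_flag(cur);         // process and store the current one
        cur = nxt;
    }
}

// Host-side check: runs the kernel and verifies every flag was updated once.
bool run_demo(int n)
{
    std::vector<unsigned char> in(n), out(n);
    for (int i = 0; i < n; ++i) in[i] = static_cast<unsigned char>(i & 0xFF);

    unsigned char* d_flags = nullptr;
    if (cudaMalloc(&d_flags, n) != cudaSuccess) return false;
    cudaMemcpy(d_flags, in.data(), n, cudaMemcpyHostToDevice);

    // 64 blocks of 256 threads is an arbitrary choice; many resident warps
    // let the scheduler switch to ready work while other warps wait on memory.
    update_flags<<<64, 256>>>(d_flags, n);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    cudaMemcpy(out.data(), d_flags, n, cudaMemcpyDeviceToHost);
    cudaFree(d_flags);

    for (int i = 0; ok && i < n; ++i) ok = (out[i] == (in[i] ^ 0x1));
    return ok;
}
```

Calling `run_demo(1 << 20)` from a `main` function should return `true` on a working CUDA device. The explicit prefetch is optional: with enough warps in flight, the hardware scheduler already overlaps one warp's memory wait with another warp's compute, so the plain load, process, store version of the loop may well perform the same.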