c++ - Cache-friendly way to collect results from multiple threads -
consider n
threads doing asynchronous tasks small result value double
or int64_t
. 8
result values can fit single cpu cache line. n
equal number of cpu cores.
on 1 hand, if allocate array of n
items, each double
or int64_t
, 8
threads share cpu cache line, seems inefficient.
on other hand, if allocate whole cache line each double
/int64_t
, receiver thread have fetch n
cache lines, each written different cpu core (except 1).
so there efficient solution scenario? cpu x86-64. solution in c++ preferred.
clarification 1: thread launch/exit overhead not big because thread pool used. it's synchronization on critical section.
clarification 2: parallel batches carry dependency. master thread can launch next batch of parallel computations after has collected , processed results of previous batch. because results of previous batch serve parameters of next batch.
update: may have misunderstood. looking fast turnarounds on many tiny batches of work? in case, you're better off each thread writing own cache line, or maybe group them in pairs. if each worker thread has exclusive access (mesi/mesif/moesi) write same cache line, serialize cores order.
having reader thread read results n threads lets cache misses happen in parallel.
i scatter , gather millions of such parallel calculations per second. in other words, head thread distributes work, launches worker threads, collects results, on it, , launches parallel computations again.
so have millions of results collect, 1 worker thread per core. each worker thread has produce ~100k results.
give each worker output array, stores consecutive results different tasks has finished. actual arrays might 4k entries long or something, synchronization let writer wrap around , reuse first half once reader has started on second half of thread's buffer.
when collector thread reads result 1 of arrays, bring cache line own l2/l1d caches, bringing 7 other results in same cache line (assuming usual case worker thread has filled 8 int64_t
slots , won't write cache line again group of tiny tasks).
or better, collect them in batches aligned cache lines, conflict misses don't evict cache line collector's l1d before gets it. (reduce chance of happening skewing result arrays different offset each thread, collector thread isn't reading n cache lines offset each other multiple of 4kib or something.)
if can use sentinel value in output arrays, that's ideal. if collector sees that, knows got ahead of worker , should check other threads. (or sleep if got through output arrays without finding new results).
otherwise also need current-output-position shared variables workers update (with release-store) after writing output array. (maybe batch these position-counter updates 1 per 8 array results. make sure pure atomic store, not += 8
. since producer thread 1 writes variable, silly have overhead of lock add
.)
this cause false sharing between worker threads if packed 1 array, , needs cached (not in uc or wc memory, worker thread can rewrite in-place efficiently). want each thread have own cache line these. collector have suffer penalty of reading n different cache lines (and suffering memory mis-speculation machine clears: what latency , throughput costs of producer-consumer sharing of memory location between hyper-siblings versus non-hyper siblings?)
actually, best option in case use 1 of 8 qwords in every cache line of output arrays "complete" flag or bitmap, collector thread can check see if 7 results in cache line ready.
if getting results between worker , collector threads main bottleneck, threading fine-grained. should break tasks more coarsely, or have worker threads of combining on multiple results produced, while they're still hot in l1d. that's much better bandwidth getting core through l3 or dram.
Comments
Post a Comment