scala - Only keep a limited number of elements per key
At the moment I am trying to find a solution to the following problem:
After processing, I want to limit the number of values per key in a key-value RDD to a fixed number (for example 200).
My initial solution is a groupByKey, so that all elements with the same key end up in one partition, followed by a flatMapValues that takes the first 200 elements of the iterable.
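Roughly, the approach looks like this (just a sketch; `rdd` stands for my key-value RDD and 200 is the per-key limit):

    // groupByKey-based approach described above
    // `rdd` is the input RDD[(K, V)]; 200 is the per-key limit
    val limited = rdd
      .groupByKey()               // all values for a key are collected into one iterable
      .flatMapValues(_.take(200)) // keep only the first 200 values per key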
Although this solution works fine for smaller data, it seems inefficient and does not work when I want to process bigger data.
Does anyone have an idea how this can be pulled off more efficiently?
Thanks in advance!
If your data is dense enough (the number of values per key >> n, where n is the number of records to retain) you can optimize this as follows:
- take at most n elements per key in each partition,
- shuffle,
- combine until you have n elements per key and ignore the rest.
In the worst-case scenario this shuffles only 200 * number-of-partitions * number-of-keys records.
This can be implemented with combineByKey or aggregateByKey. Pseudocode below (you should use mutable collections to improve performance, and it is possible to improve mergeCombiners to avoid copying when the first buffer already has enough records):
    val rdd: RDD[(K, V)] = ??? // your key-value RDD; n is the per-key limit (e.g. 200)

    rdd.combineByKey(
      (x: V) => Vector(x),                                                    // start a buffer for a new key
      (acc: Vector[V], x: V) => if (acc.length < n) acc :+ x else acc,        // append only while under the limit
      (acc1: Vector[V], acc2: Vector[V]) => (acc1 ++ acc2).take(n)            // merge buffers, keep at most n
    )
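For example, one way to follow that advice (a sketch only, not tested code; `rdd` and `n` as above) is to use a mutable ArrayBuffer and skip the merge copy when the first buffer is already full:

    import scala.collection.mutable.ArrayBuffer

    rdd.combineByKey(
      (x: V) => ArrayBuffer(x),
      (acc: ArrayBuffer[V], x: V) => {
        if (acc.length < n) acc += x          // append in place instead of copying
        acc
      },
      (acc1: ArrayBuffer[V], acc2: ArrayBuffer[V]) => {
        if (acc1.length >= n) acc1            // first buffer already full: avoid the copy entirely
        else {
          acc1 ++= acc2.take(n - acc1.length) // take only as many as are still needed
          acc1
        }
      }
    )

An aggregateByKey version would look the same, with ArrayBuffer.empty[V] as the zero value and the same two merge functions.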