scala - Only keep a limited amount of elements per key


At the moment I am trying to find a solution for the following problem:

After processing, I am trying to limit the number of values per key in a key-value RDD to a fixed number (for example 200).

My initial solution is a groupByKey, which brings all elements with the same key into one partition, followed by a flatMapValues that takes the first 200 elements of the iterable.
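For reference, a minimal sketch of that initial approach (the helper name and the generic signature are mine, not from the question):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def limitPerKeyNaive[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)], n: Int): RDD[(K, V)] =
  rdd.groupByKey()             // all values for a key end up in one partition
    .flatMapValues(_.take(n))  // keep only the first n values of each iterable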

Although this solution works fine on smaller data, it seems inefficient and does not work when I want to process bigger data.

Does someone have an idea how this can be pulled off more efficiently?

Thanks in advance!

If the data is dense enough (number of values per key >> n, where n is the number of records to retain) you can optimize this:

  • Take at most n elements per key in each partition.
  • Shuffle.
  • Combine until you have n elements per key and ignore the rest.

In the worst case scenario you shuffle 200 * number-of-partitions * number-of-keys records.

This can be implemented with combineByKey or aggregateByKey. Example pseudocode (you should use mutable collections to improve performance, and you could possibly improve mergeCombiners to avoid copying if the first buffer already has enough records):

val rdd: RDD[(K, V)] = ???

rdd.combineByKey(
  x => Vector(x),
  (acc: Vector[V], x: V) => if (acc.length < n) acc :+ x else acc,
  (acc1: Vector[V], acc2: Vector[V]) => (acc1 ++ acc2) take n
)
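Following that hint, one possible concrete variant is a sketch with aggregateByKey and a mutable buffer; the helper name is illustrative and not part of the original answer:

import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def limitPerKey[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)], n: Int): RDD[(K, Seq[V])] =
  rdd.aggregateByKey(ArrayBuffer.empty[V])(
    // Partition-side: append only while the buffer is below the limit,
    // so at most n values per key leave each partition.
    (acc, v) => { if (acc.length < n) acc += v; acc },
    // Merge side: copy from the second buffer only until the first is full,
    // avoiding a full concatenation when acc1 already holds n records.
    (acc1, acc2) => {
      var i = 0
      while (acc1.length < n && i < acc2.length) { acc1 += acc2(i); i += 1 }
      acc1
    }
  ).mapValues(_.toSeq)

Calling limitPerKey(rdd, 200) would then return at most 200 values per key.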
