python - Spark "reduceByKey" calls reducer more times than expected


I have a DataFrame of 4 rows that all share the same key, like this:

row_no  id   age  time
1       abc  70   1524299530
2       abc  69   1524299528
3       abc  68   1524299526
4       abc  67   1524299524
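For reproducibility, here is a minimal sketch of how such a DataFrame can be built. The local[4] master matches the multi-processor setting described below; the SparkSession setup itself is an assumption, not part of the original question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[4]').getOrCreate()

# Four rows, all with the same key 'abc'.
df = spark.createDataFrame(
    [(1, 'abc', 70, 1524299530),
     (2, 'abc', 69, 1524299528),
     (3, 'abc', 68, 1524299526),
     (4, 'abc', 67, 1524299524)],
    ['row_no', 'id', 'age', 'time'])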

I then tried to call reduceByKey on the DataFrame's underlying RDD like this:

new_rdd = df.rdd \
    .map(lambda row: (row['id'], [row['age'], row['time']])) \
    .reduceByKey(some_reducer)

In some_reducer, for the purpose of testing, I simply return the previous object. I used print to trace the invocations of the reducer and found that Spark invoked the reducer 4 times, namely on (1, 2), (3, 4), (1, 3) and (1, 3). In other words, the reducer was called on rows 1 and 3 twice. I am running Spark locally with 4 processors. When I ran the same job with 1 processor, the reducer was called only 3 times, as expected, on (1, 2), (3, 4) and (1, 3). It must have something to do with how Spark partitions the data, but I still find the behavior hard to make sense of. Can anyone provide an explanation?
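A minimal sketch of what some_reducer looks like for this trace; the exact print format is an assumption, but the behavior (print each call, return the previous object) is as described above.

def some_reducer(a, b):
    # Trace every invocation, then keep only the previous object.
    print('reducer called on', a, 'and', b)
    return a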

Update: I did a more constructive test by sticking an integer column of 1 on each row and making the reducer lambda a, b: a + b. I observed that in multi-processor mode Spark did 4 additions in total: 1 + 1, 1 + 1, 2 + 2 and 2 + 2. However, the final result was still 4, so in some way Spark discarded the duplicated reducing of 2 + 2. The question is why there is duplicated reducing in the first place, and how Spark deals with it.
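For completeness, here is a minimal sketch of that follow-up test; the constant column name 'one' and the use of pyspark.sql.functions.lit are my assumptions, the tracing reducer mirrors the description above.

from pyspark.sql import functions as F

def add(a, b):
    # Trace every addition the reducer performs.
    print('adding', a, '+', b)
    return a + b

new_rdd = df.withColumn('one', F.lit(1)).rdd \
    .map(lambda row: (row['id'], row['one'])) \
    .reduceByKey(add)

print(new_rdd.collect())  # [('abc', 4)] despite the duplicated 2 + 2 calls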

