python - Spark "reduceByKey" calls reducer more times than expected
I have a DataFrame of 4 rows with the same key, like this:

    row_no  id   age  time
    1       abc  70   1524299530
    2       abc  69   1524299528
    3       abc  68   1524299526
    4       abc  67   1524299524
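For reproducibility, here is a minimal sketch of how such a DataFrame can be built in a local session (the local[4] master is an assumption that mirrors the 4-processor local mode described below):

    from pyspark.sql import SparkSession

    # local[4] mirrors the 4-processor local mode described below
    spark = SparkSession.builder.master('local[4]').getOrCreate()

    df = spark.createDataFrame(
        [(1, 'abc', 70, 1524299530),
         (2, 'abc', 69, 1524299528),
         (3, 'abc', 68, 1524299526),
         (4, 'abc', 67, 1524299524)],
        ['row_no', 'id', 'age', 'time'])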
Then I tried to call reduceByKey on the DataFrame like this:

    new_rdd = df.rdd \
        .map(lambda row: (row['id'], [row['age'], row['time']])) \
        .reduceByKey(some_reducer)
In some_reducer, for the purpose of testing, I simply return the previous object. I used print to trace the invocations of the reducer and found that Spark invoked the reducer 4 times, namely on (1, 2), (3, 4), (1, 3) and (1, 3). In other words, the reducer was called on rows 1 and 3 twice. I am running Spark locally with 4 processors. When I ran the same job with 1 processor, the reducer was called 3 times as expected: on (1, 2), (3, 4) and (1, 3). It must have something to do with how Spark partitions the data, but I still find it hard to make sense of the behavior. Can anyone provide an explanation for it?
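For concreteness, here is a sketch of the kind of tracing reducer described above (the real some_reducer is not shown in the question; this stand-in just prints its arguments and keeps the left object):

    def some_reducer(a, b):
        # print every invocation; in local mode this shows up in the console
        print('reducing %s with %s' % (a, b))
        return a  # for the test, keep the previous object

    new_rdd = df.rdd \
        .map(lambda row: (row['id'], [row['age'], row['time']])) \
        .reduceByKey(some_reducer)
    new_rdd.collect()  # reduceByKey is lazy; collect() actually runs the job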
Update: I did a more constructive test by sticking an integer column on each row and making the reducer lambda a, b: a + b. I observed that Spark did 4 additions in total when run in multi-processor mode: 1 + 1, 1 + 1, 2 + 2 and 2 + 2. However, the final result was still 4, so in a way Spark discarded the duplicated reduction of 2 + 2. The question is: why is there duplicated reducing in the first place, and how does Spark deal with it?
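A sketch of that second test, assuming the extra integer column is just the constant 1 so the per-key sum equals the row count:

    def traced_add(a, b):
        print('%s + %s' % (a, b))  # trace each addition
        return a + b

    total = df.rdd \
        .map(lambda row: (row['id'], 1)) \
        .reduceByKey(traced_add)
    print(total.collect())  # still [('abc', 4)] despite the duplicated trace lines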