python - Spark "reduceByKey" calls reducer more times than expected
I have a DataFrame of 4 rows that all share the same key, like this:
row_no  id   age  time
1       abc  70   1524299530
2       abc  69   1524299528
3       abc  68   1524299526
4       abc  67   1524299524

Then I tried to call reduceByKey on the DataFrame like this:
new_rdd = df.rdd \
    .map(lambda row: (row['id'], [row['age'], row['time']])) \
    .reduceByKey(some_reducer)

In some_reducer, for the purpose of testing, I simply return the first object. I used print to trace the invocations of the reducer and found that Spark had invoked it 4 times, namely on (1, 2), (3, 4), (1, 3) and (1, 3). In other words, the reducer was called on rows 1 and 3 twice. I am running Spark locally with 4 processors. When I ran the same job with Spark on 1 processor, the reducer was called only 3 times, as expected: on (1, 2), (3, 4) and (1, 3). It must have something to do with how Spark partitions the data, but I still find the behavior hard to make sense of. Can anyone provide an explanation for this behavior?
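For reference, here is a minimal sketch of the whole test. The DataFrame construction, the "local[4]" master (mirroring 4 local processors), and the body of some_reducer are my reconstructions of what the question describes, not the original code:

from pyspark.sql import SparkSession

# "local[4]" approximates running Spark locally with 4 processors.
spark = SparkSession.builder \
    .master("local[4]") \
    .appName("reduce-trace") \
    .getOrCreate()

df = spark.createDataFrame(
    [(1, "abc", 70, 1524299530),
     (2, "abc", 69, 1524299528),
     (3, "abc", 68, 1524299526),
     (4, "abc", 67, 1524299524)],
    ["row_no", "id", "age", "time"])

def some_reducer(a, b):
    # Print a trace line for every invocation, then keep the first object.
    print("reducing", a, "and", b)
    return a

new_rdd = df.rdd \
    .map(lambda row: (row['id'], [row['age'], row['time']])) \
    .reduceByKey(some_reducer)

print(new_rdd.collect())

In local mode the executor threads share the driver's stdout, so the print calls inside the reducer show up directly on the console.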
Update: I did a more instructive test by sticking an integer column of 1s onto each row and making the reducer lambda a, b: a + b. I observed that, when run in multi-processor mode, Spark did 4 additions in total: 1 + 1, 1 + 1, 2 + 2 and 2 + 2. However, the final result was still 4, so Spark somehow discarded the duplicated reduction of 2 + 2. The question is why there is duplicated reducing in the first place, and how Spark deals with the duplicates.
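A sketch of this follow-up test, reusing the spark session and df from the sketch above (the traced add function and the literal 1 per row are my reconstruction of the described setup):

def add(a, b):
    # Print a trace line for every addition the reducer performs.
    print("adding", a, "+", b)
    return a + b

counts = df.rdd \
    .map(lambda row: (row['id'], 1)) \
    .reduceByKey(add)

print(counts.collect())  # expected: [('abc', 4)]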