python - Spark "reduceByKey" calls reducer more times than expected
I have a DataFrame of 4 rows with the same key, like this:

```
row_no  id   age  time
1       abc  70   1524299530
2       abc  69   1524299528
3       abc  68   1524299526
4       abc  67   1524299524
```

I then called `reduceByKey` on the DataFrame's underlying RDD like this:

```python
new_rdd = df.rdd \
    .map(lambda row: (row['id'], [row['age'], row['time']])) \
    .reduceByKey(some_reducer)
```

In `some_reducer`, for the purpose of testing, I simply return the previous object. I used print statements to trace the invocations of the reducer and found that Spark invoked it 4 times, namely on (1, 2), (3, 4), (1, 3), and (1, 3) again. In other words, the reducer was called on rows 1 and 3 twice. I am running Spark locally with 4 processors. When I ran the job with only 1 processor, the reducer was called 3 times, as expected: on (1, 2), (3, 4), and (1, 3). It must have something to do with how Spark partitions the data, but I still find the behavior hard to make sense of. Can anyone provide an explanation?

Update: I did a more constructive test by adding an integer column to each row and making the reducer `lambda a, b: a + b`. I observed that Spark did 4 additions when run with multiple pr...
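To make the partitioning effect concrete without a Spark cluster, here is a pure-Python sketch of how `reduceByKey` works in two phases: a map-side combine within each partition, followed by a merge of the per-partition partial results. The function names (`reduce_by_key`, `reducer`) and the invocation-counting are my own illustration, not Spark's actual code; the exact invocation count in real Spark can be higher than this model predicts (as in the question's 4-call trace) because merging can also happen on both sides of the shuffle.

```python
from functools import reduce

calls = []  # records each reducer invocation, mirroring the print-tracing in the question

def reducer(a, b):
    # Hypothetical test reducer: log the call and return the "previous" object,
    # as described in the question.
    calls.append((a, b))
    return a

rows = [("abc", (70, 1524299530)),
        ("abc", (69, 1524299528)),
        ("abc", (68, 1524299526)),
        ("abc", (67, 1524299524))]

def reduce_by_key(partitions, f):
    """Simplified model of reduceByKey: combine within each partition,
    then merge the per-partition partials."""
    partials = {}
    # Phase 1: map-side combine -- reduce values per key within each partition.
    for part in partitions:
        combined = {}
        for k, v in part:
            combined[k] = f(combined[k], v) if k in combined else v
        for k, v in combined.items():
            partials.setdefault(k, []).append(v)
    # Phase 2: merge the partial results across partitions.
    return {k: reduce(f, vs) for k, vs in partials.items()}

# One partition (like local[1]): 3 invocations, all map-side.
calls.clear()
reduce_by_key([rows], reducer)
print(len(calls))  # 3

# Two partitions of two rows: 2 map-side combines + 1 merge = still 3,
# but now the merge step sees the winners of each partition (rows 1 and 3),
# which is why the (1, 3) pair shows up in the multi-processor trace.
calls.clear()
result = reduce_by_key([rows[:2], rows[2:]], reducer)
print(len(calls))  # 3

# The update's sum test: as long as the reducer is associative and
# commutative, the final result is independent of the partitioning.
ones = [("abc", 1)] * 4
assert reduce_by_key([ones], lambda a, b: a + b) == {"abc": 4}
assert reduce_by_key([ones[:1], ones[1:2], ones[2:], []], lambda a, b: a + b) == {"abc": 4}
```

The key point this model illustrates is that Spark never promises a fixed number of reducer invocations, only that the result is well-defined when the function is associative and commutative; tracing invocation counts is therefore partition-dependent by design.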