pyspark - How to make different sums over the same row in Spark
I have a Spark DataFrame with numeric columns. I want to perform several aggregation operations on these columns, creating a new column for each function, some of which may be user defined.
The easy solution is to use DataFrame and withColumn. For instance, if I wanted to calculate the mean (by hand) and a function my_function on fields field_1 and field_2, I would do:
df = df.withColumn("mean", (df["field_1"] + df["field_2"]) / 2)
df = df.withColumn("foo", my_function(df["field_1"], df["field_2"]))
My doubt is about efficiency. Each of the two calls above scans the whole DataFrame, while a smarter approach would calculate both results in one single scan.
Any hint on how to do that?

Thanks,
Mauro
TL;DR You're trying to solve a problem that doesn't exist.
SQL transformations are lazy and declarative. A series of operations is converted into a logical execution plan, and then into a physical execution plan. At the first stage the Spark optimizer has the freedom to reorder, combine or remove parts of the plan. You have, however, to distinguish between two cases:
- Python UDF
- SQL expression
The first requires a separate conversion to a Python RDD and cannot be combined with native processing. The second one is processed natively using generated code.
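As an illustrative sketch of the two cases (the DataFrame and column names are the ones from the question; the UDF body is made up for the example):

from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType

# Case 1: a Python UDF - each row is serialized, sent to a Python worker,
# evaluated there, and sent back; the optimizer cannot fuse this with
# native processing.
my_function = udf(lambda a, b: float(a) * float(b), DoubleType())
df = df.withColumn("foo", my_function(col("field_1"), col("field_2")))

# Case 2: a SQL expression - compiled into generated code and evaluated
# entirely inside the JVM.
df = df.withColumn("mean", (col("field_1") + col("field_2")) / 2)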
Once you request the results, the physical plan is converted into stages and executed.
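You can check this yourself. A minimal sketch, assuming df is the DataFrame from the question, is to look at the plan Spark actually produces:

# Build both derived columns, then print the parsed, analyzed, optimized
# and physical plans. For native expressions the two withColumn calls are
# typically collapsed into a single Project, i.e. a single pass over the data.
df = df.withColumn("mean", (df["field_1"] + df["field_2"]) / 2)
df = df.withColumn("sum", df["field_1"] + df["field_2"])
df.explain(True)

Either way, the underlying data is only read when an action is triggered, not once per withColumn call.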