pyspark - how to make different sum over the same line in Spark -


i have spark dataframe with, numeric columns. make several aggregationg operations on these columns creating new column each function, of may user defined.

the easy solution using dataframe , withcolumn. istance, if wanted calculate mean (by hand) , function my_function on fields field_1 , field_2 do:

df=df.withcolumn("mean",(df["field_1"]+df["field_2])/2) df=df.withcolumn("foo", my_function(df["field_1"],df["field_2])) 

my doubt efficiency. each of 2 above functions scans whole database while smarter approach calculate both results using 1 single scan.

any hint on how that?

thanks

mauro

tl;dr you're trying solve problem doesn't exist

sql transformations lazy , declarative. series of operations converted logical execution plan, , physical execution plan. @ first stage spark optimizer has freedom reorder, combine or remove part of plan. have however, distinguish between 2 cases:

  • python udf.
  • sql expression.

the first requires separate conversion python rdd. cannot combined native processing. second 1 processed natively using generated code.

once request results physical plan converted stages , executed.


Comments

Popular posts from this blog

angular - Ionic slides - dynamically add slides before and after -

Add a dynamic header in angular 2 http provider -

minify - Minimizing css files -