spark dataframe - Flatten Nested Struct in PySpark Array


Given a schema like:

root
 |-- first_name: string
 |-- last_name: string
 |-- degrees: array
 |    |-- element: struct
 |    |    |-- school: string
 |    |    |-- advisors: struct
 |    |    |    |-- advisor1: string
 |    |    |    |-- advisor2: string

how can I get a schema like:

root
 |-- first_name: string
 |-- last_name: string
 |-- degrees: array
 |    |-- element: struct
 |    |    |-- school: string
 |    |    |-- advisor1: string
 |    |    |-- advisor2: string

Currently, I explode the array, flatten the structure by selecting advisors.*, group by first_name, last_name, and rebuild the array with collect_list, as sketched below. I'm hoping there's a cleaner/shorter way to do this; right now there's a lot of painful field renaming and other stuff I don't want here. Thanks!
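For reference, a minimal sketch of that explode/collect_list approach (assuming the dataframe is df1 and that first_name/last_name uniquely identify a person):

from pyspark.sql import functions as F

# Explode the array so each degree becomes its own row
exploded = df1.select(
    "first_name", "last_name",
    F.explode("degrees").alias("degree")
)

# Flatten the nested advisors struct, then rebuild the array per person
flattened = (
    exploded
    .select(
        "first_name", "last_name",
        F.struct(
            F.col("degree.school").alias("school"),
            F.col("degree.advisors.advisor1").alias("advisor1"),
            F.col("degree.advisors.advisor2").alias("advisor2"),
        ).alias("degree")
    )
    .groupBy("first_name", "last_name")
    .agg(F.collect_list("degree").alias("degrees"))
)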

You can use a UDF to change the datatype of the nested columns in the dataframe. Suppose you have read the dataframe as df1:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

def foo(data):
    # Pull advisor1/advisor2 up out of the nested advisors struct
    return list(map(lambda x: (x["school"],
                               x["advisors"]["advisor1"],
                               x["advisors"]["advisor2"]),
                    data))

struct = ArrayType(StructType([StructField("school", StringType()),
                               StructField("advisor1", StringType()),
                               StructField("advisor2", StringType())]))
udf_foo = udf(foo, struct)

df2 = df1.withColumn("degrees", udf_foo("degrees"))
df2.printSchema()

Output:

root
 |-- degrees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- school: string (nullable = true)
 |    |    |-- advisor1: string (nullable = true)
 |    |    |-- advisor2: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
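As a side note, on Spark 2.4+ the same reshaping can be done without a Python UDF by rewriting the array with the transform higher-order function via expr; this is a sketch under that version assumption, not part of the original answer:

from pyspark.sql import functions as F

# Rebuild each element of the degrees array, lifting the advisors
# fields up one level; no UDF or explode/group round trip needed.
df2 = df1.withColumn(
    "degrees",
    F.expr("""
        transform(degrees, d -> struct(
            d.school as school,
            d.advisors.advisor1 as advisor1,
            d.advisors.advisor2 as advisor2
        ))
    """)
)
df2.printSchema()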
