Spark DataFrame - Flatten Nested Struct in PySpark Array
Given a schema like:

root
 |-- first_name: string
 |-- last_name: string
 |-- degrees: array
 |    |-- element: struct
 |    |    |-- school: string
 |    |    |-- advisors: struct
 |    |    |    |-- advisor1: string
 |    |    |    |-- advisor2: string
how can I get a schema like:

root
 |-- first_name: string
 |-- last_name: string
 |-- degrees: array
 |    |-- element: struct
 |    |    |-- school: string
 |    |    |-- advisor1: string
 |    |    |-- advisor2: string
Currently, I explode the array, flatten the structure by selecting advisors.*, group by first_name and last_name, and rebuild the array with collect_list. I'm hoping there's a cleaner/shorter way to do this; as it stands there's a lot of painful field renaming and other stuff I don't want here. Thanks!
You can use a udf to change the datatype of nested columns in a DataFrame. Suppose you have read your DataFrame as df1:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

def foo(data):
    return list(map(lambda x: (x["school"],
                               x["advisors"]["advisor1"],
                               x["advisors"]["advisor2"]), data))

struct = ArrayType(StructType([StructField("school", StringType()),
                               StructField("advisor1", StringType()),
                               StructField("advisor2", StringType())]))
udf_foo = udf(foo, struct)

df2 = df1.withColumn("degrees", udf_foo("degrees"))
df2.printSchema()
Output:
root
 |-- degrees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- school: string (nullable = true)
 |    |    |-- advisor1: string (nullable = true)
 |    |    |-- advisor2: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)