Spark DataFrame - Flatten Nested Struct in PySpark Array
Given a schema like:

root
 |-- first_name: string
 |-- last_name: string
 |-- degrees: array
 |    |-- element: struct
 |    |    |-- school: string
 |    |    |-- advisors: struct
 |    |    |    |-- advisor1: string
 |    |    |    |-- advisor2: string
how can I get a schema like:

root
 |-- first_name: string
 |-- last_name: string
 |-- degrees: array
 |    |-- element: struct
 |    |    |-- school: string
 |    |    |-- advisor1: string
 |    |    |-- advisor2: string
Currently, I explode the array, flatten the structure by selecting advisors.*, group by first_name and last_name, and rebuild the array with collect_list. I'm hoping there's a cleaner/shorter way to do this; as it stands there's a lot of painful field renaming and other stuff I don't want here. Thanks!
You can use a udf to change the datatype of nested columns in a DataFrame. Suppose you have read your DataFrame as df1:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

def foo(data):
    return list(map(lambda x: (x["school"],
                               x["advisors"]["advisor1"],
                               x["advisors"]["advisor2"]), data))

struct = ArrayType(StructType([StructField("school", StringType()),
                               StructField("advisor1", StringType()),
                               StructField("advisor2", StringType())]))
udf_foo = udf(foo, struct)

df2 = df1.withColumn("degrees", udf_foo("degrees"))
df2.printSchema()
Output:
root
 |-- degrees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- school: string (nullable = true)
 |    |    |-- advisor1: string (nullable = true)
 |    |    |-- advisor2: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)