scala - Spark DataFrame: group by key and add a new indicator column
I need to group by the "key" column and check whether the "type_code" column has both the "pl" and "jl" values for that key. If it does, I need to add an indicator column set to "y", otherwise "n".
Example:
// input values
val values = List(List("66","pl"), List("67","jl"), List("67","pl"), List("67","po"), List("68","jl"), List("68","po")).map(x => (x(0), x(1)))

import spark.implicits._

// create the DataFrame
val cmc = values.toDF("key","type_code")
cmc.show(false)

------------------------
key |type_code |
------------------------
66  |pl        |
67  |jl        |
67  |pl        |
67  |po        |
68  |jl        |
68  |po        |
-------------------------

Expected output:
For each "key", if its "type_code" values include both pl and jl, the indicator is y, else n.
-----------------------------------------------------
key |type_code | indicator
-----------------------------------------------------
66  |pl        | n
67  |jl        | y
67  |pl        | y
67  |po        | y
68  |jl        | n
68  |po        | n
---------------------------------------------------

For example:
67 has both pl and jl - "y"
66 has pl but no jl - "n"
68 has jl but no pl - "n"
One option:
1) collect the type_code values of each key into a list;
2) check whether the list contains both specific strings;
3) flatten the list back into rows with explode:
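import org.apache.spark.sql.functions.{array_contains, collect_list, explode, when}

(cmc.groupBy("key")
  // 1) collect every type_code of a key into one array
  .agg(collect_list("type_code").as("type_code"))
  // 2) "y" only when the array contains both "pl" and "jl"
  .withColumn("indicator",
    when(array_contains($"type_code", "pl") && array_contains($"type_code", "jl"), "y").otherwise("n"))
  // 3) explode the array back into one row per type_code
  .withColumn("type_code", explode($"type_code"))).show

+---+---------+---------+
|key|type_code|indicator|
+---+---------+---------+
| 68|       jl|        n|
| 68|       po|        n|
| 67|       jl|        y|
| 67|       pl|        y|
| 67|       po|        y|
| 66|       pl|        n|
+---+---------+---------+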
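A possible variation, not part of the answer above: the same check can be sketched with a window partitioned by "key", which attaches the indicator to every existing row without collapsing and re-exploding them. The object name `IndicatorByWindow`, the helper `withIndicator`, and the local `SparkSession` setup are assumptions made for illustration; the snippet also assumes the spark-sql dependency is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{array_contains, col, collect_set, when}

// Hypothetical object and helper names, used only for this sketch.
object IndicatorByWindow {

  // Adds an "indicator" column: "y" when the row's key has both "pl" and "jl"
  // among its type_code values, "n" otherwise. Rows are not collapsed.
  def withIndicator(df: DataFrame): DataFrame = {
    val byKey = Window.partitionBy("key")
    // Distinct type_code values seen for the current row's key
    val codes = collect_set(col("type_code")).over(byKey)
    df.withColumn(
      "indicator",
      when(array_contains(codes, "pl") && array_contains(codes, "jl"), "y").otherwise("n")
    )
  }

  def main(args: Array[String]): Unit = {
    // Assumed local session for a standalone run; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("indicator-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the question
    val cmc = Seq(
      ("66", "pl"), ("67", "jl"), ("67", "pl"),
      ("67", "po"), ("68", "jl"), ("68", "po")
    ).toDF("key", "type_code")

    withIndicator(cmc).orderBy("key", "type_code").show(false)
    spark.stop()
  }
}
```

On the sample data this should mark the three rows of key 67 with "y" and the others with "n", matching the expected output in the question. Whether it performs better than the collect_list/explode version depends on the data and is not something this sketch measures.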