sql - Scala/Apache Spark Converting DataFrame column values and type, multiple when otherwise -
i have primary sql table reading spark , modifying write cassandradb. have working implementation converting gender 0, 1, 2, 3 (integers) "male", "female", "trans", etc (strings). though below method work, seems inefficient make seperate array mappings dataframe, join main table/dataframe, remove, rename, etc.
i have seen:
.withcolumn("gender", when(col("gender) === 1, "male").otherwise("female")
that allow me continue method chaining on primary table have not been able working more 2 options. there way this? have around 10 different columns on table each need own custom conversion created. since code processing tbs of data, there less repetitive , more efficient way accomplish this. in advance!
case class gender(tmpid: int, tmpgender: string) private def creategenderdf(spark:sparksession): dataframe = { import spark.implicits._ seq( gender(1, "male"), gender(2, "female"), gender(777, "prefer not answer") ).todf } private def createpersonsdf(spark: sparksession): dataframe = { val genderdf = creategenderdf(spark) genderdf.show() val personsdf: dataframe = spark.read .format("csv") .option("header", "true") .option("inferschema", "true") .option("delimiter", "\t") .load(datapath + "people.csv") .withcolumnrenamed("id", "id") .withcolumnrenamed("name_first", "firstname") val personsdf1: dataframe = personsdf .join(genderdf, personsdf("gender") === genderdf("tmpid"), "leftouter") val personsdf2: dataframe = personsdf1 .drop("gender") .drop("tmpid") .withcolumnrenamed("tmpgender", "gender") }
you can use nested when
function eliminate need of creating genderdf
, join
, drop
, rename
etc. example can following
import org.apache.spark.sql.functions._ import org.apache.spark.sql.types.stringtype personsdf.withcolumn("gender", when(col("gender") === 1, "male").otherwise(when(col("gender") ===2, "female").otherwise("prefer not answer")).cast(stringtype))
you can add more when
function in above nested structure , can repeate same other 10 columns well.
Comments
Post a Comment