Scala - How to create a SHA-1 hash for an entire row in an RDD/DataFrame
I have a DataFrame with a schema. The existing DataFrame has 50 columns. Now I want to add a new column to it. The new column is named "hashing_id", and its logic is hashing_id = sha1(row). How can I achieve this?
I tried the code below. The two methods below are inside a trait that is used by the main class. The trait extends Serializable.
    def addHashingKey(): DataFrame = {
      val sha1 = java.security.MessageDigest.getInstance("SHA-1")
      val encoder = new sun.misc.BASE64Encoder()
      // encoder.encode(sha1.digest(row.mkString.getBytes))
      createDataFrame(df.map(row => {
        Row.fromSeq(row.toSeq ++ encoder.encode(sha1.digest(row.mkString.getBytes)))
      }), df.schema.add("hashing_id", StringType))
    }

    def createDataFrame(rdd: RDD[Row], schema: StructType): DataFrame = {
      sqlContext.createDataFrame(rdd, schema)
    }
How can I achieve SHA-1 hashing using an RDD?
Could someone help me with this?
When I run this code, it throws the exception below:
    17/09/12 13:45:20 ERROR yarn.ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Task not serializable
    org.apache.spark.SparkException: Task not serializable
    Caused by: java.io.NotSerializableException: sun.misc.BASE64Encoder
    Serialization stack:
        - object not serializable (class: sun.misc.BASE64Encoder, value: sun.misc.BASE64Encoder@46c0813)
Can you try this? It seems to work for me in the few tests I've run:
    val newDF = sqlContext.createDataFrame(
      // hashCode() returns an Int, so convert it to match the StringType column
      rdd.map(x => Row(x.toSeq ++ Seq(x.toSeq.hashCode().toString): _*)),
      StructType(schema.iterator.toSeq ++ Seq(StructField("hashing_id", StringType, true))))
Obviously, you need to replace hashCode with the hash function you need.
Edit: to use a SHA-1 function
Define the function in a separate object:
    import java.security.MessageDigest

    object Encoder {
      // hex-encode the digest bytes
      def sha1(s: Row): String =
        MessageDigest.getInstance("SHA-1").digest(s.mkString.getBytes()).map("%02x".format(_)).mkString
    }
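As a side check outside Spark, the hashing step can be sketched with the JDK alone. This is only an illustration under stated assumptions: `RowHashSketch`, `sha1Hex`, the `|` separator, and the UTF-8 encoding are hypothetical choices that do not appear in the answer. Note that calling `toString` on a byte array returns the array's reference rather than the digest contents, so the sketch hex-encodes the bytes.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Hypothetical helper, not part of the original answer.
object RowHashSketch {
  // Joins the field values with a separator and returns the SHA-1 digest
  // as 40 lowercase hex characters. The separator keeps Seq("ab", "c")
  // and Seq("a", "bc") from producing the same input string.
  def sha1Hex(fields: Seq[Any], sep: String = "|"): String = {
    // A fresh MessageDigest per call: instances are stateful and not thread-safe.
    val digest = MessageDigest.getInstance("SHA-1")
    val bytes = digest.digest(fields.mkString(sep).getBytes(StandardCharsets.UTF_8))
    // Mask each byte to 0..255 before formatting it as two hex digits.
    bytes.map(b => "%02x".format(b & 0xff)).mkString
  }
}
```

With a Row, this could presumably be called as `RowHashSketch.sha1Hex(x.toSeq)` inside the `map`, in place of `x.toSeq.hashCode()`.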
Then, in your original class, you can call the function as follows:
    val newDF = sqlContext.createDataFrame(
      wordsRDD.map(x => Row(x.toSeq ++ Seq(Encoder.sha1(x)): _*)),
      StructType(schema.iterator.toSeq ++ Seq(StructField("hashing_id", StringType, true)))
    ).rdd.collect() // collect() brings the rows back to the driver as an Array[Row]
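The "Task not serializable" error in the question is reported because the `map` closure captures the `sun.misc.BASE64Encoder` instance created outside it, and that class is not serializable. If a Base64-encoded digest is wanted rather than hex, one possible sketch is to build the digest and the encoder inside the function that the closure calls. This assumes Java 8 or later, where `java.util.Base64` is available. `Base64HashSketch` and `sha1Base64` are hypothetical names, and UTF-8 is an assumed encoding.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest
import java.util.Base64

// Hypothetical helper: it holds no fields, so a closure that calls it
// does not capture any non-serializable state.
object Base64HashSketch {
  // Returns the SHA-1 digest of the text, Base64-encoded (28 characters).
  def sha1Base64(text: String): String = {
    // Both the digest and the encoder are obtained inside the call.
    val digest = MessageDigest.getInstance("SHA-1")
    val hashed = digest.digest(text.getBytes(StandardCharsets.UTF_8))
    Base64.getEncoder.encodeToString(hashed)
  }
}
```

Inside the `map`, this could presumably be used as `Base64HashSketch.sha1Base64(x.mkString)`, mirroring the `Encoder.sha1(x)` call.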