Process a text file with an XML column in Apache Spark (Scala)


I have a file:

    1,<note><from>messi</from><body>don't forget me weekend!</body></note>
    2,<note><from>ronaldo</from><body>don't forget laliga</body></note>
    3,<note><from>neymar</from><body>i best </body></note>
    4,<note><from>suarez</from><body>don't forget me weekend!</body></note>

where the first field is an id and the second field is XML data. I need to load it into an RDD, parse the XML string, extract the fields, and create an RDD like this:

    1,messi,don't forget me weekend!
    2,ronaldo,don't forget laliga
    3,neymar,i best 
    4,suarez,don't forget me weekend!

Since the XML in my actual scenario is complex, I want to use an XML parser. How can I do this?

You can use Scala's own XML library. However, you need to parse the string into an Elem object before you can query it:

    import scala.xml._

    val str = "<note><from>messi</from><body>don't forget me weekend!</body></note>"

    // XML.loadString parses the string (str) into a scala.xml.Elem
    val xml = XML.loadString(str)
    xml: scala.xml.Elem = <note><from>messi</from><body>don't forget me weekend!</body></note>

To extract a single element, use:

    xml \\ "note" \\ "from"
    res19: scala.xml.NodeSeq = NodeSeq(<from>messi</from>)

This results in an object of type NodeSeq. To get its content as a String, use:

    (xml \\ "note" \\ "from").text
    res20: String = messi
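As a small sketch of the same idea, both fields can be pulled out of a single parsed Elem. The `Note` case class and the `parseNote` helper are hypothetical names used only for illustration, and the snippet assumes the scala-xml module is on the classpath (it normally is when running inside Spark).

```scala
import scala.xml.XML

// Hypothetical container for the two extracted fields.
final case class Note(from: String, body: String)

// Parse the string once, then query the resulting Elem for each field.
def parseNote(s: String): Note = {
  val elem = XML.loadString(s)
  // `\` matches only direct children of the root <note> element;
  // `\\` searches the node itself and all of its descendants.
  Note((elem \ "from").text, (elem \\ "body").text)
}

val note = parseNote("<note><from>messi</from><body>don't forget me weekend!</body></note>")
```

For flat documents like this one, `\` and `\\` give the same result. With deeply nested XML, `\\` can match more elements than intended, so `\` is usually the safer choice when the path is known.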

Coming back to your question:

    import scala.xml.XML

    val rdd = sc.parallelize(Array(
      (1, "<note><from>messi</from><body>don't forget me weekend!</body></note>"),
      (2, "<note><from>ronaldo</from><body>don't forget laliga</body></note>"),
      (3, "<note><from>neymar</from><body>i best </body></note>"),
      (4, "<note><from>suarez</from><body>don't forget me weekend!</body></note>")
    ))

    rdd.map { case (id, xmlStr) =>
      // Parse each XML string once, then extract both fields from the result.
      val doc = XML.loadString(xmlStr)
      (id, (doc \\ "note" \\ "from").text, (doc \\ "note" \\ "body").text)
    }.collect.foreach(println)

This prints:

    (1,messi,don't forget me weekend!)
    (2,ronaldo,don't forget laliga)
    (3,neymar,i best )
    (4,suarez,don't forget me weekend!)
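The example above starts from an in-memory array, whereas the question starts from a text file of `id,<xml>` lines. One possible way to bridge that gap is sketched below. The `parseLine` helper and the `notes.txt` path are hypothetical, and the sketch assumes each line holds exactly one record, that the id is an integer, and that malformed lines should simply be skipped.

```scala
import scala.util.Try
import scala.xml.XML

// Hypothetical helper: turn one "id,<note>...</note>" line into (id, from, body).
// Returns None when the line has no comma, the id is not an integer,
// or the XML cannot be parsed.
def parseLine(line: String): Option[(Int, String, String)] =
  // Split at the first comma only, because the XML text may itself contain commas.
  line.split(",", 2) match {
    case Array(id, xmlStr) =>
      Try {
        val doc = XML.loadString(xmlStr.trim)
        (id.trim.toInt, (doc \ "from").text, (doc \ "body").text)
      }.toOption
    case _ => None
  }

// Possible use with Spark (the path is a placeholder):
// val parsed = sc.textFile("notes.txt").flatMap(line => parseLine(line))
```

Returning an `Option` and using `flatMap` means a single bad line does not fail the whole job. If bad lines should be reported rather than dropped, returning the `Try` itself would be one alternative.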
