Process a text file with an XML column in Apache Spark (Scala)
I have a file:
1,<note><from>messi</from><body>don't forget me weekend!</body></note>
2,<note><from>ronaldo</from><body>don't forget laliga</body></note>
3,<note><from>neymar</from><body>i best </body></note>
4,<note><from>suarez</from><body>don't forget me weekend!</body></note>
where the first field is the id and the second field is the data. I need to load it into an RDD, parse the XML string, extract the fields, and create an RDD like this:
1,messi,don't forget me weekend!
2,ronaldo,don't forget laliga
3,neymar,i best
4,suarez,don't forget me weekend!
Since the XML in my actual scenario is complex, I want to use an XML parser. How can I do this?
You can use Scala's own XML library. But you need to parse the string into an Elem object before you can query it:
import scala.xml._
val str = "<note><from>messi</from><body>don't forget me weekend!</body></note>"
val xml = XML.loadString(str)
xml: scala.xml.Elem = <note><from>messi</from><body>don't forget me weekend!</body></note>
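XML.loadString throws an exception when the input is not well-formed, which would fail a whole Spark task on one bad record. A minimal sketch of a defensive wrapper follows; the helper name parseXml is hypothetical, and it assumes scala-xml is on the classpath (as it is in spark-shell):

```scala
import scala.util.Try
import scala.xml.{Elem, XML}

// Hypothetical helper: returns None instead of throwing on malformed XML.
def parseXml(s: String): Option[Elem] = Try(XML.loadString(s)).toOption

val good = parseXml("<note><from>messi</from></note>")
val bad  = parseXml("<note><from>messi</note>") // mismatched tags, not well-formed
```

Here good holds the parsed element and bad is None, so malformed rows can be filtered out rather than crashing the job.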
To extract a single element, use:
xml \\ "note" \\ "from"
res19: scala.xml.NodeSeq = NodeSeq(<from>messi</from>)
This results in an object of type NodeSeq. To get a String, use:
(xml \\ "note" \\ "from").text
res20: String = messi
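With more complex XML, it helps to know that `\\` matches at any depth while `\` matches only direct children. The sketch below uses a made-up document with a nested `meta` element (not part of the question's data) to show the difference:

```scala
import scala.xml.XML

// Hypothetical document: a second <from> is nested inside <meta>.
val note = XML.loadString(
  "<note><from>messi</from><meta><from>agent</from></meta></note>")

// `\` looks only at the direct children of <note>.
val direct = (note \ "from").map(_.text)

// `\\` searches the element itself and all of its descendants.
val deep = (note \\ "from").map(_.text)
```

Here direct contains only "messi", while deep contains both "messi" and "agent", so `\` is the safer choice when the same tag name can appear at several levels.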
Coming back to your question:
val rdd = sc.parallelize(Array(
  (1, "<note><from>messi</from><body>don't forget me weekend!</body></note>"),
  (2, "<note><from>ronaldo</from><body>don't forget laliga</body></note>"),
  (3, "<note><from>neymar</from><body>i best </body></note>"),
  (4, "<note><from>suarez</from><body>don't forget me weekend!</body></note>")
))

// Parse each XML string once, then extract both fields from the parsed element.
rdd.map { case (id, xml) =>
  val note = XML.loadString(xml)
  (id, (note \\ "note" \\ "from").text, (note \\ "note" \\ "body").text)
}.collect.foreach(println)

(1,messi,don't forget me weekend!)
(2,ronaldo,don't forget laliga)
(3,neymar,i best )
(4,suarez,don't forget me weekend!)
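The example above starts from tuples, whereas the question starts from raw lines of the form id,xml. A minimal sketch of the missing step follows, assuming one record per line and a numeric id; the helper name parseLine is hypothetical. It splits on the first comma only, because the XML body may itself contain commas:

```scala
import scala.xml.XML

// Hypothetical helper: turns one "id,<xml>" line into (id, from, body).
def parseLine(line: String): (Int, String, String) = {
  val comma = line.indexOf(',')                          // first comma separates id from XML
  val id    = line.substring(0, comma).trim.toInt
  val note  = XML.loadString(line.substring(comma + 1))  // everything after it is the XML
  (id, (note \ "from").text, (note \ "body").text)
}

val row = parseLine(
  "1,<note><from>messi</from><body>don't forget me weekend!</body></note>")
```

In spark-shell this could be applied as sc.textFile(path).map(parseLine), where path is a placeholder for the location of your file.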