Parse large XML in Python using xmltree
I have a Python script that parses huge XML files (the largest one is 446 MB):
    try:
        parser = etree.XMLParser(encoding='utf-8')
        tree = etree.parse(os.path.join(srcdir, filename), parser)
        root = tree.getroot()
    except Exception, e:
        print "error parsing file " + str(filename) + " reason " + str(e.message)

    for child in root:
        if "personname" in child.tag:
            personname = child.text
The XML looks like this:

    <?xml version="1.0" encoding="utf-8"?>
    <myroot xmlns:xsi="http://www.w3.org/2001/xmlschema-instance" xmlns:xsd="http://www.w3.org/2001/xmlschema" uuid="ertr" xmlns="http://www.example.org/yml/data/litsmlv2">
      <aliases authority="opp" xmlns="http://www.example.org/yml/data/commonv2">
        <description>mydata</description>
        <identifier>43hhjh87n4nm</identifier>
      </aliases>
      <rollno uom="kpa">39979172.201167159</rollno>
      <personname>miracle smith</personname>
      <date>2017-06-02t01:10:32-05:00</date>
      ....
All I want is the contents of the personname tags, that's all. I don't care about the other tags.
Sadly, the files are huge, and I keep getting this error when I use the code above:
    error parsing file 2eb6d894-0775-e611.xml reason unknown error, line 1, column 310915857
    error parsing file 2ecc18b5-ef41-e711-80f.xml reason Extra content at the end of the document, line 1, column 3428182
    error parsing file 2f0d6926-b602-e711-80f4-005.xml reason Extra content at the end of the document, line 1, column 6162118
    error parsing file 2f12636b-b2f5-e611-80f3-00.xml reason Extra content at the end of the document, line 1, column 8014679
    error parsing file 2f14e35a-d22b-4504-8866-.xml reason Extra content at the end of the document, line 1, column 8411238
    error parsing file 2f50c2eb-55c6-e611-80f0-005056a.xml reason Extra content at the end of the document, line 1, column 7636614
    error parsing file 3a1a3806-b6af-e611-80ef-00505.xml reason Extra content at the end of the document, line 1, column 11032486
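My XML is fine and has no extra content. It seems that parsing large files is what causes the error. I have looked at iterparse(), but it seems too complex for what I want to achieve: it provides parsing of the whole DOM, while I only want the one tag that sits under the root. I also could not find a sample that shows how to get the correct value by tag name.
Should I use regex parsing, or a grep/awk approach? Or is there some tweak to my code that will let me get the person name from these huge files?
Update: I tried this sample, and it seems to print the whole world from the XML except my tag.
Does iterparse read from the bottom to the top of the file? In that case it would take a long time to get to the top, i.e. to my personname tag. I tried changing the line below to read from end to start, events=("end", "start"), and it does the same thing!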
    path = []
    for event, elem in ET.iterparse('d:\\mystage\\2-80ea-005056.xml', events=("start", "end")):
        if event == 'start':
            path.append(elem.tag)
        elif event == 'end':
            # process the tag
            print elem.text  # prints the whole world
            if elem.tag == 'personname':
                print elem.text
            path.pop()
iterparse is not difficult to use in this case.
temp.xml is the file presented in your question, with a </myroot> line stuck on at the end.
Think of the source = line as boilerplate, if you will. It parses the XML file and returns it in chunks, element by element, indicating whether each chunk is the 'start' of an element or the 'end', and supplying information about the element.
In this case we need to consider only 'start' events. We watch for 'personname' tags and pick up their texts. Having found the one and only such item in the XML file, we abandon the processing.
    >>> from xml.etree import ElementTree
    >>> source = iter(ElementTree.iterparse('temp.xml', events=('start', 'end')))
    >>> for an_event, an_element in source:
    ...     if an_event == 'start' and an_element.tag.endswith('personname'):
    ...         an_element.text
    ...         break
    ...
    'miracle smith'
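One caveat worth hedging: the ElementTree documentation notes that an element's text is not guaranteed to be populated yet when its 'start' event fires, so on a very large file it may be safer to wait for the 'end' event. The following is a minimal sketch of that variant, not the code from the answer above. The function name `first_text_for_tag` and its parameters are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def first_text_for_tag(source, local_name):
    """Return the text of the first element whose local tag name matches.

    `source` is a filename or a binary file object; `local_name` is the tag
    name without its namespace. Both names are illustrative assumptions.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Tags arrive as '{namespace}name' when the document declares a
        # namespace, so compare only the part after the closing brace.
        if elem.tag.rpartition("}")[2] == local_name:
            return elem.text  # returning stops the parse early
        elem.clear()  # drop text and children of elements we do not need
    return None
```

Calling `first_text_for_tag('temp.xml', 'personname')` would stop reading as soon as the first matching element has been closed, so a tag near the top of a large file is reached quickly. Clearing the elements that do not match keeps memory use down when the tag sits further into the file.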
Edit, in response to a question in a comment:
Normally you wouldn't do this, since iterparse is intended for use with large chunks of XML. However, by wrapping a string in a StringIO object it can be processed with iterparse.
    >>> from xml.etree import ElementTree
    >>> from io import StringIO
    >>> xml = StringIO('''\
    ... <?xml version="1.0" encoding="utf-8"?>
    ... <myroot xmlns:xsi="http://www.w3.org/2001/xmlschema-instance" xmlns:xsd="http://www.w3.org/2001/xmlschema" uuid="ertr" xmlns="http://www.example.org/yml/data/litsmlv2">
    ... <aliases authority="opp" xmlns="http://www.example.org/yml/data/commonv2">
    ... <description>mydata</description>
    ... <identifier>43hhjh87n4nm</identifier>
    ... </aliases>
    ... <rollno uom="kpa">39979172.201167159</rollno>
    ... <personname>miracle smith</personname>
    ... <date>2017-06-02t01:10:32-05:00</date>
    ... </myroot>''')
    >>> source = iter(ElementTree.iterparse(xml, events=('start', 'end')))
    >>> for an_event, an_element in source:
    ...     if an_event == 'start' and an_element.tag.endswith('personname'):
    ...         an_element.text
    ...         break
    ...
    'miracle smith'
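It may also help to see why the comparison elem.tag == 'personname' in the question never matched, and why the answer uses endswith. Because the root element declares a default namespace, ElementTree reports each tag in the qualified form '{namespace}name'. The sketch below uses a trimmed copy of the sample XML; the constant names `NS` and `SAMPLE` are illustrative, and it shows an exact match on the qualified name as an alternative to endswith.

```python
import io
import xml.etree.ElementTree as ET

# Namespace URI copied from the root element of the sample in the question.
NS = "http://www.example.org/yml/data/litsmlv2"

# Trimmed copy of the sample document, kept as bytes so the encoding
# declaration is handled by the parser.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<myroot xmlns="http://www.example.org/yml/data/litsmlv2">
  <aliases xmlns="http://www.example.org/yml/data/commonv2">
    <description>mydata</description>
  </aliases>
  <personname>miracle smith</personname>
</myroot>"""

# Collect every tag exactly as ElementTree reports it.
tags = [elem.tag for _event, elem in ET.iterparse(io.BytesIO(SAMPLE), events=("end",))]

# Exact match against the qualified name '{namespace}personname'.
qualified = "{%s}personname" % NS
names = [
    elem.text
    for _event, elem in ET.iterparse(io.BytesIO(SAMPLE), events=("end",))
    if elem.tag == qualified
]
```

Here `tags` holds only qualified names, so a bare 'personname' is never among them, while `names` picks up the text of the one element whose qualified name matches. An exact match is a little stricter than endswith, which would also accept any other tag whose name merely ends in 'personname'.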