Parsing large XML files in Python with ElementTree


I have a Python script that parses huge XML files (the largest one is 446 MB).

    try:
        parser = etree.XMLParser(encoding='utf-8')
        tree = etree.parse(os.path.join(srcdir, filename), parser)
        root = tree.getroot()
    except Exception, e:
        print "error parsing file " + str(filename) + " reason " + str(e.message)

    for child in root:
        if "personname" in child.tag:
            personname = child.text

The XML looks like this:

    <?xml version="1.0" encoding="utf-8"?>
    <myroot xmlns:xsi="http://www.w3.org/2001/xmlschema-instance" xmlns:xsd="http://www.w3.org/2001/xmlschema" uuid="ertr" xmlns="http://www.example.org/yml/data/litsmlv2">
      <aliases authority="opp" xmlns="http://www.example.org/yml/data/commonv2">
        <description>mydata</description>
        <identifier>43hhjh87n4nm</identifier>
      </aliases>
      <rollno uom="kpa">39979172.201167159</rollno>
      <personname>miracle smith</personname>
      <date>2017-06-02t01:10:32-05:00</date>
    ....

All I want is the content of the personname tag, that's all. I don't care about the other tags.

Sadly, the files are huge and I keep getting this error when I use the code above:

    error parsing file 2eb6d894-0775-e611.xml reason unknown error, line 1, column 310915857
    error parsing file 2ecc18b5-ef41-e711-80f.xml reason content at end of document, line 1, column 3428182
    error parsing file 2f0d6926-b602-e711-80f4-005.xml reason content at end of document, line 1, column 6162118
    error parsing file 2f12636b-b2f5-e611-80f3-00.xml reason content at end of document, line 1, column 8014679
    error parsing file 2f14e35a-d22b-4504-8866-.xml reason content at end of document, line 1, column 8411238
    error parsing file 2f50c2eb-55c6-e611-80f0-005056a.xml reason content at end of document, line 1, column 7636614
    error parsing file 3a1a3806-b6af-e611-80ef-00505.xml reason content at end of document, line 1, column 11032486

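My XML is fine and has no extra content. It seems that parsing large files is what causes the error. I have looked at iterparse(), but it seems too complex for what I want to achieve, since it provides parsing of the whole DOM while I just want the one tag that is under the root. Also, I could not find a sample that gets the correct value by tag name.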

Should I use a regex parse, or the grep/awk way? Or can a tweak to my code let me get the person name from these huge files?

Update: I tried the sample below, and it seems to print the whole world from the XML except my tag.

Does iterparse read from the bottom to the top of the file? In that case it would take a long time to get to the top, i.e. to my personname tag. I tried changing the line below to read from end to start with events=("end", "start"), and it does the same thing!

    path = []
    for event, elem in ET.iterparse('d:\\mystage\\2-80ea-005056.xml', events=("start", "end")):
        if event == 'start':
            path.append(elem.tag)
        elif event == 'end':
            # process the tag
            print elem.text  # prints the whole world
            if elem.tag == 'personname':
                print elem.text
            path.pop()

iterparse is not difficult to use in this case.

temp.xml is the file presented in your question with a </myroot> stuck on as a line at the end.

Think of the source = line as boilerplate, if you will, that parses the XML file and returns chunks of it element by element, indicating whether each chunk is the 'start' of an element or the 'end', and supplying information about the element.
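To make the event stream concrete, here is a minimal sketch that records what iterparse yields for a tiny document. The miniature XML, its namespace URI, and the name `events_seen` are made up for illustration and are not taken from the question.

```python
from io import BytesIO
from xml.etree import ElementTree

# Hypothetical miniature document, used only to show the event stream.
sample = BytesIO(b'<root xmlns="http://www.example.org/ns"><a>1</a><b>2</b></root>')

events_seen = []
for an_event, an_element in ElementTree.iterparse(sample, events=('start', 'end')):
    # Each item is an (event name, element) pair; record the event and the tag.
    events_seen.append((an_event, an_element.tag))

for pair in events_seen:
    print(pair)
```

Every tag comes back prefixed with its namespace in braces, for example '{http://www.example.org/ns}a'. The root element in the question declares a default namespace too, so a plain comparison with 'personname' would never match there.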

In this case we need to consider only the 'start' events. We watch for 'personname' tags and pick up their texts. Having found the one and only such item in the XML file, we abandon processing.

    >>> from xml.etree import ElementTree
    >>> source = iter(ElementTree.iterparse('temp.xml', events=('start', 'end')))
    >>> for an_event, an_element in source:
    ...     if an_event=='start' and an_element.tag.endswith('personname'):
    ...         an_element.text
    ...         break
    ...
    'miracle smith'
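One caveat: the ElementTree documentation only guarantees that an element's text is populated once its 'end' event has been emitted, so reading it at 'start' may return None on some inputs. A more defensive variant might look like the sketch below. The helper name `first_text` and the cut-down demo document are assumptions for illustration, not part of the original answer.

```python
from io import BytesIO
from xml.etree import ElementTree


def first_text(source, local_name):
    """Return the text of the first element whose local tag name matches."""
    for _event, element in ElementTree.iterparse(source, events=('end',)):
        # Strip a leading '{namespace}' part, if any, before comparing.
        if element.tag.rsplit('}', 1)[-1] == local_name:
            return element.text
        # Drop the contents of elements we are finished with to save memory.
        element.clear()
    return None


# Cut-down version of the document in the question, for demonstration only.
demo = BytesIO(
    b'<myroot xmlns="http://www.example.org/yml/data/litsmlv2">'
    b'<rollno uom="kpa">39979172.201167159</rollno>'
    b'<personname>miracle smith</personname>'
    b'<date>2017-06-02t01:10:32-05:00</date>'
    b'</myroot>'
)
print(first_text(demo, 'personname'))
```

`source` can be a file name or a file object opened in binary mode. Since personname sits near the top of the file, the function returns after reading only a small part of it.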

Edit, in response to a question in a comment:

Normally you wouldn't do this, since iterparse is intended for use with large chunks of XML. However, by wrapping a string in a StringIO object it can be processed with iterparse.

    >>> from xml.etree import ElementTree
    >>> from io import StringIO
    >>> xml = StringIO('''\
    ... <?xml version="1.0" encoding="utf-8"?>
    ... <myroot xmlns:xsi="http://www.w3.org/2001/xmlschema-instance" xmlns:xsd="http://www.w3.org/2001/xmlschema" uuid="ertr" xmlns="http://www.example.org/yml/data/litsmlv2">
    ...   <aliases authority="opp" xmlns="http://www.example.org/yml/data/commonv2">
    ...     <description>mydata</description>
    ...     <identifier>43hhjh87n4nm</identifier>
    ...   </aliases>
    ...   <rollno uom="kpa">39979172.201167159</rollno>
    ...   <personname>miracle smith</personname>
    ...   <date>2017-06-02t01:10:32-05:00</date>
    ... </myroot>''')
    >>> source = iter(ElementTree.iterparse(xml, events=('start', 'end')))
    >>> for an_event, an_element in source:
    ...     if an_event=='start' and an_element.tag.endswith('personname'):
    ...         an_element.text
    ...         break
    ...
    'miracle smith'
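Matching with endswith would also accept any other tag whose name happens to end in 'personname'. If that matters, a stricter sketch compares against the fully qualified tag name instead. This version assumes the namespace URI shown in the xmlns attribute of myroot, and the names `NS`, `WANTED` and `found` are illustrative.

```python
from io import BytesIO
from xml.etree import ElementTree

# Namespace URI copied from the xmlns attribute of <myroot> in the sample.
NS = 'http://www.example.org/yml/data/litsmlv2'
WANTED = '{%s}personname' % NS

# In-memory stand-in for the file; bytes avoid any encoding ambiguity.
xml_bytes = BytesIO(
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<myroot xmlns="http://www.example.org/yml/data/litsmlv2">'
    b'<personname>miracle smith</personname>'
    b'</myroot>'
)

found = None
for an_event, an_element in ElementTree.iterparse(xml_bytes, events=('end',)):
    # ElementTree reports namespaced tags as '{uri}localname'.
    if an_element.tag == WANTED:
        found = an_element.text
        break

print(found)
```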
