python - Multithreading/Multiprocessing to parse a single XML file?
This question already has an answer here:
Can anyone tell me how to assign jobs to multiple threads to speed up parsing time? For example, I have an XML file with 200k lines, and I want to assign 50k lines to each of 4 threads and parse them using a SAX parser. What I have done so far has all 4 threads parsing the full 200k lines, meaning 200k * 4 = 800k lines are parsed and the results are duplicated.
Any help appreciated.
test.xml:
<?xml version="1.0" encoding="utf-8"?>
<votes>
  <row id="1" postid="1" votetypeid="2" creationdate="2014-05-13t00:00:00.000" />
  <row id="2" postid="1" votetypeid="2" creationdate="2014-05-13t00:00:00.000" />
  <row id="3" postid="3" votetypeid="2" creationdate="2014-05-13t00:00:00.000" />
  <row id="5" postid="3" votetypeid="2" creationdate="2014-05-13t00:00:00.000" />
</votes>
My source code:

import json
import xmltodict
from lxml import etree
import xml.etree.ElementTree as ElementTree
import threading
import time

def sax_parsing():
    t = threading.current_thread()
    for event, element in etree.iterparse("/home/xiang/downloads/fyp/parallel-python/test.xml"):
        # the code below reads the attributes of the specified element
        if element.tag == 'row':
            print("thread: %s" % t.name)
            row_id = element.attrib.get('id')
            row_post_id = element.attrib.get('postid')
            row_vote_type_id = element.attrib.get('votetypeid')
            row_user_id = element.attrib.get('userid')
            row_creation_date = element.attrib.get('creationdate')
            print('id: %s, postid: %s, votetypeid: %s, userid: %s, creationdate: %s'
                  % (row_id, row_post_id, row_vote_type_id, row_user_id, row_creation_date))
            element.clear()
    return

if __name__ == "__main__":
    start = time.time()  # calculate execution time
    main_thread = threading.current_thread()
    no_threads = 4
    for i in range(no_threads):
        t = threading.Thread(target=sax_parsing)
        t.start()
    for t in threading.enumerate():
        if t is main_thread:
            continue
        t.join()
    end = time.time()  # calculate execution time
    exec_time = end - start
    print('execution time: %fs' % exec_time)
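The duplication described above can be reproduced in a small self-contained version (this is a sketch, not the original code: it uses the stdlib xml.etree.ElementTree instead of lxml and embeds the sample test.xml as a byte string). Each of the 4 threads walks every row, so the total number of rows handled is rows × threads:

```python
import io
import threading
import xml.etree.ElementTree as ET

# sample test.xml embedded so the demo runs standalone
XML = b"""<?xml version="1.0" encoding="utf-8"?>
<votes>
  <row id="1" postid="1" votetypeid="2" creationdate="2014-05-13t00:00:00.000" />
  <row id="2" postid="1" votetypeid="2" creationdate="2014-05-13t00:00:00.000" />
  <row id="3" postid="3" votetypeid="2" creationdate="2014-05-13t00:00:00.000" />
  <row id="5" postid="3" votetypeid="2" creationdate="2014-05-13t00:00:00.000" />
</votes>"""

counts = []                      # rows seen per thread
lock = threading.Lock()

def sax_parsing():
    # every thread streams the WHOLE file, exactly like the question's code
    seen = 0
    for event, element in ET.iterparse(io.BytesIO(XML)):
        if element.tag == 'row':
            seen += 1
            element.clear()
    with lock:
        counts.append(seen)

threads = [threading.Thread(target=sax_parsing) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(counts))  # 16: all 4 rows are parsed by each of the 4 threads
```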
The simplest way is to expand the parse function to receive a start row and an end row, like so:

def sax_parsing(start, end):

and when starting the threads, pass each one its range:

t = threading.Thread(target=sax_parsing, args=(i * 50, (i + 1) * 50))

(note the parentheses: without them, i + 1 * 50 evaluates to i + 50). Then change

if element.tag == 'row':

to

if element.tag == 'row' and start <= int(element.attrib.get('id')) < end:

(the id attribute is a string, so convert it before comparing), so each thread only processes the rows in its given range. I didn't test this, so play around with it.
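Putting the suggestion together, here is a minimal self-contained sketch of the range-filtering idea. It uses the stdlib xml.etree.ElementTree instead of lxml and embeds the sample test.xml as a byte string; the run helper and the chunk size are illustrative additions, not part of the original answer:

```python
import io
import threading
import xml.etree.ElementTree as ET

# sample test.xml embedded so the sketch runs standalone
XML = b"""<?xml version="1.0" encoding="utf-8"?>
<votes>
  <row id="1" postid="1" votetypeid="2" creationdate="2014-05-13t00:00:00.000" />
  <row id="2" postid="1" votetypeid="2" creationdate="2014-05-13t00:00:00.000" />
  <row id="3" postid="3" votetypeid="2" creationdate="2014-05-13t00:00:00.000" />
  <row id="5" postid="3" votetypeid="2" creationdate="2014-05-13t00:00:00.000" />
</votes>"""

def run(no_threads, chunk):
    results = []                 # rows kept, shared across threads
    lock = threading.Lock()

    def sax_parsing(start, end):
        # each thread still streams the whole file, but only keeps
        # rows whose id falls inside its [start, end) slice
        for event, element in ET.iterparse(io.BytesIO(XML)):
            if element.tag == 'row':
                row_id = int(element.attrib.get('id'))
                if start <= row_id < end:
                    with lock:
                        results.append(row_id)
                element.clear()

    threads = [threading.Thread(target=sax_parsing,
                                args=(i * chunk, (i + 1) * chunk))
               for i in range(no_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)

print(run(no_threads=2, chunk=4))  # [1, 2, 3, 5] -- each row kept exactly once
```

Note that every thread still parses the entire file, so this removes the duplicated output rather than the duplicated parsing work; and because of CPython's GIL, CPU-bound parsing in threads rarely runs faster than a single thread, so multiprocessing (or splitting the file itself) is usually needed for a real speedup.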