python - Multithreading/Multiprocessing to parse a single XML file?
This question already has an answer here:
Can anyone tell me how to assign jobs to multiple threads to speed up parsing time? For example, I have an XML file with 200k lines, and I want to assign 50k lines to each of 4 threads and parse them using a SAX parser. What I have done so far has all 4 threads parsing the full 200k lines, meaning 200k * 4 = 800k lines are parsed and the results are duplicated.
Any help appreciated.
test.xml:
<?xml version="1.0" encoding="utf-8"?>
<votes>
  <row id="1" postid="1" votetypeid="2" creationdate="2014-05-13t00:00:00.000" />
  <row id="2" postid="1" votetypeid="2" creationdate="2014-05-13t00:00:00.000" />
  <row id="3" postid="3" votetypeid="2" creationdate="2014-05-13t00:00:00.000" />
  <row id="5" postid="3" votetypeid="2" creationdate="2014-05-13t00:00:00.000" />
</votes>
My source code:

import json
import xmltodict
from lxml import etree
import xml.etree.ElementTree as ElementTree
import threading
import time

def sax_parsing():
    t = threading.current_thread()
    for event, element in etree.iterparse("/home/xiang/downloads/fyp/parallel-python/test.xml"):
        # the code below reads the attributes of the specified element
        if element.tag == 'row':
            print("thread: %s" % t.name)
            row_id = element.attrib.get('id')
            row_post_id = element.attrib.get('postid')
            row_vote_type_id = element.attrib.get('votetypeid')
            row_user_id = element.attrib.get('userid')
            row_creation_date = element.attrib.get('creationdate')
            print('id: %s, postid: %s, votetypeid: %s, userid: %s, creationdate: %s'
                  % (row_id, row_post_id, row_vote_type_id, row_user_id, row_creation_date))
            element.clear()
    return

if __name__ == "__main__":
    start = time.time()  # calculate execution time
    main_thread = threading.current_thread()
    no_threads = 4
    for i in range(no_threads):
        t = threading.Thread(target=sax_parsing)
        t.start()
    for t in threading.enumerate():
        if t is main_thread:
            continue
        t.join()
    end = time.time()  # calculate execution time
    exec_time = end - start
    print('execution time: %fs' % exec_time)
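The duplication described above can be reproduced in a small self-contained version (this is a sketch, not the original code: it uses the stdlib xml.etree.ElementTree instead of lxml and embeds the sample test.xml as a byte string). Each of the 4 threads walks every row, so the total number of rows handled is rows × threads:

```python
import io
import threading
import xml.etree.ElementTree as ET

# sample test.xml embedded so the demo runs standalone
XML = b"""<?xml version="1.0" encoding="utf-8"?>
<votes>
  <row id="1" postid="1" votetypeid="2" creationdate="2014-05-13t00:00:00.000" />
  <row id="2" postid="1" votetypeid="2" creationdate="2014-05-13t00:00:00.000" />
  <row id="3" postid="3" votetypeid="2" creationdate="2014-05-13t00:00:00.000" />
  <row id="5" postid="3" votetypeid="2" creationdate="2014-05-13t00:00:00.000" />
</votes>"""

counts = []                      # rows seen per thread
lock = threading.Lock()

def sax_parsing():
    # every thread streams the WHOLE file, exactly like the question's code
    seen = 0
    for event, element in ET.iterparse(io.BytesIO(XML)):
        if element.tag == 'row':
            seen += 1
            element.clear()
    with lock:
        counts.append(seen)

threads = [threading.Thread(target=sax_parsing) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(counts))  # 16: all 4 rows are parsed by each of the 4 threads
```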
The simplest way is to expand the parse function to receive a start row and an end row, like so:

def sax_parsing(start, end):

and when starting the threads, pass each one its range:

t = threading.Thread(target=sax_parsing, args=(i * 50, (i + 1) * 50))

(note the parentheses: without them, i + 1 * 50 evaluates to i + 50). Then change

if element.tag == 'row':

to

if element.tag == 'row' and start <= int(element.attrib.get('id')) < end:

(the id attribute is a string, so convert it before comparing), so each thread only processes the rows in its given range. I didn't test this, so play around with it.
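Putting the suggestion together, here is a minimal self-contained sketch of the range-filtering idea. It uses the stdlib xml.etree.ElementTree instead of lxml and embeds the sample test.xml as a byte string; the run helper and the chunk size are illustrative additions, not part of the original answer:

```python
import io
import threading
import xml.etree.ElementTree as ET

# sample test.xml embedded so the sketch runs standalone
XML = b"""<?xml version="1.0" encoding="utf-8"?>
<votes>
  <row id="1" postid="1" votetypeid="2" creationdate="2014-05-13t00:00:00.000" />
  <row id="2" postid="1" votetypeid="2" creationdate="2014-05-13t00:00:00.000" />
  <row id="3" postid="3" votetypeid="2" creationdate="2014-05-13t00:00:00.000" />
  <row id="5" postid="3" votetypeid="2" creationdate="2014-05-13t00:00:00.000" />
</votes>"""

def run(no_threads, chunk):
    results = []                 # rows kept, shared across threads
    lock = threading.Lock()

    def sax_parsing(start, end):
        # each thread still streams the whole file, but only keeps
        # rows whose id falls inside its [start, end) slice
        for event, element in ET.iterparse(io.BytesIO(XML)):
            if element.tag == 'row':
                row_id = int(element.attrib.get('id'))
                if start <= row_id < end:
                    with lock:
                        results.append(row_id)
                element.clear()

    threads = [threading.Thread(target=sax_parsing,
                                args=(i * chunk, (i + 1) * chunk))
               for i in range(no_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)

print(run(no_threads=2, chunk=4))  # [1, 2, 3, 5] -- each row kept exactly once
```

Note that every thread still parses the entire file, so this removes the duplicated output rather than the duplicated parsing work; and because of CPython's GIL, CPU-bound parsing in threads rarely runs faster than a single thread, so multiprocessing (or splitting the file itself) is usually needed for a real speedup.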