java - How to extract "Infobox company" data from wiki dumps
I have downloaded the big wiki dump XML file from https://dumps.wikimedia.org/enwiki/20170520/
I want to extract company metadata, such as the company name and parent company, from the wiki dump. The company data is located in the dump XML inside the template below:
    {{Infobox company
    | name =
    | logo =
    | type =
    | industry =
    | fate =
    | predecessor = <!-- or: | predecessors = -->
    | successor = <!-- or: | successors = -->
    | founded = <!-- if known: {{Start date and age|YYYY|MM|DD}} in [[city]], [[state]], [[country]] -->
    | founder = <!-- or: | founders = -->
    | defunct = <!-- {{End date|YYYY|MM|DD}} -->
    | hq_location_city =
    | hq_location_country =
    | area_served = <!-- or: | areas_served = -->
    | key_people =
    | products =
    | owner = <!-- or: | owners = -->
    | num_employees =
    | num_employees_year = <!-- year of num_employees data (if known) -->
    | parent =
    | website = <!-- {{URL|example.com}} -->
    }}

I did some research and found the MediaWiki parser from JWPL. Reference: https://github.com/dkpro/dkpro-jwpl/blob/master/de.tudarmstadt.ukp.wikipedia.parser/src/main/java/de/tudarmstadt/ukp/wikipedia/parser/tutorial/t1_simpleparserdemo.java
https://dkpro.github.io/dkpro-jwpl/jwplparser/
I tried to use this parser, but it requires the file to be converted into a String. The wiki dump XML file is 60 GB in size, so I can't convert such a big file into a String and keep it in memory. Also, there is no description in the MediaWiki parser documentation of how to find a specific element such as "Infobox company", go inside it, and extract the name and other fields. Below is the sample code for the MediaWiki parser:
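    import java.io.File;
    import java.io.IOException;
    import org.apache.commons.io.FileUtils;
    import de.tudarmstadt.ukp.wikipedia.parser.Link;
    import de.tudarmstadt.ukp.wikipedia.parser.ParsedPage;
    import de.tudarmstadt.ukp.wikipedia.parser.Section;
    import de.tudarmstadt.ukp.wikipedia.parser.mediawiki.MediaWikiParser;
    import de.tudarmstadt.ukp.wikipedia.parser.mediawiki.MediaWikiParserFactory;

    public static void main(String[] args) throws IOException {
        File file = new File("c:/users/njaiswal/downloads/accenture_data_from_wikidumps.xml");
        // Reads the whole file into memory; this is what does not scale to 60 GB.
        String str = FileUtils.readFileToString(file);

        // Get a ParsedPage object
        MediaWikiParserFactory pf = new MediaWikiParserFactory();
        MediaWikiParser parser = pf.createParser();
        ParsedPage pp = parser.parse(str);

        // Print some statistics for each section
        for (Section section : pp.getSections()) {
            System.out.println("section : " + section.getTitle());
            System.out.println(" nr of paragraphs : " + section.nrOfParagraphs());
            System.out.println(" nr of tables : " + section.nrOfTables());
            System.out.println(" nr of nested lists : " + section.nrOfNestedLists());
            System.out.println(" nr of definition lists: " + section.nrOfDefinitionLists());
            // Print the targets of the internal links in this section
            for (Link link : section.getLinks(Link.type.INTERNAL)) {
                System.out.println(" " + link.getTarget());
            }
        }
    }

Is there another parser that can solve this problem? Or can I use the same MediaWiki parser to find "Infobox company" and extract its fields? Any help is appreciated. Thanks.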
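On the memory concern above: one way to avoid holding the whole dump in a String is to stream it and look at one page at a time. Below is a minimal sketch using the JDK's built-in StAX reader. It assumes the usual dump layout of `<page>` elements containing a `<title>` and a `<text>` element; `DumpStreamer` and `forEachPage` are names made up for this illustration, not part of any library mentioned here.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.BiConsumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: streams a MediaWiki XML dump page by page.
final class DumpStreamer {

    /** Hands (title, wikitext) of each page to the callback; only one page is in memory at a time. */
    static void forEachPage(InputStream in, BiConsumer<String, String> onPage)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Deliver each text node as a single chunk instead of several pieces.
        factory.setProperty(XMLInputFactory.IS_COALESCING, true);
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        String title = null;
        while (reader.hasNext()) {
            if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                continue;
            }
            String tag = reader.getLocalName();
            if (tag.equals("title")) {
                title = reader.getElementText();
            } else if (tag.equals("text")) {
                // Assumption: <text> follows the <title> of the same page.
                onPage.accept(title, reader.getElementText());
            }
        }
        reader.close();
    }

    public static void main(String[] args) throws Exception {
        // Tiny in-memory stand-in for the real dump; for the real file, pass a FileInputStream.
        String xml = "<mediawiki><page><title>Audi</title><revision>"
                + "<text>{{Infobox company | name = Audi AG}}</text>"
                + "</revision></page></mediawiki>";
        forEachPage(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                (title, text) -> System.out.println(title + ": " + text.length() + " chars"));
    }
}
```

On a full dump, the JDK's XML processing limits may need to be raised (for example with `-Djdk.xml.totalEntitySizeLimit=0`); I have not verified this against the 20170520 dump, so treat it as something to check rather than a guarantee.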
Update: I tried to use the WikiXMLJ parser as Khalil suggested. I am able to get the "Infobox" data, but I want to limit it to "Infobox company" data only. Below are the code and its output:
    import edu.jhu.nlp.wikipedia.*;

    public class Test {
        public static void main(String[] args) throws Exception {
            WikiXMLParser parser = WikiXMLParserFactory.getSAXParser("c:/users/njaiswal/downloads/enwiki-20170520-pages-articles-multistream.xml/enwiki-20170520-pages-articles-multistream.xml");
            parser.setPageCallback(new PageCallbackHandler() {
                public void process(WikiPage page) {
                    try {
                        // Prints the infobox of every page, whatever its type
                        InfoBox infoBox = page.getInfoBox();
                        System.out.println(infoBox.dumpRaw());
                    } catch (WikiTextParserException e) {
                        // TODO Auto-generated catch block
                        e.printStackTrace();
                    }
                    // do something with the info box
                }
            });
            parser.parse();
        }
    }

Output:
    {{infobox monarch | name = attila | title = [[list of hunnic rulers|ruler]] of [[hunnic empire]] | place of burial = }}
    {{infobox sea | name = aegean sea | image = aegean sea map.png | caption = map of aegean sea | pushpin_map = world | pushpin_map_alt = world | pushpin_label_position = right }}
    {{infobox company | name = audi ag | logo = audi-logo 2016.svg | logo_size = 235 | image = audi ingolstadt.jpg | image_size = 265 }}
I have used WikiXMLJ before for simple dump parsing. Something like this should parse it:
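    // dumpPath should be something like "c:/your/path/articles.xml.bz2"
    WikiXMLParser wxsp = WikiXMLParserFactory.getSAXParser(dumpPath);
    // Both braces must be escaped in a Java regex; DOTALL lets "." span line breaks,
    // and CASE_INSENSITIVE also matches "{{Infobox company".
    final Pattern pattern = Pattern.compile("\\{\\{infobox company.+",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    wxsp.setPageCallback(new PageCallbackHandler() {
        @Override
        public void process(WikiPage page) {
            try {
                InfoBox infoBox = page.getInfoBox();
                if (infoBox == null) {
                    return; // this page has no infobox
                }
                // getInfoBox() returns an InfoBox, so match against its raw text
                Matcher matcher = pattern.matcher(infoBox.dumpRaw());
                while (matcher.find()) {
                    System.out.println(matcher.group(0));
                }
            } catch (WikiTextParserException e) {
                e.printStackTrace();
            }
        }
    });
    wxsp.parse();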
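Once the raw "Infobox company" text is isolated, the individual fields (name, parent, and so on) still have to be pulled out. Here is a rough sketch of one way to do that, assuming the raw text starts with `{{infobox company` and ends with `}}` as in the output shown in the question. `InfoboxFields` and `parse` are hypothetical names for this example, and the `parent` value in `main` is made up for the demo. It strips HTML comments first (the blank template contains `|` inside comments) and only splits on `|` outside nested `{{...}}` and `[[...]]`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: turns raw "Infobox company" wikitext into key/value pairs.
final class InfoboxFields {

    private static final String PREFIX = "{{infobox company";

    /** Returns the template parameters, or an empty map if the text is not a company infobox. */
    static Map<String, String> parse(String raw) {
        Map<String, String> fields = new LinkedHashMap<>();
        // Drop HTML comments such as "<!-- or: | owners = -->" before splitting.
        String text = raw.replaceAll("(?s)<!--.*?-->", "").trim();
        if (!text.regionMatches(true, 0, PREFIX, 0, PREFIX.length()) || !text.endsWith("}}")) {
            return fields;
        }
        String body = text.substring(2, text.length() - 2);
        int depth = 0;   // nesting level of {{...}} and [[...]] inside the body
        int start = -1;  // index just after the "|" that opened the current parameter
        for (int i = 0; i <= body.length(); i++) {
            boolean atEnd = i == body.length();
            if (!atEnd && i + 1 < body.length()) {
                String two = body.substring(i, i + 2);
                if (two.equals("{{") || two.equals("[[")) { depth++; i++; continue; }
                if (two.equals("}}") || two.equals("]]")) { depth--; i++; continue; }
            }
            if (atEnd || (depth == 0 && body.charAt(i) == '|')) {
                if (start >= 0) {
                    addField(fields, body.substring(start, i));
                }
                start = i + 1;
            }
        }
        return fields;
    }

    /** Splits "key = value" at the first '=' and stores the trimmed pair. */
    private static void addField(Map<String, String> fields, String param) {
        int eq = param.indexOf('=');
        if (eq >= 0) {
            fields.put(param.substring(0, eq).trim(), param.substring(eq + 1).trim());
        }
    }

    public static void main(String[] args) {
        // Sample text modelled on the output in the question; the parent value is invented.
        String raw = "{{infobox company | name = audi ag | logo = audi-logo 2016.svg"
                + " | parent = [[volkswagen group]] }}";
        Map<String, String> fields = parse(raw);
        System.out.println(fields.get("name") + " / " + fields.get("parent"));
    }
}
```

Inside the page callback this could be called as `InfoboxFields.parse(infoBox.dumpRaw())`, skipping pages for which the map is empty. Real articles vary more than this sketch handles (for example `{{Infobox_company` or unusual whitespace), so the prefix check may need loosening.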