java - How to extract "infobox company" data from wiki dumps -

March 15, 2011

I have downloaded the big wiki dump XML file from https://dumps.wikimedia.org/enwiki/20170520/

I want to extract metadata such as the company name and parent company from the wiki dump. The company data is located in the template below:

{{infobox company
| name =
| logo =
| type =
| industry =
| fate =
| predecessor = <!-- or: | predecessors = -->
| successor = <!-- or: | successors = -->
| founded = <!-- if known: {{start date and age|yyyy|mm|dd}} in [[city]], [[state]], [[country]] -->
| founder = <!-- or: | founders = -->
| defunct = <!-- {{end date|yyyy|mm|dd}} -->
| hq_location_city =
| hq_location_country =
| area_served = <!-- or: | areas_served = -->
| key_people =
| products =
| owner = <!-- or: | owners = -->
| num_employees =
| num_employees_year = <!-- year of num_employees data (if known) -->
| parent =
| website = <!-- {{url|example.com}} -->
}}

I did some research and found the MediaWiki parser from JWPL. Reference: https://github.com/dkpro/dkpro-jwpl/blob/master/de.tudarmstadt.ukp.wikipedia.parser/src/main/java/de/tudarmstadt/ukp/wikipedia/parser/tutorial/t1_simpleparserdemo.java

https://dkpro.github.io/dkpro-jwpl/jwplparser/

I tried to use the parser, but it requires the file to be converted into a String. The wiki dump XML file is 60 GB in size, so I can't convert a file that big into a String and keep it in memory. Also, the MediaWiki parser has no description of how to find a specific element such as infobox company, go inside it, and extract the name and other fields. Below is sample code for the MediaWiki parser:

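import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;
import de.tudarmstadt.ukp.wikipedia.parser.Link;
import de.tudarmstadt.ukp.wikipedia.parser.ParsedPage;
import de.tudarmstadt.ukp.wikipedia.parser.Section;
import de.tudarmstadt.ukp.wikipedia.parser.mediawiki.MediaWikiParser;
import de.tudarmstadt.ukp.wikipedia.parser.mediawiki.MediaWikiParserFactory;

public static void main(String[] args) throws IOException {
    File file = new File("c:/users/njaiswal/downloads/accenture_data_from_wikidumps.xml");
    // Reads the whole file into memory; this is the step that does not scale to 60 GB.
    String str = FileUtils.readFileToString(file);

    // Get a ParsedPage object
    MediaWikiParserFactory pf = new MediaWikiParserFactory();
    MediaWikiParser parser = pf.createParser();
    ParsedPage pp = parser.parse(str);

    // Print some statistics and the internal links of each section
    for (Section section : pp.getSections()) {
        System.out.println("section : " + section.getTitle());
        System.out.println(" nr of paragraphs      : " + section.nrOfParagraphs());
        System.out.println(" nr of tables          : " + section.nrOfTables());
        System.out.println(" nr of nested lists    : " + section.nrOfNestedLists());
        System.out.println(" nr of definition lists: " + section.nrOfDefinitionLists());
        for (Link link : section.getLinks(Link.type.INTERNAL)) {
            System.out.println("  " + link.getTarget());
        }
    }
}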
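The memory problem described above comes from reading the whole dump into one String. A streaming reader avoids that by visiting one page at a time. The following is only a minimal sketch using the JDK's StAX API; the class name `DumpStreamer` and the method `forEachPage` are hypothetical, and it assumes the usual dump layout in which each `<page>` contains a `<title>` and a `<text>` element.

```java
import java.io.InputStream;
import java.util.function.BiConsumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: streams a MediaWiki XML dump page by page. */
final class DumpStreamer {

    /** Calls onPage(title, wikitext) for every text element found in the stream. */
    static void forEachPage(InputStream in, BiConsumer<String, String> onPage)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Deliver each element's text as one chunk instead of several pieces.
        factory.setProperty(XMLInputFactory.IS_COALESCING, true);
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        String title = null;
        try {
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String name = reader.getLocalName();
                if ("title".equals(name)) {
                    // Remember the title until the matching text element arrives.
                    title = reader.getElementText();
                } else if ("text".equals(name)) {
                    // Only the current page's wikitext is held in memory here.
                    onPage.accept(title, reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
    }
}
```

To run it on the real dump, the input would be something like a `BufferedInputStream` over a `FileInputStream` for the extracted XML file; a compressed `.bz2` file would need a decompressing stream in front of it.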

Is there any other parser that can solve this problem? Or can I use the same MediaWiki parser to find "infobox company" and extract its fields? Any help is appreciated. Thanks.

Update: I tried the wikixmlj parser as Khalil suggested. I am able to get the "infobox" data, but I want to limit it to "infobox company" data. Below are the code and output:

import edu.jhu.nlp.wikipedia.*;

public class Test {
    public static void main(String[] args) throws Exception {
        WikiXMLParser parser = WikiXMLParserFactory.getSAXParser("c:/users/njaiswal/downloads/enwiki-20170520-pages-articles-multistream.xml/enwiki-20170520-pages-articles-multistream.xml");
        parser.setPageCallback(new PageCallbackHandler() {
            public void process(WikiPage page) {
                try {
                    InfoBox infoBox = page.getInfoBox();
                    // Guard against pages that have no infobox at all.
                    if (infoBox != null) {
                        System.out.println(infoBox.dumpRaw());
                    }
                } catch (WikiTextParserException e) {
                    e.printStackTrace();
                }
            }
        });
        parser.parse();
    }
}

Output:

{{infobox monarch
| name            = attila
| title           = [[list of hunnic rulers|ruler]] of [[hunnic empire]]
| place of burial =
}}
{{infobox sea
| name = aegean sea
| image = aegean sea map.png
| caption = map of aegean sea
| pushpin_map = world
| pushpin_map_alt = world
| pushpin_label_position = right
}}
{{infobox company
| name             = audi ag
| logo             = audi-logo 2016.svg
| logo_size = 235
| image            = audi ingolstadt.jpg
| image_size = 265
}}
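Given raw template text like the output above, individual fields such as name and parent can be split out by cutting on the `|` characters that sit outside nested `{{...}}` and `[[...]]`. The sketch below is one possible approach, not something wikixmlj or JWPL provides; `InfoboxFields` and `parse` are hypothetical names, and it assumes HTML comments can simply be discarded.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: splits a raw infobox into key/value pairs. */
final class InfoboxFields {

    /** Parses "{{infobox xyz | key = value | ... }}" into an ordered map. */
    static Map<String, String> parse(String rawInfobox) {
        Map<String, String> fields = new LinkedHashMap<>();
        String s = rawInfobox.trim();
        // Strip the outer "{{" and "}}" so only nested templates affect depth.
        if (s.startsWith("{{") && s.endsWith("}}")) {
            s = s.substring(2, s.length() - 2);
        }
        // Drop HTML comments; they may contain "|" that would confuse the split.
        s = s.replaceAll("(?s)<!--.*?-->", "");

        int depth = 0;            // nesting level of {{...}} and [[...]]
        boolean first = true;     // the first part is the template name, not a field
        StringBuilder part = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean doubled = i + 1 < s.length() && s.charAt(i + 1) == c;
            if ((c == '{' || c == '[') && doubled) {
                depth++;
                part.append(c).append(c);
                i++;
            } else if ((c == '}' || c == ']') && doubled) {
                depth = Math.max(0, depth - 1);
                part.append(c).append(c);
                i++;
            } else if (c == '|' && depth == 0) {
                if (!first) {
                    addField(fields, part.toString());
                }
                first = false;
                part.setLength(0);
            } else {
                part.append(c);
            }
        }
        if (!first) {
            addField(fields, part.toString());
        }
        return fields;
    }

    /** Stores "key = value" parts; parts without "=" are ignored. */
    private static void addField(Map<String, String> fields, String part) {
        int eq = part.indexOf('=');
        if (eq > 0) {
            fields.put(part.substring(0, eq).trim(), part.substring(eq + 1).trim());
        }
    }
}
```

With this, `InfoboxFields.parse(raw).get("parent")` would give the raw wikitext of the parent field, which may still contain link markup such as `[[...]]` that needs further cleanup.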

I have used wikixmlj before; it is a simple, dumb parser. The following should parse it:

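// dumpPath should look like "c:/your/path/articles.xml.bz2"
WikiXMLParser wxsp = WikiXMLParserFactory.getSAXParser(dumpPath);
// Both braces are escaped; a bare "{" is an illegal repetition in java.util.regex.
// Case-insensitive, because the dump usually spells it "Infobox company".
final Pattern pattern = Pattern.compile("\\{\\{infobox company.+",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
wxsp.setPageCallback(new PageCallbackHandler() {
    @Override
    public void process(WikiPage page) {
        try {
            InfoBox infoBox = page.getInfoBox();
            if (infoBox == null) {
                return; // page has no infobox
            }
            // Match against the raw wikitext, not the InfoBox object itself.
            Matcher matcher = pattern.matcher(infoBox.dumpRaw());
            while (matcher.find()) {
                System.out.println(matcher.group(0));
            }
        } catch (WikiTextParserException e) {
            e.printStackTrace();
        }
    }
});
wxsp.parse();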

demo of regex
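The filtering step can also be pulled out of the callback and checked on its own, without running over the dump. This is a small sketch under the assumption that the raw infobox text starts with the template name; `InfoboxFilter` and `isCompanyInfobox` are hypothetical names, not part of wikixmlj.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: decides whether raw infobox text is an "infobox company". */
final class InfoboxFilter {

    // Anchored at the start via lookingAt(); tolerates "Infobox company" and "Infobox_company".
    private static final Pattern COMPANY =
            Pattern.compile("\\{\\{\\s*infobox[ _]company\\b", Pattern.CASE_INSENSITIVE);

    static boolean isCompanyInfobox(String rawInfobox) {
        return rawInfobox != null && COMPANY.matcher(rawInfobox.trim()).lookingAt();
    }
}
```

Inside the page callback, the raw text from `dumpRaw()` would be passed to `isCompanyInfobox` and everything else skipped.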
