Web crawler: how to crawl data behind a "load more" button with PHP?
I am trying to crawl data from a website, but I ran into a problem: the page has a "load more" button. I can crawl the visible data, but I am not able to crawl the data that comes in after clicking the load-more button.
Using preg_match_all:
    $page = file_get_contents('https://www.healthfrog.in/chemists/medical-store/gujarat/surat');

    preg_match_all(
        '/<h3><a href="(.*?)">(.*?)<\/a><\/h3><p><i class="fa fa-map-marker"><\/i>(.*?)<\/p>/s',
        $page,
        $retailers,     // will contain the article data
        PREG_SET_ORDER  // formats the data into an array of posts
    );

    foreach ($retailers as $post) {
        $retailer['name'] = $post[2];
        $retailer['address'] = $post[3];
        echo "<b>".$retailer['name']."</b><br/>".$retailer['address']."<br/><br/>";
    }
Using DOMDocument:
    $html = new DOMDocument();
    // @ suppresses the warnings libxml raises on imperfect real-world HTML
    @$html->loadHTMLFile('https://www.healthfrog.in/chemists/medical-store/gujarat/surat');
    $xpath = new DOMXPath($html);
    $nodelist = $xpath->query('//*[@id="setrecord"]/div[@class="listing "]');

    foreach ($nodelist as $n) {
        $retailer = $xpath->query('h3/a', $n)->item(0)->nodeValue."<br>";
        $address = $xpath->query('p', $n)->item(0)->nodeValue;
        echo "<b>".$retailer."</b><br/>".$address."<br/><br/>";
    }
Any idea how to grab all of the data at once?
I think you need to try crawling the web page in a more efficient way.
My first suggestion is to use PhantomJS, a full web engine that runs from the command line. This means you can execute PhantomJS operations (written in JavaScript) for getting web pages, triggering DOM events, and collecting the data you need, all from PHP through the exec command.
PhantomJS is a headless WebKit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG.
    // Simple JavaScript example
    console.log('Loading a web page');
    var page = require('webpage').create();
    var url = 'http://phantomjs.org/';
    page.open(url, function (status) {
        // Do the DOM operations (click the "read more" button or anything else)
        // and console.log(yourDataThatYouNeed)
        phantom.exit();
    });
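As a rough sketch of the "DOM operations" step mentioned in the comment above, the extraction could look something like the function below. The name `extractRetailers` is hypothetical, and the CSS selectors are an assumption derived from the XPath in the question (`#setrecord`, `div.listing`, `h3/a`, `p`); they may need adjusting to the real markup. It does not click the load-more button itself; it only reads whatever listings are in the DOM at the time it runs.

```javascript
// Hypothetical helper, not part of PhantomJS: collects name/address pairs
// from the listing blocks. Selectors are assumed from the question's XPath.
function extractRetailers(doc) {
  // When run inside the page (no argument passed), fall back to the page's own document.
  doc = doc || document;
  var rows = doc.querySelectorAll('#setrecord > div.listing');
  var out = [];
  for (var i = 0; i < rows.length; i++) {
    var link = rows[i].querySelector('h3 a');
    var addr = rows[i].querySelector('p');
    if (!link) {
      continue; // skip blocks that do not have the expected heading link
    }
    out.push({
      name: link.textContent.trim(),
      address: addr ? addr.textContent.trim() : ''
    });
  }
  return out;
}
```

Inside the `page.open` callback, this could be run in the page context with `var data = page.evaluate(extractRetailers);` followed by `console.log(JSON.stringify(data));`, so that the PHP side only has to decode JSON from the command's output.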
To get the data back into PHP, you need a PHP driver for PhantomJS.
Here is an example PHP client for PhantomJS: https://github.com/jonnnnyw/php-phantomjs
Actually, I have a PHP driver for PhantomJS that I developed as a side project, and I'm planning to publish it on my GitHub account in the next few days.
The second way I'm suggesting (frankly, in my opinion, the right way for complex projects) is to use a scraping framework like Scrapy. You can take a look at its documentation to see how to scrape data from web pages with Scrapy.
Scrapy is a powerful, Python-based framework for extracting the data you need from websites.
You can follow the Scrapy tutorial at https://docs.scrapy.org/en/latest/intro/tutorial.html