python - Scrapy spider stops abruptly
I am using the example here for changing identity with Tor/Privoxy. I have faced several issues, such as having to type "scrapy crawl something.py" multiple times before the spider starts, or having the spider stop abruptly in the middle of a crawl without any sort of error message.
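(As an aside, scrapy crawl takes the spider name, which is "it" in the code below, not the filename something.py. If retyping the command is part of the problem, the crawl can also be started from a small launcher script. A minimal sketch using Scrapy's stock CrawlerProcess API, assuming it is run from the project root; run.py is a made-up filename:)

# run.py -- hypothetical launcher script, sketch only.
# Picks up the project's settings.py and starts the spider by name.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('it')  # spider name, not 'something.py'
process.start()      # blocks until the crawl finishes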
something.py
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.utils.response import get_base_url
from urlparse import urljoin  # Python 2; use urllib.parse on Python 3

from jobscentral.items import JobsItems


class It(CrawlSpider):
    name = 'it'
    allowed_domains = ["www.jobstreet.com.sg"]  # note: differs from the jobscentral.com.sg start URL
    start_urls = [
        'https://jobscentral.com.sg/jobs-it',
    ]
    custom_settings = {
        'TOR_RENEW_IDENTITY_ENABLED': True,
        'TOR_ITEMS_TO_SCRAPE_PER_IDENTITY': 20,
    }
    download_delay = 4
    handle_httpstatus_list = [301, 302]

    rules = (
        # Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="next"]',)),
        #      callback='parse', follow=True),
        Rule(LinkExtractor(allow_domains=("jobscentral.com.sg",),
                           restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="next"]',)),
             callback='parse', follow=True),
    )

    def parse(self, response):
        items = []
        self.logger.info("Visited outer link %s", response.url)
        for sel in response.xpath('//h4'):
            item = JobsItems()
            # ... (the per-item extraction, and the request that sets
            # response.meta['item'] for parse_jobdetails, is omitted here) ...

        next_page = response.xpath('//li[@class="page-item"]/a[@aria-label="next"]/@href').extract_first()
        if next_page:
            base_url = get_base_url(response)
            absolute_next_page = urljoin(base_url, next_page)
            yield scrapy.Request(absolute_next_page, self.parse, dont_filter=True)

    def parse_jobdetails(self, response):
        self.logger.info('Visited internal link %s', response.url)
        print(response)
        item = response.meta['item']
        item = self.getjobinformation(item, response)
        return item

    def getjobinformation(self, item, response):
        trans_table = {ord(c): None for c in u'\r\n\t\u00a0'}
        item['jobnature'] = response.xpath('//job-snapshot/dl/div[1]/dd//text()').extract_first()
        return item
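One detail worth flagging in the code above: the Scrapy documentation explicitly warns against using parse as a rule callback, because CrawlSpider uses parse internally to drive its rules, and overriding it can silently break link following. Below is a minimal sketch of the rename, with parse_page as a made-up name; this illustrates the documented constraint and is not confirmed to be the cause of the stall:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class It(CrawlSpider):
    name = 'it'
    # ... same settings as in something.py above ...
    rules = (
        Rule(LinkExtractor(allow_domains=("jobscentral.com.sg",),
                           restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="next"]',)),
             # renamed so it no longer shadows CrawlSpider's internal parse()
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # same body as the parse() method above
        self.logger.info("Visited outer link %s", response.url)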
This is the log output when it fails to start crawling:
2017-09-12 16:55:09 [scrapy.middleware] INFO: Enabled item pipelines:
['jobscentral.pipelines.JobsCentralPipeline']
2017-09-12 16:55:09 [scrapy.core.engine] INFO: Spider opened
2017-09-12 16:55:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-09-12 16:55:09 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-09-12 16:55:11 [scrapy.extensions.throttle] INFO: slot: jobscentral.com.sg | conc: 1 | delay: 4000 ms (-1000) | latency: 1993 ms | size: 67510 bytes
2017-09-12 16:55:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://jobscentral.com.sg/jobs-it> (referer: None)
2017-09-12 16:55:11 [it] INFO: Got response 200 'https://jobscentral.com.sg/jobs-it'
2017-09-12 16:55:11 [it] INFO: Visited outer link https://jobscentral.com.sg/jobs-it
2017-09-12 16:55:11 [scrapy.core.engine] INFO: Closing spider (finished)
2017-09-12 16:55:11 [it] DEBUG: Closing connection pool...
EDIT: Error log
<<<huge chunk of HTML>>> response.body here
---------------------------------------------------------
2017-09-12 17:39:01 [it] INFO: Visited outer link https://jobscentral.com.sg/jobs-it
2017-09-12 17:39:01 [scrapy.core.engine] INFO: Closing spider (finished)
2017-09-12 17:39:01 [it] DEBUG: Closing connection pool...
2017-09-12 17:39:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 290,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 68352,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 9, 12, 9, 39, 1, 683612),
 'log_count/DEBUG': 4,
 'log_count/INFO': 12,
 'memusage/max': 58212352,
 'memusage/startup': 58212352,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 9, 12, 9, 38, 58, 660671)}
2017-09-12 17:39:01 [scrapy.core.engine] INFO: Spider closed (finished)