python - Scrapy spider stops abruptly
I am using the example here for changing identity with Tor/Privoxy. I have faced several issues, such as having to type "scrapy crawl something.py" multiple times before the spider starts, or having the spider stop abruptly in the middle of a crawl without any sort of error message.
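(As an aside, scrapy crawl takes the spider name, which is "it" in the code below, not the filename something.py. If retyping the command is part of the problem, the crawl can also be started from a small launcher script. A minimal sketch using Scrapy's stock CrawlerProcess API, assuming it is run from the project root; run.py is a made-up filename:)

# run.py -- hypothetical launcher script, sketch only.
# Picks up the project's settings.py and starts the spider by name.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('it')  # spider name, not 'something.py'
process.start()      # blocks until the crawl finishes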
something.py
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.utils.response import get_base_url
from urlparse import urljoin  # Python 2; use urllib.parse on Python 3

from jobscentral.items import JobsItems


class It(CrawlSpider):
    name = 'it'
    allowed_domains = ["www.jobstreet.com.sg"]  # note: differs from the jobscentral.com.sg start URL
    start_urls = [
        'https://jobscentral.com.sg/jobs-it',
    ]
    custom_settings = {
        'TOR_RENEW_IDENTITY_ENABLED': True,
        'TOR_ITEMS_TO_SCRAPE_PER_IDENTITY': 20,
    }
    download_delay = 4
    handle_httpstatus_list = [301, 302]

    rules = (
        # Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="next"]',)),
        #      callback='parse', follow=True),
        Rule(LinkExtractor(allow_domains=("jobscentral.com.sg",),
                           restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="next"]',)),
             callback='parse', follow=True),
    )

    def parse(self, response):
        items = []
        self.logger.info("Visited outer link %s", response.url)
        for sel in response.xpath('//h4'):
            item = JobsItems()
            # ... (the per-item extraction, and the request that sets
            # response.meta['item'] for parse_jobdetails, is omitted here) ...

        next_page = response.xpath('//li[@class="page-item"]/a[@aria-label="next"]/@href').extract_first()
        if next_page:
            base_url = get_base_url(response)
            absolute_next_page = urljoin(base_url, next_page)
            yield scrapy.Request(absolute_next_page, self.parse, dont_filter=True)

    def parse_jobdetails(self, response):
        self.logger.info('Visited internal link %s', response.url)
        print(response)
        item = response.meta['item']
        item = self.getjobinformation(item, response)
        return item

    def getjobinformation(self, item, response):
        trans_table = {ord(c): None for c in u'\r\n\t\u00a0'}
        item['jobnature'] = response.xpath('//job-snapshot/dl/div[1]/dd//text()').extract_first()
        return item
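One detail worth flagging in the code above: the Scrapy documentation explicitly warns against using parse as a rule callback, because CrawlSpider uses parse internally to drive its rules, and overriding it can silently break link following. Below is a minimal sketch of the rename, with parse_page as a made-up name; this illustrates the documented constraint and is not confirmed to be the cause of the stall:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class It(CrawlSpider):
    name = 'it'
    # ... same settings as in something.py above ...
    rules = (
        Rule(LinkExtractor(allow_domains=("jobscentral.com.sg",),
                           restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="next"]',)),
             # renamed so it no longer shadows CrawlSpider's internal parse()
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # same body as the parse() method above
        self.logger.info("Visited outer link %s", response.url)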
This is the log output when it fails to start crawling:
2017-09-12 16:55:09 [scrapy.middleware] INFO: Enabled item pipelines:
['jobscentral.pipelines.JobsCentralPipeline']
2017-09-12 16:55:09 [scrapy.core.engine] INFO: Spider opened
2017-09-12 16:55:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-09-12 16:55:09 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-09-12 16:55:11 [scrapy.extensions.throttle] INFO: slot: jobscentral.com.sg | conc: 1 | delay: 4000 ms (-1000) | latency: 1993 ms | size: 67510 bytes
2017-09-12 16:55:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://jobscentral.com.sg/jobs-it> (referer: None)
2017-09-12 16:55:11 [it] INFO: Got response 200 'https://jobscentral.com.sg/jobs-it'
2017-09-12 16:55:11 [it] INFO: Visited outer link https://jobscentral.com.sg/jobs-it
2017-09-12 16:55:11 [scrapy.core.engine] INFO: Closing spider (finished)
2017-09-12 16:55:11 [it] DEBUG: Closing connection pool...
EDIT: Error log
<<<huge chunk of HTML>>> response.body here
---------------------------------------------------------
2017-09-12 17:39:01 [it] INFO: Visited outer link https://jobscentral.com.sg/jobs-it
2017-09-12 17:39:01 [scrapy.core.engine] INFO: Closing spider (finished)
2017-09-12 17:39:01 [it] DEBUG: Closing connection pool...
2017-09-12 17:39:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 290,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 68352,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 9, 12, 9, 39, 1, 683612),
 'log_count/DEBUG': 4,
 'log_count/INFO': 12,
 'memusage/max': 58212352,
 'memusage/startup': 58212352,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 9, 12, 9, 38, 58, 660671)}
2017-09-12 17:39:01 [scrapy.core.engine] INFO: Spider closed (finished)