python - Scrapy spider stops abruptly

I am using the example here to change identity with Tor/Privoxy, but I have faced several issues, such as having to type "scrapy crawl" multiple times to start the spider, or having the spider stop abruptly in the middle of a crawl without any sort of error message.

    class it(CrawlSpider):
        name = 'it'

        allowed_domains = [""]
        start_urls = [
            '',
        ]

        custom_settings = {
            'tor_renew_identity_enabled': True,
            'tor_items_to_scrape_per_identity': 20
        }

        download_delay = 4
        handle_httpstatus_list = [301, 302]

        rules = (
            #Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="next"]',)), callback="self.parse", follow=True),
            Rule(LinkExtractor(allow_domains=("", ), restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="next"]',)), callback='self.parse', follow=True),
        )

        def parse(self, response):
            items = []
            self.logger.info("visited outer link %s", response.url)

            for sel in response.xpath('//h4'):
                item = jobsitems()

            next_page = response.xpath('//li[@class="page-item"]/a[@aria-label="next"]/@href').extract_first()

            if next_page:
                base_url = get_base_url(response)
                absolute_next_page = urljoin(base_url, next_page)
                yield scrapy.Request(absolute_next_page, self.parse, dont_filter=True)

        def parse_jobdetails(self, response):
            self.logger.info('visited internal link %s', response.url)
            print response
            item = response.meta['item']
            item = self.getjobinformation(item, response)
            return item

        def getjobinformation(self, item, response):
            trans_table = {ord(c): None for c in u'\r\n\t\u00a0'}

            item['jobnature'] = response.xpath('//job-snapshot/dl/div[1]/dd//text()').extract_first()
            return item

Log output when it fails to start crawling:

    2017-09-12 16:55:09 [scrapy.middleware] INFO: Enabled item pipelines: ['jobscentral.pipelines.jobscentralpipeline']
    2017-09-12 16:55:09 [scrapy.core.engine] INFO: Spider opened
    2017-09-12 16:55:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2017-09-12 16:55:09 [scrapy.extensions.telnet] DEBUG: Telnet console listening on
    2017-09-12 16:55:11 [scrapy.extensions.throttle] INFO: slot: | conc: 1 | delay: 4000 ms (-1000) | latency: 1993 ms | size: 67510 bytes
    2017-09-12 16:55:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET> (referer: None)
    2017-09-12 16:55:11 [it] INFO: got response 200 ''
    2017-09-12 16:55:11 [it] INFO: visited outer link
    2017-09-12 16:55:11 [scrapy.core.engine] INFO: Closing spider (finished)
    2017-09-12 16:55:11 [it] DEBUG: closing connection pool...

Edit: error log

    <<<huge chunk of html>> response.body here
    ---------------------------------------------------------
    2017-09-12 17:39:01 [it] INFO: visited outer link
    2017-09-12 17:39:01 [scrapy.core.engine] INFO: Closing spider (finished)
    2017-09-12 17:39:01 [it] DEBUG: closing connection pool...
    2017-09-12 17:39:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 290,
     'downloader/request_count': 1,
     'downloader/request_method_count/get': 1,
     'downloader/response_bytes': 68352,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2017, 9, 12, 9, 39, 1, 683612),
     'log_count/debug': 4,
     'log_count/info': 12,
     'memusage/max': 58212352,
     'memusage/startup': 58212352,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2017, 9, 12, 9, 38, 58, 660671)}
    2017-09-12 17:39:01 [scrapy.core.engine] INFO: Spider closed (finished)
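
As a rough sanity check on the stats dump, the relevant counters can be read with a few lines of plain Python. This is only a sketch: `summarize` is a hypothetical helper name (not part of Scrapy), the values are copied by hand from the log, and it assumes the spider has exactly one start URL, as in the spider code.

```python
import datetime

# Values copied by hand from the stats dump in the log.
stats = {
    'downloader/request_count': 1,
    'finish_reason': 'finished',
    'finish_time': datetime.datetime(2017, 9, 12, 9, 39, 1, 683612),
    'scheduler/enqueued': 1,
    'scheduler/dequeued': 1,
    'start_time': datetime.datetime(2017, 9, 12, 9, 38, 58, 660671),
}


def summarize(stats, start_url_count=1):
    """Return (elapsed seconds, requests scheduled beyond the start URLs).

    Hypothetical helper for illustration; assumes one start URL by default.
    """
    elapsed = (stats['finish_time'] - stats['start_time']).total_seconds()
    follow_ups = stats['scheduler/enqueued'] - start_url_count
    return elapsed, follow_ups


elapsed, follow_ups = summarize(stats)
print("ran for %.2f s, %d follow-up request(s) scheduled" % (elapsed, follow_ups))
```

Read this way, the run lasted about three seconds and only the start request was ever enqueued, which suggests that neither the rule nor the `yield` in `parse` scheduled a next page on this run, so the engine closed the spider normally with `finish_reason` set to `finished`.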

