python - ValueError: Missing scheme in request url: h when using media pipeline
I am trying to download PDFs from a website. I followed the instructions provided on the Scrapy website, but I got this error:
  File "/home/joseph/env/lib/python3.5/site-packages/scrapy/http/request/__init__.py", line 58, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
2017-09-12 17:47:40 [scrapy.core.scraper] ERROR: Error processing {'file_urls': 'https://www.sec.gov/divisions/corpfin/cf-noaction/2008/jpmorgan080409-405.pdf', 'title': ('jpmorgan chase & co.',)}
settings.py
ITEM_PIPELINES = {
    'sec_scrape.pipelines.SecScrapePipeline': 300,
    'sec_scrape.pipelines.JsonWriterPipeline': 800,
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = '/home/joseph/pdf'
items.py
import scrapy

class LetterItem(scrapy.Item):
    title = scrapy.Field()
    file_urls = scrapy.Field()
    files = scrapy.Field()
spider.py
import scrapy
from sec_scrape.items import LetterItem

class QuotesSpider(scrapy.Spider):
    name = "corporate_finance"
    allowed_domains = ["sec.gov"]
    start_urls = ['https://www.sec.gov/divisions/corpfin/cf-noaction.shtml']

    def parse(self, response):
        for letter in response.xpath('//table[2]/tr/td[3]/ul[74]/li/a'):
            item = LetterItem()
            item['title'] = letter.xpath('text()').extract_first(),
            item['file_urls'] = response.urljoin(letter.xpath('@href').extract_first())
            yield item

Any idea why I am getting this error?
Thank you.
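The file_urls item attribute has to be a list, while you set it to a string (the URL of the file to download). The files pipeline iterates over file_urls, so a string is consumed one character at a time, and the first "URL" it tries to request is just h, which is exactly what the error message shows. Change the line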
item['file_urls'] = response.urljoin(letter.xpath('@href').extract_first())
to
item['file_urls'] = [response.urljoin(letter.xpath('@href').extract_first())]
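To see why the message ends in "h", here is a minimal sketch that does not use Scrapy at all. The helper urls_missing_scheme is hypothetical and only imitates a pipeline that loops over file_urls and builds one request per element; it is not Scrapy's actual code.

```python
from urllib.parse import urlparse


def urls_missing_scheme(file_urls):
    """Return the elements of file_urls that have no URL scheme.

    Hypothetical stand-in for a pipeline that iterates over file_urls
    and creates one request per element.
    """
    return [u for u in file_urls if not urlparse(u).scheme]


url = "https://www.sec.gov/divisions/corpfin/cf-noaction/2008/jpmorgan080409-405.pdf"

# A bare string is iterated character by character, so every "URL" is a
# single character and the first one is "h".
as_string = urls_missing_scheme(url)

# A one-element list is iterated once, with the whole URL intact.
as_list = urls_missing_scheme([url])

print(as_string[0])  # prints: h
print(as_list)       # prints: []
```

Wrapping the URL in a list is enough here because each item carries a single file; an item with several files would simply put all of their URLs in the same list.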