(Python 3): Scrapy MongoDB pipeline not working
I am attempting to connect MongoDB to my Scrapy pipeline via pymongo in order to create a new database and populate it with what I have scraped, but I am running into a weird issue. I followed the basic tutorials and set up two command lines, one to run Scrapy in and the other to run mongod. Unfortunately, when I run the Scrapy code after starting mongod, mongod does not appear to pick up on the Scrapy pipeline I am trying to set up and just maintains its 'waiting for connections on port 27107' notice.
In command line 1 (Scrapy), the directory is set to documents/pyprojects/twitterbot/krugman
In command line 2 (mongod), the directory is set to documents/pyprojects/twitterbot
The scripts I am using are as follows. krugman/krugman/spiders/krugspider.py (pulls Paul Krugman's blog entries):
from scrapy import http
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider
import scrapy
import pymongo
import json
from krugman.items import BlogPost


class KrugSpider(CrawlSpider):
    name = 'krugbot'
    start_url = ['https://krugman.blogs.nytimes.com']

    def __init__(self):
        self.url = 'https://krugman.blogs.nytimes.com/more_posts_jsons/page/{0}/?homepage=1&apagenum={0}'

    def start_requests(self):
        yield http.Request(self.url.format('1'), callback = self.parse_page)

    def parse_page(self, response):
        data = json.loads(response.body)
        for block in range(len(data['posts'])):
            for article in self.parse_block(data['posts'][block]):
                yield article
        page = data['args']['paged'] + 1
        url = self.url.format(str(page))
        yield http.Request(url, callback = self.parse_page)

    def parse_block(self, content):
        article = BlogPost(author = 'paul krugman', source = 'blog')
        paragraphs = Selector(text = str(content['html']))
        article['paragraphs'] = paragraphs.css('p.story-body-text::text').extract()
        article['links'] = paragraphs.css('p.story-body-text a::attr(href)').extract()
        article['datetime'] = content['post_date']
        article['post_id'] = content['post_id']
        article['url'] = content['permalink']
        article['title'] = content['headline']
        yield article
krugman/krugman/settings.py:
ITEM_PIPELINES = ['krugman.pipelines.KrugmanPipeline']
MONGODB_SERVER = 'localhost'
MONGODB_PORT = 27017
MONGODB_DB = 'scrapedb'
MONGODB_TWEETS = 'tweetcol'
MONGODB_FACEBOOK = 'fbcol'
MONGODB_BLOG = 'blogcol'
krugman/krugman/pipelines.py:
from pymongo import MongoClient
from scrapy.conf import settings
from scrapy import log


class KrugmanPipeline(object):
    def __init(self):
        connection = MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_BLOG']]

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        log.msg("test out")
        return item
I'm not getting any error messages, so I'm having difficulty troubleshooting. The pipeline just seems to be refusing to fire off at all. Any ideas what my problem could be?
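In your settings, you didn't add the MongoPipeline. ITEM_PIPELINES should be a dict that maps each pipeline's import path to an integer order value: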
ITEM_PIPELINES = {
    'crawler.pipelines.MongoPipeline': 800,
    'scrapy.pipelines.images.ImagesPipeline': 300,
}
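Apart from the settings entry, the pipeline in the question defines its constructor as `__init` rather than `__init__`. Python never calls a method with that name automatically, so `self.collection` would never be set. Below is a minimal sketch of how the pipeline might be restructured, assuming the setting names from the question (MONGODB_SERVER, MONGODB_PORT, MONGODB_DB, MONGODB_BLOG). It reads settings through `from_crawler` instead of `scrapy.conf`, which recent Scrapy releases no longer provide. The `client_factory` parameter is a hypothetical hook added here only so the class can be exercised without a running mongod; it is not part of Scrapy or pymongo.

```python
class KrugmanPipeline:
    """Sketch of an item pipeline that stores each scraped item in MongoDB."""

    def __init__(self, server, port, db_name, collection_name, client_factory=None):
        # Double underscores on both sides: a method named `__init`
        # would be ignored when the pipeline is instantiated.
        self.server = server
        self.port = port
        self.db_name = db_name
        self.collection_name = collection_name
        # Hypothetical testing hook; None means "use pymongo.MongoClient".
        self.client_factory = client_factory
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to build the pipeline and hands over the crawler,
        # whose .settings object exposes the values from settings.py.
        s = crawler.settings
        return cls(
            server=s.get('MONGODB_SERVER', 'localhost'),
            port=s.getint('MONGODB_PORT', 27017),
            db_name=s.get('MONGODB_DB'),
            collection_name=s.get('MONGODB_BLOG'),
        )

    def open_spider(self, spider):
        # Connect once when the spider starts.
        factory = self.client_factory
        if factory is None:
            from pymongo import MongoClient  # imported lazily, only when needed
            factory = MongoClient
        self.client = factory(self.server, self.port)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        # Release the connection when the spider finishes.
        self.client.close()

    def process_item(self, item, spider):
        # Store a plain-dict copy of the item and pass the item along.
        self.collection.insert_one(dict(item))
        return item
```

For the question's project, the matching settings entry would presumably look like ITEM_PIPELINES = {'krugman.pipelines.KrugmanPipeline': 300}, where 300 is an arbitrary order value chosen for illustration. Note also that mongod keeps printing its "waiting for connections" notice until a client actually connects, so a pipeline that never runs leaves that message unchanged.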