(Python 3): Scrapy MongoDB pipeline not working

January 15, 2011

I am attempting to connect MongoDB to a Scrapy pipeline via pymongo in order to create a new database and populate it with what I have scraped, but I am running into a weird issue. I followed the basic tutorials and set up two command lines, one to run Scrapy in and the other to run mongod. Unfortunately, when I run the Scrapy code after starting mongod, mongod does not appear to pick up on the Scrapy pipeline I am trying to set up, and just maintains the 'waiting for connections on port 27017' notice.

In command line 1 (Scrapy), I have the directory set to documents/pyprojects/twitterbot/krugman

In command line 2 (mongod), I have it set to documents/pyprojects/twitterbot
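Note that mongod's "waiting for connections" line only means the server is listening; it stays on screen until a client actually connects. One way to rule out a connectivity problem before digging into Scrapy is to probe the port directly. The following is a minimal sketch using only the standard library; the helper name `port_is_open` is hypothetical, and the host and port are assumed to match the `localhost` / `27017` values from the settings file.

```python
import socket


def port_is_open(host="localhost", port=27017, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Run this from a third terminal while mongod is running.
    print("mongod reachable:", port_is_open())
```

If this reports that mongod is reachable while the mongod terminal still never logs a connection during a crawl, the pipeline is most likely never opening a client at all, which points to the Scrapy configuration rather than to MongoDB.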

The scripts I am using are as follows: krugman/krugman/spiders/krugspider.py (pulls Paul Krugman blog entries):

from scrapy import http
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider
import scrapy
import pymongo
import json
from krugman.items import BlogPost


class KrugSpider(CrawlSpider):
    name = 'krugbot'
    start_url = ['https://krugman.blogs.nytimes.com']

    def __init__(self):
        self.url = 'https://krugman.blogs.nytimes.com/more_posts_jsons/page/{0}/?homepage=1&apagenum={0}'

    def start_requests(self):
        yield http.Request(self.url.format('1'), callback = self.parse_page)

    def parse_page(self, response):
        data = json.loads(response.body)
        for block in range(len(data['posts'])):
            for article in self.parse_block(data['posts'][block]):
                yield article

        page = data['args']['paged'] + 1
        url = self.url.format(str(page))
        yield http.Request(url, callback = self.parse_page)

    def parse_block(self, content):
        article = BlogPost(author = 'paul krugman', source = 'blog')

        paragraphs = Selector(text = str(content['html']))

        article['paragraphs']= paragraphs.css('p.story-body-text::text').extract()
        article['links'] = paragraphs.css('p.story-body-text a::attr(href)').extract()
        article['datetime'] = content['post_date']
        article['post_id'] = content['post_id']
        article['url'] = content['permalink']
        article['title'] = content['headline']

        yield article

krugman/krugman/settings.py:

ITEM_PIPELINES = ['krugman.pipelines.KrugmanPipeline']

MONGODB_SERVER = 'localhost'
MONGODB_PORT = 27017
MONGODB_DB = 'scrapedb'
MONGODB_TWEETS = 'tweetcol'
MONGODB_FACEBOOK = 'fbcol'
MONGODB_BLOG = 'blogcol'

krugman/krugman/pipelines.py:

from pymongo import MongoClient
from scrapy.conf import settings
from scrapy import log


class KrugmanPipeline(object):

    def __init(self):
        connection = MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_BLOG']]

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        log.msg("test out")
        return item

I'm not getting any error messages, so I'm having difficulty troubleshooting. It just seems to be refusing to fire off at all. Any ideas what the problem could be?
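A side note on the pipeline as posted: its constructor is spelled `__init` rather than `__init__`. Python only calls `__init__` automatically, so a method named `__init` never runs on instantiation and `self.collection` is never set. The sketch below illustrates this with two stand-in classes (the class names and the placeholder string are hypothetical, and no MongoDB connection is involved).

```python
class BrokenPipeline:
    def __init(self):
        # Never called automatically: the name is not the special __init__.
        self.collection = "connected"


class FixedPipeline:
    def __init__(self):
        # Called by Python every time FixedPipeline() is instantiated.
        self.collection = "connected"


broken = BrokenPipeline()
fixed = FixedPipeline()

print(hasattr(broken, "collection"))  # False
print(hasattr(fixed, "collection"))   # True
```

Because of this, if the posted pipeline were actually being run, `process_item` would be expected to raise an `AttributeError` on `self.collection`. The absence of any error suggests the pipeline is not being loaded in the first place.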

In your settings, you didn't add a Mongo pipeline to ITEM_PIPELINES, for example:

ITEM_PIPELINES = {
    'crawler.pipelines.MongoPipeline': 800,
    'scrapy.pipelines.images.ImagesPipeline': 300,
}
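For reference, here is a minimal sketch of what a pipeline registered this way might look like. It is an assumption about structure, not the answerer's actual `MongoPipeline`: the class body, the default values, and the lazy `pymongo` import are illustrative. It reads the `MONGODB_*` setting names from the question through Scrapy's `from_crawler` hook instead of the `scrapy.conf` import, which, as far as I know, newer Scrapy releases no longer provide.

```python
class MongoPipeline:
    """Illustrative item pipeline that writes each item to a MongoDB collection."""

    def __init__(self, host, port, db_name, collection_name):
        self.host = host
        self.port = port
        self.db_name = db_name
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook and passes the crawler, which exposes settings.
        s = crawler.settings
        return cls(
            host=s.get("MONGODB_SERVER", "localhost"),
            port=s.getint("MONGODB_PORT", 27017),
            db_name=s.get("MONGODB_DB", "scrapedb"),
            collection_name=s.get("MONGODB_BLOG", "blogcol"),
        )

    def open_spider(self, spider):
        # Imported here so the module still loads where pymongo is missing.
        import pymongo

        self.client = pymongo.MongoClient(self.host, self.port)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Convert the item to a plain dict before inserting it.
        self.collection.insert_one(dict(item))
        return item


# In settings.py, the pipeline is enabled with a dict entry (path: order), e.g.
# ITEM_PIPELINES = {'krugman.pipelines.KrugmanPipeline': 300}
```

Applied to the question's project, the dict key would presumably be `'krugman.pipelines.KrugmanPipeline'`; the order value `300` is an arbitrary choice for illustration.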
