Scrapy - How to ignore URLs with query strings (?xxx=xxx)
I want the spider to ignore any URL that has a query string. I have tried adding an expression for \? to the deny rules of the LinkExtractor (see below), but it gets ignored, i.e. the spider is still crawling/extracting URLs that include the ? character.

I have a single start URL at the root of the domain, and all further links are crawled via the LinkExtractor.

Here is the rule in my CrawlSpider implementation:
    rules = (
        Rule(
            LinkExtractor(allow=(), deny=(':443', ':80', r'\?')),
            callback='parse_page',
            follow=True,
        ),
    )
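For reference, this is roughly how that rule sits inside the spider; a minimal sketch, where the spider name, domain, and start URL (example.com) are placeholders rather than my real values:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class MySpider(CrawlSpider):
        # Placeholder name/domain; the real spider targets a different site
        name = 'example'
        allowed_domains = ['example.com']
        start_urls = ['http://example.com/']

        rules = (
            Rule(
                LinkExtractor(allow=(), deny=(':443', ':80', r'\?')),
                callback='parse_page',
                follow=True,
            ),
        )

        def parse_page(self, response):
            # Stand-in callback; the real parsing logic is omitted here
            self.logger.info('Visited %s', response.url)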
URLs that include the port numbers are being excluded, but those containing the ? character are still being included.
The docs don't discuss this particular use case, or at least I cannot find it. Does anyone have ideas on how to exclude URLs that contain query strings from being extracted? I am using Scrapy 1.4.0.
Update

For some reason, Scrapy seems to be ignoring expressions containing the ? character in the deny attribute of the LinkExtractor definition. I got an alternative approach working, however, by filtering the links with a process_links callback:
    rules = (
        Rule(
            LinkExtractor(allow=(), deny=(':443', ':80')),
            process_links='filter_links',
            callback='parse_page',
            follow=True,
        ),
    )

    def filter_links(self, links):
        # Skip any extracted link whose URL contains a query string
        for link in links:
            if '?' in link.url:
                continue
            yield link
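For anyone who prefers not to search for a raw ? character, an equivalent filter could check the parsed query component instead; a sketch using only the standard library (Python 3), assuming the same process_links hook:

    from urllib.parse import urlparse

    def filter_links(self, links):
        # Keep only links whose URL has an empty query component
        return [link for link in links if not urlparse(link.url).query]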