Scrapy - How to ignore URLs with query strings (?xxx=xxx)
I want the spider to ignore any URL that has a query string. I have tried adding an expression for \? to the deny rules of the LinkExtractor (see below), but it gets ignored, i.e. the spider is still crawling/extracting URLs that include the ? character.

I have a single start URL at the root of the domain, and all further links are crawled via the LinkExtractor.

Here is the rule in my CrawlSpider implementation:
    rules = (
        Rule(
            LinkExtractor(allow=(), deny=(':443', ':80', r'\?')),
            callback='parse_page',
            follow=True,
        ),
    )
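For reference, this is roughly how that rule sits inside the spider; a minimal sketch, where the spider name, domain, and start URL (example.com) are placeholders rather than my real values:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class MySpider(CrawlSpider):
        # Placeholder name/domain; the real spider targets a different site
        name = 'example'
        allowed_domains = ['example.com']
        start_urls = ['http://example.com/']

        rules = (
            Rule(
                LinkExtractor(allow=(), deny=(':443', ':80', r'\?')),
                callback='parse_page',
                follow=True,
            ),
        )

        def parse_page(self, response):
            # Stand-in callback; the real parsing logic is omitted here
            self.logger.info('Visited %s', response.url)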
URLs that include the port numbers are being excluded, but those containing the ? character are still being included.
The docs don't discuss this particular use case, or at least I cannot find it. Does anyone have ideas on how to exclude URLs that contain query strings from being extracted? I am using Scrapy 1.4.0.
Update

For some reason, Scrapy seems to be ignoring expressions containing the ? character in the deny attribute of the LinkExtractor definition. I got an alternative approach working, however, by filtering the links with a process_links callback:
    rules = (
        Rule(
            LinkExtractor(allow=(), deny=(':443', ':80')),
            process_links='filter_links',
            callback='parse_page',
            follow=True,
        ),
    )

    def filter_links(self, links):
        # Skip any extracted link whose URL contains a query string
        for link in links:
            if '?' in link.url:
                continue
            yield link
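For anyone who prefers not to search for a raw ? character, an equivalent filter could check the parsed query component instead; a sketch using only the standard library (Python 3), assuming the same process_links hook:

    from urllib.parse import urlparse

    def filter_links(self, links):
        # Keep only links whose URL has an empty query component
        return [link for link in links if not urlparse(link.url).query]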