Here is the Spider code:
    def parse(self, response):
        # select the <a> elements themselves (not their hrefs),
        # so the per-link xpath calls below work on selectors
        url_list = response.xpath('//a')
        for single_url in url_list:
            url = 'https:' + single_url.xpath('./@href').extract()[0]
            name = single_url.xpath('./text()').extract()[0]
            yield scrapy.Request(url=url, callback=self.parse_get,
                                 meta={'url': url, 'name': name})

    def parse_get(self, response):
        print(1)
        item = MySpiderItem()
        item['name'] = response.meta['name']
        item['url'] = response.meta['url']
        yield item
Here is the middlewares code:
    def process_request(self, request, spider):
        self.driver = webdriver.Chrome()
        self.driver.get(request.url)
        if 'anime' in request.meta:
            element = WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.ID, 'header')))
        else:
            element = WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.ID, 'header')))
        html = self.driver.page_source
        self.driver.quit()
        # replace the download result with the Selenium-rendered page
        return scrapy.http.HtmlResponse(url=request.url, body=html,
                                        request=request, encoding='utf-8')
I'm running this with Chrome. The URLs from the Requests do get opened one by one in the browser, but parse_get is never called. I never set allowed_domains, and I also tried adding dont_filter=True to the Request; since the pages do open, it shouldn't be a problem of the requests being filtered out. I'm completely out of ideas, any guidance would be much appreciated!
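In case it matters, the middleware is enabled in settings.py roughly like this (the module path and class name here are placeholders for my project's actual names):

```python
# settings.py (sketch; 'my_spider.middlewares.SeleniumMiddleware'
# stands in for the real module path and class name)
DOWNLOADER_MIDDLEWARES = {
    'my_spider.middlewares.SeleniumMiddleware': 543,
}
```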