quan_quan_zhou  2019-06-12 11:25  acceptance rate: 0%
641 views

Crawling a novel with scrapy: after getting the book name and other info from the detail page, I go on to crawl the chapter text. How can I get all of the chapters into a single list?

I want the output in this form:

{"book_name": "XXXX", "writer": "XXX", "type": "XXX", "total_click": "XXX", "book_intro": "XXX", "label": ["XX", "XX", "XX", "XX"], "total_word_number": "XX ", "total_introduce": "XX", "week_introduce": "XX", "read_href": "XX", "chapters": [{"name": "第0001章 XX", "word_count": "XX", "time": "XX", "text": "XXXX"},{"name": "第0002章 XX", "word_count": "XX", "time": "XX", "text": "XXXX"},……]}


But with my current code, instead of all chapters ending up in one list inside a single item, every chapter is yielded as its own item. I know roughly where the logic goes wrong, but I don't know how to fix it.

The code is as follows:

# -*- coding: utf-8 -*-
import scrapy
from novel.items import NovelItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import re


url_page=1

class NovelSpider(CrawlSpider):
    name = 'novel'
    allowed_domains = ['book.zongheng.com']
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36", }
    start_urls = []
    # for i in range(1,2):
    i = 1
    start_urls.append('http://book.zongheng.com/store/c0/c0/b0/u1/p' + str(i) +'/v9/s9/t0/u0/i1/ALL.html')

    rules = (
        Rule(LinkExtractor(allow=r'book/\d+'), callback="parse_detail"),
    )

    def parse_detail(self, response):
        item = NovelItem()
        item['book_name'] = response.css('div.book-name::text').extract_first()
        item['writer'] = response.css("div.au-name a::text").extract_first()
        item['type'] = response.css(
            "body > div.wrap > div.book-html-box.clearfix > div.book-top.clearfix > div.book-main.fl > div.book-detail.clearfix > div.book-info > div.book-label > a.label::text").extract_first()
        item['total_click'] = response.css(
            "body > div.wrap > div.book-html-box.clearfix > div.book-top.clearfix > div.book-main.fl > div.book-detail.clearfix > div.book-info > div.nums > span:nth-child(3) > i::text").extract_first()
        item['book_intro'] = response.css(
            "body > div.wrap > div.book-html-box.clearfix > div.book-top.clearfix > div.book-main.fl > div.book-detail.clearfix > div.book-info > div********.book-dec.Jbook-dec.hide > p::text").extract_first()
        item['label'] = response.xpath("//div[@class='book-label']/span/a/text()").extract()
        item['total_word_number'] = response.xpath("//div[@class='nums']/span[1]/i/text()").extract_first()
        item['total_introduce'] = response.xpath("//div[@class='nums']/span[2]/i/text()").extract_first()
        item['week_introduce'] = response.xpath("//div[@class='nums']/span[4]/i/text()").extract_first()
        read_href = response.css("div.btn-group>a::attr(href)").extract_first()

        if read_href:
            yield scrapy.Request(
                read_href,
                callback=self.parse_content,
                dont_filter=True,
                meta={"item": item},
            )

    def parse_content(self, response):  # handle the chapter body text
        item = response.meta["item"]
        chapters = []
        chapter_name = response.css("div.title_txtbox::text").extract_first()
        word_count = response.css("#readerFt > div > div.bookinfo > span:nth-child(2)::text").extract_first()
        time = response.css("#readerFt > div > div.bookinfo > span:nth-child(3)::text").extract_first()
        content_link = response.css("div.content")
        paragraphs = content_link.css("p::text").extract()
        content_text = ""
        for paragraph in paragraphs:
            content_text = content_text + paragraph + "\n"

        content = dict(name=chapter_name,word_count=word_count,time=time,text=content_text)
        chapters.append(content)
        item['chapters'] = chapters  # I think the problem is here, but I don't know how to fix it
        global url_page
        url_page = url_page+1
        next_page = response.css("a.nextchapter::attr(href)").extract_first()
        if url_page<21:
            yield scrapy.Request(
                next_page,
                callback=self.parse_content,
                dont_filter = True,
                meta = {"item": item},
            )
        # print(chapters)
        yield item
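
One way to restructure this (a minimal sketch, assuming the same selectors and keeping the 20-chapter cap): create the chapters list once in parse_detail, keep appending to it in parse_content, and only yield the item after the last wanted chapter has been parsed. The chapter counter travels in meta instead of a module-level global, so separate books do not interfere with each other.

    # In parse_detail, initialise the per-book list and counter before
    # requesting the first chapter page:
    #     item['chapters'] = []
    #     yield scrapy.Request(read_href, callback=self.parse_content,
    #                          dont_filter=True,
    #                          meta={"item": item, "chapter_no": 1})

    def parse_content(self, response):
        item = response.meta["item"]
        chapter_no = response.meta["chapter_no"]

        chapter_name = response.css("div.title_txtbox::text").extract_first()
        word_count = response.css("#readerFt > div > div.bookinfo > span:nth-child(2)::text").extract_first()
        time = response.css("#readerFt > div > div.bookinfo > span:nth-child(3)::text").extract_first()
        paragraphs = response.css("div.content p::text").extract()
        content_text = "\n".join(paragraphs)

        # Append this chapter to the list stored on the item instead of
        # rebuilding the list (and yielding the item) for every chapter.
        item['chapters'].append(dict(name=chapter_name, word_count=word_count,
                                     time=time, text=content_text))

        next_page = response.css("a.nextchapter::attr(href)").extract_first()
        if next_page and chapter_no < 20:
            # More chapters wanted: keep passing the same item along.
            yield scrapy.Request(
                next_page,
                callback=self.parse_content,
                dont_filter=True,
                meta={"item": item, "chapter_no": chapter_no + 1},
            )
        else:
            # Last chapter reached: yield the item exactly once,
            # with every chapter collected in item['chapters'].
            yield item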


1 answer

  • CSDN-Ada Assistant (CSDN-AI official account) 2022-09-09 18:36

    Not sure whether this problem has been solved yet; if it has not:

    If you have already solved it, it would be great if you could share your solution to help more people ^-^
