苦蓝 2023-07-20 04:10 · Acceptance rate: 60%
137 views
Question closed

Python web scraper | Scraping a novel | Why does it scrape nothing?


import json
import re

import requests
import os
import sys
import traceback
sys.tracebacklimit=0
url='https://www.qb5200.la/book/116524/'
ajax_url='https://pagead2.googlesyndication.com/getconfig/sodar?sv=200&tid=gda&tv=r20230718&st=env'
headers={
':authority: pagead2.googlesyndication.com',
':method: GET',
':path: /getconfig/sodar?sv=200&tid=gda&tv=r20230718&st=env',
':scheme: https',
'accept: */*',
'accept-encoding: gzip, deflate, br',
'accept-language: zh-CN,zh;q=0.9',
'origin: https://www.qb5200.la',
'referer: https://www.qb5200.la/',
'sec-ch-ua: ";Not A Brand";v="99", "Chromium";v="94"',
'sec-ch-ua-mobile: ?0',
'sec-ch-ua-platform: "Windows"',
'sec-fetch-dest: empt',
'sec-fetch-mode: cors',
'sec-fetch-site: cross-site',
'user-agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36',
}
start_url=requests.get(url,headers=headers).content.decode('gbk','ignore')
ajax_urlz=requests.get(ajax_url,headers=headers).content.decode('gbk','ignore')

def get_toc(html):

      toc_url_list=[]
      toc_block=re.findall('<dl class="zjlist>(.*?)</dl>',html,re.S)[0]
      toc_url=re.findall('href="(.*?)"',toc_block,re.S)
      for url in toc_url:
         toc_url_list.append(start_url+url)
         return toc_url_list
def get_article(html):
     chapter_name=re.search('<div class="border">(.*?)</div>',html,re.S).group(1)
     chapter_namez=chapter_name.select('h1:nth-of-type(1)')
     text_block=re.search('<div id="content">(.*?)</div>',html,re.S).group(1)
     text_block=text_block.replace('<br>','')
     return chapter_namez,text_block
def save(chapter_namez,text_block):
     os.makedirs('星门',exist_ok=True)
     with open(os.path.join('星门',chapter_namez+'.txt'),'w',encoding='gbk')as f:
         f.write(text_block)



After the changes, it still scrapes nothing:

import json
import re
import requests
import os
import sys
import traceback

sys.tracebacklimit = 0

url = 'https://www.qb5200.la/book/116524/'
ajax_url = 'https://pagead2.googlesyndication.com/getconfig/sodar?sv=200&tid=gda&tv=r20230718&st=env'

headers = {
    'authority': 'pagead2.googlesyndication.com',
    'method': 'GET',
    'path': '/getconfig/sodar?sv=200&tid=gda&tv=r20230718&st=env',
    'scheme': 'https',
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9',
    'origin': 'https://www.qb5200.la',
    'referer': 'https://www.qb5200.la/',
    'sec-ch-ua': '";Not A Brand";v="99", "Chromium";v="94"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'cross-site',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36',
}

start_url = requests.get(url, headers=headers).content.decode('gbk', 'ignore')
ajax_urlz = requests.get(ajax_url, headers=headers).content.decode('gbk', 'ignore')


def get_toc(html):
    toc_url_list = []
    toc_block = re.findall('<dl class="zjlist>(.*?)</dl>', html, re.S)[0]
    toc_url = re.findall('href="(.*?)"', toc_block, re.S)
    for url in toc_url:
        toc_url_list.append(start_url + url)
    return toc_url_list


def get_article(html):
    chapter_name = re.search('<div class="border">(.*?)</div>', html, re.S).group(1)
    chapter_name = chapter_name.select('h1:nth-of-type(1)')
    text_block = re.search('<div id="content">(.*?)</div>', html, re.S).group(1)
    text_block = text_block.replace('<br>', '')
    return chapter_name, text_block


def save(chapter_namez, text_block):
    os.makedirs('星门', exist_ok=True)
    i = 0;
    while i < 627 in chapter_namez:
        i += 1;
        chapter_name = chapter_namez[i]
        if chapter_name:
            break
        else:
            'Unknown_Chapter_Name'
    with open(os.path.join('星门', chapter_namez + '.txt'), 'w', encoding='gbk') as f:
        f.write(text_block)



8 answers

  • cjh4312 2023-07-20 04:56

    For this kind of scraping, XPath works better than re.

    import requests
    from lxml import etree
    
    url='https://www.qb5200.la/book/116524/'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
    
    # fetch the table-of-contents page and parse it with lxml
    res=requests.get(url,headers=headers)
    html=etree.HTML(res.text)
    # chapter titles and relative links both live under <dl class="zjlist">
    chapter_name=html.xpath("//*/dl[@class='zjlist']/dd//text()")
    href=html.xpath("//*/dl[@class='zjlist']/dd/a/@href")
    base_url="https://www.qb5200.la/book/116524/"
    for i in range(len(chapter_name)):
        print(chapter_name[i],base_url+href[i])
        # fetch each chapter page and extract the text inside <div id="content">
        data=requests.get(base_url+href[i],headers=headers)
        html=etree.HTML(data.text)
        content=html.xpath("//*/div[@id='content']//text()")
        print(content)
    


    This answer was accepted by the asker as the best answer.
    苦蓝 2023-07-20 08:38

    Thank you.

    苦蓝 2023-07-20 09:36

    How do I save the chapters into a folder? (see the sketch after these comments)

    苦蓝 replying to 苦蓝 2023-07-20 09:36

    🥺

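    On the save-to-folder follow-up: a minimal sketch that extends the accepted answer's XPath approach. The output folder 星门, the filename sanitising, and the use of UTF-8 are illustrative choices, not something given in the thread.

    import os
    import requests
    from lxml import etree
    
    base_url = 'https://www.qb5200.la/book/116524/'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
    
    os.makedirs('星门', exist_ok=True)  # create the output folder once
    
    # table of contents: chapter titles and relative chapter links
    toc = etree.HTML(requests.get(base_url, headers=headers).text)
    names = toc.xpath("//dl[@class='zjlist']/dd/a/text()")
    hrefs = toc.xpath("//dl[@class='zjlist']/dd/a/@href")
    
    for name, href in zip(names, hrefs):
        page = etree.HTML(requests.get(base_url + href, headers=headers).text)
        text = '\n'.join(page.xpath("//div[@id='content']//text()"))
        # drop characters Windows forbids in file names (illustrative safeguard)
        safe = ''.join(c for c in name if c not in '\\/:*?"<>|').strip()
        with open(os.path.join('星门', safe + '.txt'), 'w', encoding='utf-8') as f:
            f.write(text)
    
    Writing with encoding='utf-8' rather than 'gbk' (as in the question's save()) avoids a UnicodeEncodeError when a chapter contains characters outside GBK.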

Question events

  • Closed by the system on July 27
  • Answer accepted on July 20
  • Question modified on July 20
  • Question modified on July 20
