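Help wanted: I'm scraping a newspaper site with Python, but the pages seem to rely on values that are passed in, and I can't extract any content. Could someone help me fix the code? Many thanks.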

Part of the page source:

<dt>本期版面导航</dt>
<div class="dd-box">
<dd>
<a class="page-name" href="index.html?date={{jdate}}&page={{pnumber}}">{{pnumber}}版：{{pname}}</a>


<dt>本版新闻列表（<span id="news-num">0</span>）</dt>
<div class="dd-box news-list">
<dd><a href="detail.html?date={{jdate}}&id={{id}}&page={{pageNo}}" target="_blank"><i>●</i>{{title}}</a></dd>

Python code:

import requests
import bs4
import os
import datetime
import time


def fetchUrl(url):
    '''
    Purpose: request the page at url and return its content
    Args: url of the target page
    Returns: HTML content of the target page
    '''

    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    }

    r = requests.get(url, headers=headers)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    return r.text


def getPageList(year, month, day):
    '''
    Purpose: get the list of links to each page of that day's paper
    Args: year, month, day
    '''
    url = 'https://www.shobserver.com/staticsg/res/html/journal/index.html?date=' + year + '-' + month + '-' + day + '&page=01'
    html = fetchUrl(url)
    bsobj = bs4.BeautifulSoup(html, 'html.parser')
    pageList = bsobj.find('div', attrs={'class': 'dd-box'}).find_all('dd')
    linkList = []

    for page in pageList:
        tempList = page.find_all('a')
        for temp in tempList:
            link = temp["href"]
            if 'index.html' in link:
                url = 'https://www.shobserver.com/staticsg/res/html/journal/' + link
        linkList.append(url)

    return linkList


def getTitleList(year, month, day, pageUrl):
    '''
    Purpose: get the list of article links on one page of the paper
    Args: year, month, day, link of that page
    '''
    html = fetchUrl(pageUrl)
    bsobj = bs4.BeautifulSoup(html, 'html.parser')
    titleList = bsobj.find('div', attrs={'class': 'dd-box news-list'}).find_all('dd')
    linkList = []

    for title in titleList:
        tempList = title.find_all('a')
        for temp in tempList:
            link = temp["href"]
            if 'detail.html' in link:
                url = 'https://www.shobserver.com/staticsg/res/html/journal/' + link
        linkList.append(url)
    return linkList


def getContent(html):
    '''
    Purpose: parse the HTML page and extract the article text
    Args: html content of the page
    '''
    bsobj = bs4.BeautifulSoup(html, 'html.parser')

    # Get the article title
    title = bsobj.find_all('div', attrs={'class': 'con-title'})
    content1 = ''
    for p1 in title:
        content1 += p1.text + '\n'
        # print(content1)

    # Get the article body
    pList = bsobj.find_all('div', attrs={'class': 'txt-box'})
    content = ''
    for p in pList:
        content += p.text + '\n'
        # print(content)

    # Return the result: title + body
    resp = content1 + content
    return resp


def saveFile(content, path, filename):
    '''
    Purpose: save the article text in content to a local file
    Args: content to save, path, file name
    '''
    # Create the folder if it does not exist yet
    if not os.path.exists(path):
        os.makedirs(path)

    # Save the file
    with open(path + filename, 'w', encoding='utf-8') as f:
        f.write(content)


def download_rmrb(year, month, day, destdir):
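    '''
    Purpose: download the site's news for the given year, month and day, and save it under the given directory
    Args: year, month, day, root directory for saved files
    '''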
    pageList = getPageList(year, month, day)
    for page in pageList:
        titleList = getTitleList(year, month, day, page)
        for url in titleList:
            # Get the article content
            html = fetchUrl(url)
            content = getContent(html)

            # Build the output path and file name
            temp = url.split('=')[-2].split('&')[0].split('-')
            pageNo = temp[0]
            titleNo = temp[0] if int(temp[0]) >= 10 else '0' + temp[0]
            path = destdir + '/' + year + month + day + '/'
            fileName = year + month + day + '-' + pageNo + '-' + titleNo + '.txt'

            # Save the file
            saveFile(content, path, fileName)


if __name__ == '__main__':
    '''
    Main function: program entry point
    '''
    # Scrape the news for the given date
    newsDate = input('Enter the date to scrape (format: 20210916): ')

    year = newsDate[0:4]
    month = newsDate[4:6]
    day = newsDate[6:8]

    download_rmrb(year, month, day, 'D:02/cqrb')
    print("Scraping finished: " + year + month + day)

Any guidance on solving this would be much appreciated. Thank you.


Answer from 机灵鹤, 2021-04-19 20:12:

import requests
import bs4
import os
import datetime
import time
import json
 
def fetchUrl(url):
    '''
    Purpose: request the page at url and return its content
    Args: url of the target page
    Returns: HTML content of the target page
    '''
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    }
    r = requests.get(url, headers=headers)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    return r.text

def saveFile(content, path, filename):
    '''
    Purpose: save the article text in content to a local file
    Args: content to save, path, file name
    '''
    # Create the folder if it does not exist yet
    if not os.path.exists(path):
        os.makedirs(path)
    # Save the file
    with open(path + filename, 'w', encoding='utf-8') as f:
        f.write(content)

def download_rmrb(year, month, day, destdir):
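    '''
    Purpose: download the site's news for the given year, month and day, and save it under the given directory
    Args: year, month, day, root directory for saved files
    '''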
    url = 'https://www.shobserver.com/staticsg/data/journal/' + year + '-' + month + '-' + day + '/navi.json'
    html = fetchUrl(url)
    jsonObj = json.loads(html)

    for page in jsonObj["pages"]:
        pageName = page["pname"]
        pageNo = page["pnumber"]
        print(pageNo, pageName)
        for article in page["articleList"]:
            title = article["title"]
            subtitle = article["subtitle"]
            pid = article["id"]
            url = "https://www.shobserver.com/staticsg/data/journal/" + year + '-' + month + '-' + day + "/" + str(pageNo) + "/article/" + str(pid) + ".json"
            print(pid, title, subtitle)

            html = fetchUrl(url)
            cont = json.loads(html)["article"]["content"]
            bsobj = bs4.BeautifulSoup(cont, 'html.parser')
            content = title + subtitle + bsobj.text
            print(content)
            
            path = destdir + '/' + year + month + day + '/' + str(pageNo) + " " + pageName + "/"
            # str() keeps the concatenation safe if pnumber comes back as a number
            fileName = year + month + day + '-' + str(pageNo) + '-' + str(pid) + "-" + title + '.txt'
            saveFile(content, path, fileName)

if __name__ == '__main__':
    '''
    Main function: program entry point
    '''
    # Scrape the news for the given date
    newsDate = input('Enter the date to scrape (format: 20210916): ')
    year = newsDate[0:4]
    month = newsDate[4:6]
    day = newsDate[6:8]
    download_rmrb(year, month, day, 'cqrb')
    print("Scraping finished: " + year + month + day)

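This site loads its content dynamically; it is not a static page like the People's Daily site. In other words, the data is fetched by separate requests and then rendered into the page, so the HTML that requests downloads contains only template placeholders instead of real links and titles.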
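A quick way to confirm this before rewriting a scraper is to look for unrendered placeholders in the HTML you actually downloaded. The following is only a minimal sketch: `find_placeholders` and `PLACEHOLDER_RE` are names made up for illustration, and the sample string is modeled on the page source quoted in the question rather than fetched live.

```python
import re

# Matches unrendered template placeholders such as {{jdate}} or {{ title }}.
PLACEHOLDER_RE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def find_placeholders(html):
    """Return the distinct placeholder names found in html, in first-seen order."""
    seen = []
    for name in PLACEHOLDER_RE.findall(html):
        if name not in seen:
            seen.append(name)
    return seen


# Fragment modeled on the static page source shown in the question.
sample = (
    '<dd><a href="detail.html?date={{jdate}}&id={{id}}&page={{pageNo}}" '
    'target="_blank"><i>●</i>{{title}}</a></dd>'
)

print(find_placeholders(sample))  # ['jdate', 'id', 'pageNo', 'title']
```

If the list is non-empty, the real data is probably coming from another request, which you can usually find in the Network tab of the browser's developer tools.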

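I took a look, and this site is actually a bit simpler than People's Daily: the page list and the article lists come back in a single request (navi.json), whereas People's Daily needs a separate request per page to get that page's article list.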
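To make the shape of that single response concrete, here is a rough sketch that walks a navi.json-like structure without touching the network. The field names (`pages`, `pnumber`, `pname`, `articleList`, `id`, `title`) and the URL pattern are the ones used in the code in this answer; the sample values and the helper name `iter_article_urls` are invented for illustration, and the real response may contain more fields.

```python
def iter_article_urls(navi, date):
    """Yield (page number, title, article JSON url) for every article in navi.

    navi is the parsed navi.json object; date is a 'YYYY-MM-DD' string.
    """
    base = "https://www.shobserver.com/staticsg/data/journal/"
    for page in navi["pages"]:
        page_no = str(page["pnumber"])
        for article in page["articleList"]:
            url = f"{base}{date}/{page_no}/article/{article['id']}.json"
            yield page_no, article["title"], url


# Made-up sample data that only mimics the structure of navi.json.
sample_navi = {
    "pages": [
        {
            "pnumber": "01",
            "pname": "Front page",
            "articleList": [
                {"id": 1001, "title": "First headline"},
                {"id": 1002, "title": "Second headline"},
            ],
        },
        {
            "pnumber": "02",
            "pname": "News",
            "articleList": [{"id": 2001, "title": "Third headline"}],
        },
    ]
}

results = list(iter_article_urls(sample_navi, "2021-04-19"))
for item in results:
    print(item)
```

One request gives you every page and every article id, so the only remaining requests are one per article body.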

The code at the top of this answer is a lightly adjusted version of yours; treat it as a reference only.
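One thing that may bite you when you run it: the article title goes straight into the file name, and on Windows a title containing characters such as `?`, `:` or `/` will make `open()` fail. A small helper along these lines could be applied to the title before building `fileName`. This is just a sketch; `safe_filename`, its `replacement` default and the 100-character limit are my own assumptions, not something the site or the original code requires.

```python
import re

# Characters Windows does not allow in file names, plus ASCII control characters.
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(name, replacement="_", max_length=100):
    """Return name with invalid characters replaced and the length capped."""
    cleaned = _INVALID_CHARS.sub(replacement, name)
    # Windows also rejects names that end in a space or a dot.
    cleaned = cleaned[:max_length].strip(" .")
    return cleaned or "untitled"


print(safe_filename("A/B: what?"))  # A_B_ what_
```

Usage would be something like `safe_filename(title)` in place of the bare `title` when the file name is assembled.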

This answer was accepted by the asker as the best answer.
