A #python# question: write a Python crawler that scrapes text, images, and other information and saves it to SQLite

Write a crawler in Python that scrapes text, images, and other information and saves it to SQLite. The scraped data must be organized.


白驹_过隙 (Rising Star Creator, Algorithms) 2022-06-04 10:55


import sqlite3
import re
import requests
from lxml import html

findlink = re.compile(r'<a href="(.*?)"')  # Compile a regex object that describes the pattern to match
findname = re.compile(r'<a href=".*?">(.*?)</a>')
findname2 = re.compile(r'<td style="outline: 0px !important;">(.*?)</td>')
findname3 = re.compile(
    r'<td style="outline: 0px !important;"><p style="line-height: 1.8; outline: 0px !important;">(.*?)</p></td>')
findname4 = re.compile(
    r'<td style="outline: 0px !important;"><p style="line-height: 1.8; outline: 0px !important;"><a href=".*?">(.*?)</a>.*?</p></td>')
findaddres = re.compile(r'<td style="outline: 0px !important;">(.*?)</td>')
findadress1 = re.compile(r'<td style="outline: 0px !important;"><a href=".*?">(.*?)</a></td>')

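'''
Select every tr inside the ranking table,
then parse each tr: if it has no link, the link field in data is left blank; if it has a link, the link is recorded
(this version stores the link itself and does not fetch the linked page),
then write the parsed content to the database
'''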

def main():
    basicurl = "http://www.qianmu.org/ranking/1528.htm"
    datalist = getData(basicurl)
    for data in datalist:
        print(data)
    saveDatadb(datalist,"university.db")

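# Fetch the page at the given URL and return its parsed element tree
def askURL(url):
    et = html.etree
    respon = requests.get(url, timeout=10)  # use the url argument instead of a hard-coded address
    respon.raise_for_status()  # stop early on HTTP errors
    selector = et.HTML(respon.text)
    return selector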

# Crawl the main page and extract its tr rows for analysis
def getData(basicurl):
    datalist = []
    selector = askURL(basicurl)
    # Find every tr (skipping the header row) and parse each one
    trs = selector.xpath('//div[@class="rankItem"]//tr[position()>1]')
    # names = selector.xpath('//div[@class="rankItem"]//tr[position()>1]/td/a/text() | //div[@class="rankItem"]//tr['
    #                        'position()>1]/td[2]/text()')
    # links = selector.xpath('//div[@class="rankItem"]//tr[position()>1]/td/a/@href')
    # The content of each tr is now available
    for tr in trs:
        data = []
        tr = html.tostring(tr, encoding='utf-8').decode('utf-8')
        name = re.findall(findname, tr)
        name1 = re.findall(findname2, tr)
        if len(name) == 0:
            name = name1[1]
        else:
            name = name[0]
        data.append(name)

        # Get the English name
        english_matches = re.findall(findname4, tr)
        if len(english_matches) >= 1:
            english = ''.join(english_matches[0])
        else:
            english = re.findall(findname3, tr)[1]
        data.append(english)

        if len(re.findall(findadress1, tr)) > 1:
            address = ''.join(re.findall(findadress1, tr)[1])
        else:
            address = re.findall(findaddres, tr)[3]
        data.append(address)
        link = re.findall(findlink, tr)
        # Normalize link: keep the first href, or a blank placeholder when the row has none
        if len(link) > 1:
            link = link[0]
        elif len(link) == 0:
            link = ' '
        else:
            link = ''.join(link)
        data.append(link)
        datalist.append(data)
    return datalist

# Save the data
def saveDatadb(datalist, dbpath):
    init_db(dbpath)
    conn = sqlite3.connect(dbpath)
    cur = conn.cursor()  # get a cursor
    # Use ? placeholders so quotes inside the scraped text cannot break the SQL statement
    sql = '''
        insert into university(
        name, ename, address, link)
        values (?, ?, ?, ?)'''
    for data in datalist:
        cur.execute(sql, [str(item) for item in data])
    conn.commit()  # commit once after all rows are inserted
    cur.close()
    conn.close()  # close the connection

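# Create the database
def init_db(dbpath):
    # "if not exists" lets the script run again without failing on an existing table
    sql = '''
        create table if not exists university(
        id integer primary key autoincrement,
        name text ,
        ename text ,
        address text ,
        link text
        );
    '''
    conn = sqlite3.connect(dbpath)  # open (or create) the database file
    cursor = conn.cursor()  # get a cursor
    cursor.execute(sql)  # run the SQL statement that creates the table
    conn.commit()  # commit
    conn.close()  # close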

if __name__ == "__main__":  # Call main() only when the file is run as a script, which keeps the program's entry point explicit
    main()
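
The question also asks about images, which the code above does not store. One possible approach, shown here only as a minimal sketch, is to keep the raw image bytes in a BLOB column. The `image` table and the helper names `init_image_table`, `save_image`, and `load_image` are assumptions made for illustration and are not part of the answer's code.

```python
import sqlite3


def init_image_table(conn):
    # Hypothetical table: one row per image, keyed by its source URL
    conn.execute(
        "create table if not exists image("
        "id integer primary key autoincrement, "
        "url text unique, "
        "content blob)"
    )
    conn.commit()


def save_image(conn, url, content):
    # content is the raw bytes of the image; sqlite3 stores bytes as a BLOB
    conn.execute(
        "insert or replace into image(url, content) values (?, ?)",
        (url, content),
    )
    conn.commit()


def load_image(conn, url):
    # Return the stored bytes for url, or None when nothing was saved
    row = conn.execute(
        "select content from image where url = ?", (url,)
    ).fetchone()
    return None if row is None else bytes(row[0])
```

With requests, the bytes could come from `requests.get(img_url, timeout=10).content`, where `img_url` is whatever image address the crawler extracted. Storing only the URL and saving the file to disk is a reasonable alternative when the images are large.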

This answer was selected by the asker as the best answer.
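
The question asks for the scraped data to be organized. As one possible reading of that requirement, the sketch below summarizes the `university` table by its `address` column. The helper name `count_by_address` and the three demo rows are placeholders invented for illustration; they are not real scraped data.

```python
import sqlite3


def count_by_address(conn):
    # Return (address, count) pairs, largest group first, ties sorted by address
    return conn.execute(
        "select address, count(*) as n from university "
        "group by address order by n desc, address"
    ).fetchall()


# Demo on an in-memory database with placeholder rows
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table university("
    "id integer primary key autoincrement, "
    "name text, ename text, address text, link text)"
)
conn.executemany(
    "insert into university(name, ename, address, link) values (?, ?, ?, ?)",
    [
        ("A", "University A", "X", " "),
        ("B", "University B", "Y", " "),
        ("C", "University C", "X", " "),
    ],
)
summary = count_by_address(conn)
print(summary)
```

To run the same query against the file written by the crawler, connect to `university.db` instead of `:memory:` and skip the demo inserts.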


Question events

Question closed by the system: June 13
Answer accepted: June 5
Question created: June 4
