weixin_42081994
2021-07-29 13:36
Acceptance rate: 0%
Views: 1.6k

Extracting data with a Python crawler

I have Excel workbooks of the China Educational Statistics Yearbook for a number of years, and I want to extract the data I need from them (values for particular provinces, years and indicators) and enter those values into my own summary Excel sheet. Copying and pasting them over and over is tedious. Could someone teach me how to do this? I know nothing about programming.


19 answers (sorted by: default | newest)

  • 鸡蛋酱$ 2021-07-29 13:48

    Use Python's xlrd and xlwt for this: extract the data you need into a list (or a text file) first, then write it into a new Excel workbook. Message me and I can teach you or do it for you.
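
    For example, here is a minimal sketch of that approach. The file name, sheet index and column positions below are only placeholders, so adjust them to your actual yearbook files; note that recent xlrd versions only read .xls files, not .xlsx.

    import xlrd
    import xlwt

    # Open the source workbook (placeholder file name) and take its first sheet
    book = xlrd.open_workbook('yearbook_2019.xls')
    sheet = book.sheet_by_index(0)

    # Collect (province, value) pairs into a list, skipping the header row;
    # column 0 = province, column 2 = the indicator you need (adjust as required)
    rows = []
    for r in range(1, sheet.nrows):
        rows.append((sheet.cell_value(r, 0), sheet.cell_value(r, 2)))

    # Write the collected pairs into a new summary workbook
    out = xlwt.Workbook()
    ws = out.add_sheet('summary')
    ws.write(0, 0, 'Province')
    ws.write(0, 1, 'Indicator')
    for i, (province, value) in enumerate(rows, start=1):
        ws.write(i, 0, province)
        ws.write(i, 1, value)
    out.save('summary.xls')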

  • CSDN专家-HGJ 2021-07-29 13:40

    You can use pandas and openpyxl to extract and process the data: read the source sheets into DataFrames, then clean and aggregate them and write the result into a new Excel file.
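
    As a minimal sketch of this idea (the file name and the column names 'Province' and 'Indicator' are only placeholders; pandas needs openpyxl installed to read .xlsx files):

    import pandas as pd

    # Read one yearbook sheet into a DataFrame (placeholder file, first sheet)
    df = pd.read_excel('yearbook_2019.xlsx', sheet_name=0)

    # Keep only the provinces and the indicator column you care about
    wanted = df[df['Province'].isin(['Beijing', 'Shanghai'])][['Province', 'Indicator']]

    # Write the cleaned result into the summary workbook
    wanted.to_excel('summary.xlsx', index=False)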

  • 风华绝代飞入鬓 2021-07-29 18:34

    Your problem is analogous to crawling CNNVD, a site with no anti-crawler protection:

    http://cnnvd.org.cn/web/vulnerability/queryLds.tag
    The starting page for all vulnerability records on the CNNVD site is http://cnnvd.org.cn/web/vulnerability/queryLds.tag . Its main content is the hyperlink target URLs of the ten most recent vulnerability entries; the URLs of all other listing pages can be found at the bottom, and each of those pages again contains ten vulnerability-entry links. Following one of these links leads to the vulnerability detail page, which is organised into six sections:
    The first section is the vulnerability detail block, marked in the HTML source by div class="detail_xq w770". It contains the vulnerability name, CNNVD number, severity level, international CVE number, vulnerability type, publication date, threat type, update date, vendor and vulnerability source; apart from the publication and update dates, which are dates, all of these are strings;
    The second section is the vulnerability description, marked by div class="d_ldjj"; it has two lines of description, stored as long strings;
    The third section is the vulnerability advisory, marked by div class="d_ldjj m_t_20"; it has two lines of advisory text, stored as long strings;
    The fourth section is the reference URLs, marked by div class="d_ldjj m_t_20"; it holds links provided by several different sources, stored as long strings;
    The fifth section is the affected entities, marked by div class="d_ldjj m_t_20", stored as long strings; for most vulnerabilities this field just says "none available";
    The sixth section is the patch links, marked by div class="d_ldjj m_t_20", stored as long strings.

    The whole scraping approach is designed around the page layout and HTML source. First, the list of vulnerability detail links is collected from the front page and analysed; the detail links on one page are extracted, then the crawler moves on to the next page's URL, looping until all the detail links the user asked for have been gathered. Once the URLs are collected, each one is fetched and its vulnerability details are parsed, and finally the records are stored as a data set.
    [screenshots omitted: crawler output, and the crawled data written to an Excel file]

    import urllib.request
    from urllib import parse
    from bs4 import BeautifulSoup
    import http.cookiejar
    import xlwt
    import zlib
    import re
    import time
    import xlsxwriter
    import sys
    import datetime
    import pymysql
    
    '''
    Usage:
    python vulnerabilities_crawler 2017-10-01 2017-10-31 178
    The first argument is the start date, the second the end date, and the third the total number of pages.
    '''
    
    
    # Fetch the list of vulnerability detail links
    def vulnerabilities_url_list(url, start_time, end_time):
        header = {
            'User-Agent': 'Mozilla/5.0 (Linux; Android 4.1.2; Nexus 7 Build/JZ054K) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19',
            'Accept-Encoding': 'gzip, deflate',
            'Referer': 'http://cnnvd.org.cn/web/vulnerability/queryLds.tag'
        }  # request headers captured from the target page (via browser dev tools, F12)
        data = {
            'qstartdate': '2017-10-30',  # start date (overwritten below)
            'qenddate': '2017-10-31'  # end date (overwritten below)
        }
        data['qstartdate'] = start_time
        data['qenddate'] = end_time
        data = parse.urlencode(data).encode('utf-8')
        vulnerabilities_url_html = urllib.request.Request(url, headers=header, data=data)
        vulnerabilities_url_cookie = http.cookiejar.CookieJar()
        vulnerabilities_url_opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(vulnerabilities_url_cookie))
        vulnerabilities_url_html = vulnerabilities_url_opener.open(vulnerabilities_url_html)
        vulnerabilities_url_html = zlib.decompress(vulnerabilities_url_html.read(), 16 + zlib.MAX_WBITS)
        vulnerabilities_url_html = vulnerabilities_url_html.decode()
    
        # Extract the vulnerability detail links
        response = r'href="(.+?)" target="_blank" class="a_title2"'
        vulnerabilities_link_list = re.compile(response).findall(vulnerabilities_url_html)
    
        # Prepend the http host prefix
        i = 0
        for link in vulnerabilities_link_list:
            vulnerabilities_lists.append('http://cnnvd.org.cn' + vulnerabilities_link_list[i])
            i += 1
            print("Crawled vulnerability link %d" % i)
            time.sleep(0.2)
    
    
    # Scrape the details of a single vulnerability
    def vulnerabilities_data(url):
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0',
            'Accept-Encoding': 'gzip, deflate, sdch',
        }
        vulnerabilities_data_html = urllib.request.Request(url, headers=header)
        vulnerabilities_data_cookie = http.cookiejar.CookieJar()
        vulnerabilities_data_opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(vulnerabilities_data_cookie))
        vulnerabilities_data_html = vulnerabilities_data_opener.open(vulnerabilities_data_html)
        vulnerabilities_data_html = zlib.decompress(vulnerabilities_data_html.read(), 16 + zlib.MAX_WBITS)
        vulnerabilities_data_html = vulnerabilities_data_html.decode()
    
        global vulnerabilities_result_list
        vulnerabilities_result_list = []  # list of fields scraped for this vulnerability
    
        # Vulnerability detail section
        vulnerabilities_detainled_soup1 = BeautifulSoup(vulnerabilities_data_html, 'html.parser')
        vulnerabilities_detainled_data = vulnerabilities_detainled_soup1.find('div', attrs={
            'class': 'detail_xq w770'})  # soup for the vulnerability-detail block
        vulnerabilities_detainled_data = vulnerabilities_detainled_data.decode()
        vulnerabilities_detainled_soup = BeautifulSoup(vulnerabilities_detainled_data, 'html.parser')  # parse the block again
    
        vulnerabilities_detainled_data_list = vulnerabilities_detainled_soup.find_all('li')  # all <li> entries in the detail block
    
        try:
            vulnerabilities_name = vulnerabilities_detainled_soup.h2.string  # vulnerability name
        except:
            vulnerabilities_name = ''
        vulnerabilities_result_list.append(vulnerabilities_name)
    
        try:
            vulnerabilities_cnnvd_num = vulnerabilities_detainled_soup.span.string  # CNNVD number
            vulnerabilities_cnnvd_num = re.findall(r"\:([\s\S]*)", vulnerabilities_cnnvd_num)[0]
        except:
            vulnerabilities_cnnvd_num = ''
        vulnerabilities_result_list.append(vulnerabilities_cnnvd_num)
    
        try:  # severity level
            vulnerabilities_rank = vulnerabilities_detainled_soup.a.decode()
            vulnerabilities_rank = re.search(u'([\u4e00-\u9fa5]+)', vulnerabilities_rank).group(0)
        except:
            vulnerabilities_rank = ''
        vulnerabilities_result_list.append(vulnerabilities_rank)
    
        vulnerabilities_cve_html = vulnerabilities_detainled_data_list[2].decode()  # CVE number
        vulnerabilities_cve_soup = BeautifulSoup(vulnerabilities_cve_html, 'html.parser')
        try:
            vulnerabilities_cve = vulnerabilities_cve_soup.a.string
            vulnerabilities_cve = vulnerabilities_cve.replace("\r", "").replace("\t", "").replace("\n", "").replace(" ", "")
        except:
            vulnerabilities_cve = ''
        vulnerabilities_result_list.append(vulnerabilities_cve)
    
        vulnerabilities_type_html = vulnerabilities_detainled_data_list[3].decode()  # vulnerability type
        vulnerabilities_type_soup = BeautifulSoup(vulnerabilities_type_html, 'html.parser')
        try:
            vulnerabilities_type = vulnerabilities_type_soup.a.string
            vulnerabilities_type = vulnerabilities_type.replace("\r", "").replace("\t", "").replace("\n", "").replace(" ",
                                                                                                                      "")
        except:
            vulnerabilities_type = ''
        vulnerabilities_result_list.append(vulnerabilities_type)
    
        vulnerabilities_time_html = vulnerabilities_detainled_data_list[4].decode()  # publication date
        vulnerabilities_time_soup = BeautifulSoup(vulnerabilities_time_html, 'html.parser')
        try:
            vulnerabilities_time = vulnerabilities_time_soup.a.string
            vulnerabilities_time = vulnerabilities_time.replace("\r", "").replace("\t", "").replace("\n", "")
        except:
            vulnerabilities_time = ''
        vulnerabilities_result_list.append(vulnerabilities_time)
    
        vulnerabilities_attack_html = vulnerabilities_detainled_data_list[5].decode()  # threat type
        vulnerabilities_attack_soup = BeautifulSoup(vulnerabilities_attack_html, 'html.parser')
        try:
            vulnerabilities_attack = vulnerabilities_attack_soup.a.string
            vulnerabilities_attack = vulnerabilities_attack.replace("\r", "").replace("\t", "").replace("\n", "")
        except:
            vulnerabilities_attack = ''
        vulnerabilities_result_list.append(vulnerabilities_attack)
    
        vulnerabilities_update_html = vulnerabilities_detainled_data_list[6].decode()  # update date
        vulnerabilities_update_soup = BeautifulSoup(vulnerabilities_update_html, 'html.parser')
        try:
            vulnerabilities_update = vulnerabilities_update_soup.a.string
            vulnerabilities_update = vulnerabilities_update.replace("\r", "").replace("\t", "").replace("\n", "")
        except:
            vulnerabilities_update = ''
        vulnerabilities_result_list.append(vulnerabilities_update)
    
        vulnerabilities_firm_html = vulnerabilities_detainled_data_list[7].decode()  # vendor
        vulnerabilities_firm_soup = BeautifulSoup(vulnerabilities_firm_html, 'html.parser')
        try:
            vulnerabilities_firm = vulnerabilities_firm_soup.a.string
            vulnerabilities_firm = vulnerabilities_firm.replace("\r", "").replace("\t", "").replace("\n", "")
        except:
            vulnerabilities_firm = ''
        vulnerabilities_result_list.append(vulnerabilities_firm)
    
        vulnerabilities_source_html = vulnerabilities_detainled_data_list[8].decode()  # vulnerability source
        vulnerabilities_source_soup = BeautifulSoup(vulnerabilities_source_html, 'html.parser')
        try:
            vulnerabilities_source = vulnerabilities_source_soup.a.string
            vulnerabilities_source = vulnerabilities_source.replace("\r", "").replace("\t", "").replace("\n", "")
        except:
            vulnerabilities_source = ''
        vulnerabilities_result_list.append(vulnerabilities_source)
    
        # Vulnerability description section
        vulnerabilities_title_html = vulnerabilities_detainled_soup1.find('div', attrs={'class': 'd_ldjj'})  # soup for the description block
        vulnerabilities_title_html = vulnerabilities_title_html.decode()
        vulnerabilities_title_soup2 = BeautifulSoup(vulnerabilities_title_html, 'html.parser')
    
        try:
            vulnerabilities_titles1 = vulnerabilities_title_soup2.find_all(name='p')[0].string
            vulnerabilities_titles2 = vulnerabilities_title_soup2.find_all(name='p')[1].string
            vulnerabilities_titles = vulnerabilities_titles1 + vulnerabilities_titles2
            vulnerabilities_titles = vulnerabilities_titles.replace(' ', '').replace('\t', '').replace('\r', '').replace(
                '\n', '')
        except:
            vulnerabilities_titles = ''
        vulnerabilities_result_list.append(vulnerabilities_titles)
    
        # Vulnerability advisory
        vulnerabilities_notice_html = vulnerabilities_detainled_soup1.find('div', attrs={
            'class': 'd_ldjj m_t_20'})  # soup for the advisory block
        vulnerabilities_notice_html = vulnerabilities_notice_html.decode()
        vulnerabilities_notice_soup2 = BeautifulSoup(vulnerabilities_notice_html, 'html.parser')
    
        try:
            vulnerabilities_notice1 = vulnerabilities_notice_soup2.find_all(name='p')[0].string
            vulnerabilities_notice2 = vulnerabilities_notice_soup2.find_all(name='p')[1].string
            vulnerabilities_notice = vulnerabilities_notice1 + vulnerabilities_notice2
            vulnerabilities_notice = vulnerabilities_notice.replace('\n', '').replace('\r', '').replace('\t', '')
        except:
            vulnerabilities_notice = ''
        vulnerabilities_result_list.append(vulnerabilities_notice)
    
        # Reference URLs
        vulnerabilities_reference_html = vulnerabilities_detainled_soup1.find_all('div', attrs={'class': 'd_ldjj m_t_20'})[
            1]  # soup for the reference-URL block
        vulnerabilities_reference_html = vulnerabilities_reference_html.decode()
        vulnerabilities_reference_soup2 = BeautifulSoup(vulnerabilities_reference_html, 'html.parser')
    
        try:
            vulnerabilities_reference = vulnerabilities_reference_soup2.find_all(name='p')[1].string
            vulnerabilities_reference = vulnerabilities_reference.replace('\n', '').replace('\r', '').replace('\t',
                                                                                                              '').replace(
                '链接:', '')
        except:
            vulnerabilities_reference = ''
        vulnerabilities_result_list.append(vulnerabilities_reference)
    
        # Affected entities
        vulnerabilities_effect_html = vulnerabilities_detainled_soup1.find_all('div', attrs={'class': 'd_ldjj m_t_20'})[
            2]  # soup for the affected-entity block
        vulnerabilities_effect_html = vulnerabilities_effect_html.decode()
        vulnerabilities_effect_soup2 = BeautifulSoup(vulnerabilities_effect_html, 'html.parser')
        try:
            vulnerabilities_effect = vulnerabilities_effect_soup2.find_all(name='p')[0].string
            vulnerabilities_effect = vulnerabilities_effect.replace('\n', '').replace('\r', '').replace('\t', '').replace(
                ' ', '')
        except:
            try:
                vulnerabilities_effect = vulnerabilities_effect_soup2.find_all(name='a')[0].string
                vulnerabilities_effect = vulnerabilities_effect.replace('\n', '').replace('\r', '').replace('\t',
                                                                                                            '').replace(' ',
                                                                                                                        '')
            except:
                vulnerabilities_effect = ''
        vulnerabilities_result_list.append(vulnerabilities_effect)
    
        # Patches
        vulnerabilities_patch_html = vulnerabilities_detainled_soup1.find_all('div', attrs={'class': 'd_ldjj m_t_20'})[
            3]  # soup for the patch block
        vulnerabilities_patch_html = vulnerabilities_patch_html.decode()
        vulnerabilities_patch_soup2 = BeautifulSoup(vulnerabilities_patch_html, 'html.parser')
    
        try:
            vulnerabilities_patch = vulnerabilities_patch_soup2.find_all(name='p')[0].string
            vulnerabilities_patch = vulnerabilities_patch.replace('\n', '').replace('\r', '').replace('\t', '').replace(' ',
                                                                                                                        '')
        except:
            vulnerabilities_patch = ''
        vulnerabilities_result_list.append(vulnerabilities_patch)
    
    
    # Write the vulnerability records to an Excel file
    def vulnerabilities_excel(excel):
        workbook = xlsxwriter.Workbook('vulnerabilities_data.xlsx')
        worksheet = workbook.add_worksheet()
    
        row = 0
        col = 0
        worksheet.write(row, 0, '漏洞名称')
        worksheet.write(row, 1, 'CNNVD编号')
        worksheet.write(row, 2, '危害等级')
        worksheet.write(row, 3, 'CVE编号')
        worksheet.write(row, 4, '漏洞类型')
        worksheet.write(row, 5, '发布时间')
        worksheet.write(row, 6, '攻击途径')
        worksheet.write(row, 7, '更新时间')
        worksheet.write(row, 8, '厂商')
        worksheet.write(row, 9, '漏洞来源')
        worksheet.write(row, 10, '漏洞描述')
        worksheet.write(row, 11, '解决方案')
        worksheet.write(row, 12, '参考链接')
        worksheet.write(row, 13, '受影响实体')
        worksheet.write(row, 14, '补丁')
    
        row = 1
        for i in range(len(excel)):
            worksheet.write(row, col, excel[i][0])
            worksheet.write(row, col + 1, excel[i][1])
            worksheet.write(row, col + 2, excel[i][2])
            worksheet.write(row, col + 3, excel[i][3])
            worksheet.write(row, col + 4, excel[i][4])
            worksheet.write(row, col + 5, excel[i][5])
            worksheet.write(row, col + 6, excel[i][6])
            worksheet.write(row, col + 7, excel[i][7])
            worksheet.write(row, col + 8, excel[i][8])
            worksheet.write(row, col + 9, excel[i][9])
            worksheet.write(row, col + 10, excel[i][10])
            worksheet.write(row, col + 11, excel[i][11])
            worksheet.write(row, col + 12, excel[i][12])
            worksheet.write(row, col + 13, excel[i][13])
            worksheet.write(row, col + 14, excel[i][14])
            row += 1
        workbook.close()
    
    
    # Main routine: collect detail links, scrape each vulnerability, and write the Excel file
    def main():
        # Collect the vulnerability detail links page by page
        begin = datetime.datetime.now()
        global vulnerabilities_lists
        vulnerabilities_lists = []
        j = 1
        page_count = sys.argv[3]
        page_count = int(page_count)
        start_time = sys.argv[1]
        end_time = sys.argv[2]
        while j <= page_count:
            try:
                vulnerabilities_url = 'http://cnnvd.org.cn/web/vulnerability/queryLds.tag?pageno=%d&repairLd=' % j
                vulnerabilities_url_list(vulnerabilities_url, start_time, end_time)
                print("Crawled page %d" % j)
                print('\n')
                time.sleep(2)
                j += 1
            except:
                print('Crawl failed, retrying in 5 seconds.')
                time.sleep(5)
    
        # Scrape the details for each collected link
        vulnerabilities_result_lists = []
        a = 0
        while a < len(vulnerabilities_lists):
            try:
                vulnerabilities_data(vulnerabilities_lists[a])
                vulnerabilities_result_lists.append(vulnerabilities_result_list)
                a += 1
                print("Crawled vulnerability %d" % a)
                time.sleep(1)
            except:
                print('Crawl failed, retrying in 5 seconds.')
                time.sleep(5)
    
        # Write the results to Excel
        vulnerabilities_excel(vulnerabilities_result_lists)
    
        # Optionally write the results to MySQL (vulnerabilities_mysql is not included here)
        # vulnerabilities_mysql(vulnerabilities_result_lists)
    
        # Finished
        end = datetime.datetime.now()
        total_time = end - begin
        print('Vulnerability crawl finished')
        print('Expected number of vulnerabilities: ', len(vulnerabilities_lists))
        print('Elapsed time: ', total_time)
    
    
    if __name__ == '__main__':
        main()
    
    
  • 传说啊哈哈哈 2021-07-29 15:09

    Send me the detailed requirements and I can walk you through it step by step or do it for you.

  • 我辈李想 2021-07-29 16:28

    Four Python libraries are enough: xlrd, xlwt, os and pandas. os is used to walk the folders and get the paths of the Excel files, pandas processes the data (much like working in Excel itself), and xlrd/xlwt read and write the Excel files so you can save the result in the final format you need. On top of that there is PyInstaller, which can package the script into an .exe so it can be run on any computer.

  • alpha ghost 2021-07-31 09:13

    This shouldn't need a crawler at all; personally I think plain file operations on the Excel files will solve it.

  • 咕泡-三木 2021-07-29 14:16

    Both Python and VBA can do this easily; message me if you need help.

  • 风华绝代飞入鬓 2021-07-29 19:10

    Your problem isn't web scraping, it's Excel data processing.

  • Rwpsyo 2021-07-29 19:24

    Power Query can handle this easily; you don't necessarily have to use Python.

  • Besproma_ 2021-07-30 01:46

    Feel free to message me; I'll help you understand every line of code.

  • concyclics 2021-07-30 16:01

    You could send me one of the spreadsheets first so I can see the general format, and then I'll write it.

  • BoYang_Zz 2021-07-30 17:18

    No other programming is needed; Excel alone can do it. Give me the table header and the format you need and I can write it for you; then you just paste your data in and the result comes out. You can also do it yourself by searching online for common Excel functions; there will be something you can use.

  • QQ_572747232 2021-07-30 19:53

    This isn't web scraping, it's data processing; I'd suggest the OP change the title.


    There are two cases to handle:
    1. All the spreadsheets share the same layout: easy, just iterate over every spreadsheet and extract the data automatically. Python has plenty of libraries for this, such as xlrd.
    2. The spreadsheets have different layouts: that is trickier, and Python may not manage it fully automatically, because each layout has to be matched individually, which is painful to write.
    There are plenty of articles by data-processing experts on CSDN about xlrd and similar libraries; since this isn't my strong suit I'll leave it there. Hope this helps.

  • renminzdb 2021-07-31 21:13

    I happened to write an Excel-processing script recently; want me to take a look at yours?

  • 孙叫兽 2021-08-01 00:26

    Build an Excel pivot table.

  • qq_25295819 2021-08-03 09:31

    If you know Excel, use Excel; otherwise use Python's pandas. pandas is a great tool for pulling exactly the data you want out of a pile of data.

  • 时光的藏宝屋 2021-08-03 10:08

    For the OP, the best way to solve this really is Excel itself. Two workable suggestions:

    • First, solve it with Excel itself.
      There are three ways to do this in Excel: VBA, workbook merging, and cross-sheet lookups with functions. I don't use VBA much, so I won't go into it. A. With the merge feature: put all the workbooks to be combined in one folder, create a new summary workbook, use the merge feature to set the merge fields and formats and combine the sheets, and finally filter the combined data. B. With functions and cross-sheet lookups: first collect the file names of the workbooks (this can be done in batch from the command line, e.g. dir /a-d /b /s > a.txt), then use those file names in lookup formulas that pull the values you want from the source workbooks.

    • Second, use Python.
      If you know Python, even better, since it is quick and convenient: use pandas' read_excel() inside a for loop to read the data from each workbook, combine the resulting frames with pandas' concat(), and finally write the merged result to a local file with to_excel(); a minimal sketch follows below.
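
      As a rough illustration of that workflow (the folder and file names are placeholders, and all yearly files are assumed to share the same column layout):

      import glob
      import pandas as pd

      frames = []
      # Read every yearbook workbook in the folder into a DataFrame
      for path in glob.glob('yearbooks/*.xlsx'):
          df = pd.read_excel(path, sheet_name=0)   # assumes the data sits on the first sheet
          df['source_file'] = path                 # remember which year/file each row came from
          frames.append(df)

      # Stack all years into one table and write the combined result to disk
      merged = pd.concat(frames, ignore_index=True)
      merged.to_excel('combined.xlsx', index=False)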

  • Rolei_zl 2021-08-04 21:42

    In Excel: record a macro, then wrap it in a loop with some VBA code...

  • 程序员杰森 2021-08-05 13:15

    Excel by itself can already do what the OP needs.

