方知有 2019-07-15 15:12 采纳率: 0%
浏览 848

Scraper fails to write part of its data to Excel

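I have been learning web scraping recently and based this on an expert's code, but the output has no header row, and the third scraped field is never inserted into the Excel file.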

import requests
from lxml import etree
from openpyxl import Workbook
import random

class tengxun():
    def __int__(self):
        self.url = 'https://ke.qq.com/course/list?mt=1001&page={}'
        self.header = {
            "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0",
            "Connection": "keep - alive",
        }
        self.wb = Workbook()
        self.ws = self.wb.active
        self.ws.append(['title', 'link', 'now_reader'])

    def geturl(self):
        self.url = 'https://ke.qq.com/course/list?mt=1001&page={}'
        url = [self.url.format(i) for i in range(1,5)]
        return url

    def prase_url(self,url):
        self.header = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0",
            "Connection": "keep - alive",
        }
        response = requests.get(url, headers=self.header, timeout=5)
        return response.content.decode('gbk', 'ignore')

    def get_list(self,html_str):
        html = etree.HTML(html_str)
        connect_list = []
        lists = html.xpath("//li[@class ='course-card-item']")
        for card in lists:
            item = {}
            item['title'] = ''.join(card.xpath("./h4/a[@class = 'item-tt-link']/text()"))
            item['link'] = ''.join(card.xpath("./a[@class = 'item-img-link']/@href"))
            item['now_reader'] = ''.join(card.xpath("./div[@class = 'item-line item-line--moddle']/span[@class='line-cell item-user']/text()"))
            connect_list.append(item)
        return connect_list

    def save_list(self, connects):
        self.wb = Workbook()

        self.ws = self.wb.active

        for connect in connects:
            self.ws.append([connect['title'], connect['link'], connect['now_reader']])
        print('Page of job listings saved successfully')

    def run(self):
        url_list = self.geturl()
        for url in url_list:
            html_url = self.prase_url(url)
            connects = self.get_list(html_url)
            self.save_list(connects)
        self.wb.save(r'C:\Users\Administrator\Desktop\resource\UA_ls\demo_09 try.xlsx')

if __name__=='__main__':
    spider = tengxun()
    spider.run()
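
On the missing header row: in the posted code the constructor is spelled `__int__`, which Python does not call when `tengxun()` is created, so the `append` of the header never runs. `save_list` also builds a fresh `Workbook()` on every call, so each page replaces the previous one and only the last page reaches the saved file. Below is a minimal sketch of one way to restructure the saving logic, not the original author's method. `CourseSheet`, `add_page`, and the sample rows are hypothetical names and data introduced for illustration; the `openpyxl` calls are the same ones the post already uses.

```python
class CourseSheet:
    """Collects rows from every page, then writes them to a single workbook."""

    HEADER = ['title', 'link', 'now_reader']

    def __init__(self):  # spelled __init__, so Python runs it on construction
        # The header row is added exactly once, before any data rows.
        self.rows = [list(self.HEADER)]

    def add_page(self, items):
        # Append to the existing rows instead of starting a new workbook per page.
        for item in items:
            self.rows.append([item.get(key, '') for key in self.HEADER])

    def save(self, path):
        # openpyxl is imported here so the row logic can be tried without it.
        from openpyxl import Workbook
        wb = Workbook()
        ws = wb.active
        for row in self.rows:
            ws.append(row)
        wb.save(path)


# Two made-up pages of results, for illustration only.
sheet = CourseSheet()
sheet.add_page([{'title': 'A', 'link': '/a', 'now_reader': '10'}])
sheet.add_page([{'title': 'B', 'link': '/b', 'now_reader': '20'}])
```

In the original `run` loop, this would correspond to calling `add_page(connects)` for each page and `save(...)` once after the loop.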

2 answers

  • 小黑LLB 2019-07-15 18:20

    item-line item-line--moddle => item-line item-line--middle
    With that fix the data is still incomplete, though, and I am out of ideas too.
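
Why the class-name correction above matters: an XPath test such as `@class='...'` compares the whole attribute string, so a misspelled class matches nothing, and `''.join([])` quietly yields an empty string instead of raising an error. The sketch below illustrates this with the standard library's `xml.etree.ElementTree` as a stand-in for `lxml` (it supports only a subset of XPath, so the text is read from `.text` rather than `text()`). The markup, the `reader_text` helper, and the sample value are hypothetical; the real page structure is not shown in the post.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for one course card.
SNIPPET = """
<li class="course-card-item">
  <div class="item-line item-line--middle">
    <span class="line-cell item-user">1234 enrolled</span>
  </div>
</li>
""".strip()


def reader_text(li, div_class):
    # Exact-match predicates: the class string must be identical to match.
    path = "./div[@class='{}']/span[@class='line-cell item-user']".format(div_class)
    return ''.join(span.text or '' for span in li.findall(path))


li = ET.fromstring(SNIPPET)
misspelled = reader_text(li, 'item-line item-line--moddle')  # no match
corrected = reader_text(li, 'item-line item-line--middle')   # matches the span
```

Checking for an empty result after each lookup (for example, printing a warning when a field comes back as `''`) is one way to catch selector typos like this early.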
