ccccchimon 2022-09-15 21:25 采纳率: 25%
浏览 18
已结题

python爬虫存入excel却只重复第一行信息

请问各位,python爬取网页信息的时候,print出来的都是对的,但是存入excel却全都是重复的第一行信息是怎么回事啊
driver = webdriver.Chrome()

driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument",{
  "source": """
    Object.defineProperty(navigator, 'webdriver', {
      get: () => undefined
    })
  """
})

driver.get('https://kns.cnki.net/kns8/defaultresult/index')
time.sleep(1)
driver.find_element(By.ID,"txt_search").send_keys('福建妇女')   # .send_keys()  向搜索框中输入内容
driver.find_element(By.CLASS_NAME,"search-btn").click()       # .click() 代表点击动作
time.sleep(1)
datalist = []
# html = driver.page_source
# print(html)
while True:

    trs = driver.find_elements(By.XPATH,"//table[@class='result-table-list']/tbody/tr")
    for tr in trs:

        # 序号
        order = tr.find_element(By.XPATH,"./td[@class='seq']").text
        print(order)
        datalist.append(order)
        # 题目
        title = tr.find_element(By.XPATH,"./td[@class='name']").text
        print(title)
        datalist.append(title)
        # 论文链接
        href = tr.find_element(By.XPATH,"./td[@class='name']/a").get_attribute('href')
        print(href)
        datalist.append(href)
        # 作者

        author = tr.find_element(By.XPATH,"./td[@class='author']").text
        print(author)
        datalist.append(author)


        # 来源
        source = tr.find_element(By.XPATH,"./td[@class='source']").text
        print(source)
        datalist.append(source)
        # 发表时间
        pubtime = tr.find_element(By.XPATH,"./td[@class='date']").text
        print(pubtime)
        datalist.append(pubtime)
        # 发表类别
        category = tr.find_element(By.XPATH,"./td[@class='data']").text
        print(category)
        datalist.append(category)



    # 下一页
    try:
        next = driver.find_element(By.ID,"PageNext").click()
        time.sleep(2)
        html = driver.page_source
        print(html)
    except:
        print("没有下一页了")
        break

driver.quit()


print("saving")
workbook = xlwt.Workbook(encoding="utf-8")
worksheet = workbook.add_sheet('sheet1',cell_overwrite_ok=True)
col = ("序号","题目","链接","作者","来源","发表时间","类别")
for i in range(0,7):
    worksheet.write(0,i,col[i])
for i in range(0,180):
    # print("第%d条" %(i+1))
    datalist2 = datalist[i]
    for j in range(0,7):
        worksheet.write(i+1,j,datalist2[j])
   
workbook.save("zhiwang2.xls")

img

  • 写回答

1条回答 默认 最新

  • 梦里逆天 2022-09-15 22:35
    关注
    import time
    
    import xlwt as xlwt
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    
    # driver = webdriver.Chrome()
    service = Service(r"D:\chromedriver.exe")
    driver = webdriver.Chrome(service=service)
    driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """
        Object.defineProperty(navigator, 'webdriver', {
          get: () => undefined
        })
      """
    })
    
    driver.get('https://kns.cnki.net/kns8/defaultresult/index')
    time.sleep(1)
    driver.find_element(By.ID, "txt_search").send_keys('福建妇女')  # .send_keys()  向搜索框中输入内容
    driver.find_element(By.CLASS_NAME, "search-btn").click()  # .click() 代表点击动作
    time.sleep(1)
    datalist = []
    # html = driver.page_source
    # print(html)
    while True:
        trs = driver.find_elements(By.XPATH, "//table[@class='result-table-list']/tbody/tr")
        for tr in trs:
            # 序号
            order = tr.find_element(By.XPATH, "./td[@class='seq']").text
            print(order)
            # datalist.append(order)
            # 题目
            title = tr.find_element(By.XPATH, "./td[@class='name']").text
            print(title)
            # datalist.append(title)
            # 论文链接
            href = tr.find_element(By.XPATH, "./td[@class='name']/a").get_attribute('href')
            print(href)
            # datalist.append(href)
            # 作者
            author = tr.find_element(By.XPATH, "./td[@class='author']").text
            print(author)
            # datalist.append(author)
            # 来源
            source = tr.find_element(By.XPATH, "./td[@class='source']").text
            print(source)
            # datalist.append(source)
            # 发表时间
            pubtime = tr.find_element(By.XPATH, "./td[@class='date']").text
            print(pubtime)
            # datalist.append(pubtime)
            # 发表类别
            category = tr.find_element(By.XPATH, "./td[@class='data']").text
            print(category)
            # datalist.append(category)
            datalist.append([order, title, href, author, source, pubtime, category])
        # 下一页
        try:
            driver.find_element(By.ID, "PageNext").click()
            time.sleep(2)
            html = driver.page_source
            print(html)
        except:
            print("没有下一页了")
            break
    
    driver.quit()
    
    print("saving")
    print(datalist)
    workbook = xlwt.Workbook(encoding="utf-8")
    worksheet = workbook.add_sheet('sheet1', cell_overwrite_ok=True)
    col = ("序号", "题目", "链接", "作者", "来源", "发表时间", "类别")
    for i in range(0, 7):
        worksheet.write(0, i, col[i])
    for i in range(len(datalist)):
        datalist2 = datalist[i]
        for j in range(0, 7):
            worksheet.write(i + 1, j, datalist2[j])
    
    workbook.save("zhiwang2.xls")
    
    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论

报告相同问题?

问题事件

  • 系统已结题 9月23日
  • 已采纳回答 9月15日
  • 创建了问题 9月15日

悬赏问题

  • ¥15 R语言Rstudio突然无法启动
  • ¥15 关于#matlab#的问题:提取2个图像的变量作为另外一个图像像元的移动量,计算新的位置创建新的图像并提取第二个图像的变量到新的图像
  • ¥15 改算法,照着压缩包里边,参考其他代码封装的格式 写到main函数里
  • ¥15 用windows做服务的同志有吗
  • ¥60 求一个简单的网页(标签-安全|关键词-上传)
  • ¥35 lstm时间序列共享单车预测,loss值优化,参数优化算法
  • ¥15 Python中的request,如何使用ssr节点,通过代理requests网页。本人在泰国,需要用大陆ip才能玩网页游戏,合法合规。
  • ¥100 为什么这个恒流源电路不能恒流?
  • ¥15 有偿求跨组件数据流路径图
  • ¥15 写一个方法checkPerson,入参实体类Person,出参布尔值