Python web scraping question about BeautifulSoup, sincerely asking for help, urgent

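The goal is to scrape flight information from the page "Beijing to Shanghai flight booking - Tongcheng (同程) flight booking" (ly.com). The plan is to first use the find_all() function from the BeautifulSoup library to get every <div class="flight-item-head data-v-13439d30"> tag, and then use regular expressions to pull out the finer details. However, after the first step, find_all() does not return all of the matching div tags that the page shows, and I would like to know why.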

import re
import urllib.error
import urllib.request
from tkinter import *

from bs4 import BeautifulSoup


def checkChinese(InPut):
    # True only if every character has a code point above 255 (i.e. none is ASCII/Latin-1)
    return all(ord(ch) > 255 for ch in InPut)


def checkDigit(InPut):
    return InPut.isdigit()


def askForUrl(url):
    head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/80.0.3987.122 Safari/537.36'}
    request = urllib.request.Request(url, headers=head)
    html = ''
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode('utf-8')
    except urllib.error.URLError as e:
        if hasattr(e, 'code'):
            print(e.code)
        if hasattr(e, 'reason'):
            print(e.reason)
    return html


def setUrl(baseurl, locDict):
    depLocation = str(e1.get())
    ariLocation = str(e2.get())
    depLocation = str(locDict[depLocation])[2:5]
    ariLocation = str(locDict[ariLocation])[2:5]
    ariYear = str(e3.get())
    ariMonth = str(e4.get())
    ariDay = str(e5.get())
    url = baseurl + depLocation + '-' + ariLocation + '?' + 'date=' + ariYear + '-' + ariMonth + '-' + ariDay
    return url


def searchForBaseInfo(baseurl1, locDict1, flightNameList1, depTime1, ariTime1, depAirport1, ariAirport1, ifAddDays1):
    url1 = setUrl(baseurl1, locDict1)
    # print(url1)
    html = askForUrl(url1)
    soup = BeautifulSoup(html, 'lxml')
    # The call in question: it returns fewer <div> tags than the page shows in the browser
    for item in soup.find_all('div', class_='flight-item'):
        item = str(item)
        print(item)
        # findFlightName, findDepTime, etc. are regex patterns defined elsewhere (not shown in this post)
        flightNameList1.append(re.findall(findFlightName, item)[0])
        tempTime = re.findall(findDepTime, item)[0][0] + ':' + re.findall(findDepTime, item)[0][1]
        depTime1.append(tempTime)
        tempTime = re.findall(findAriTime, item)[0][0] + ':' + re.findall(findAriTime, item)[0][1]
        ariTime1.append(tempTime)
        depAirport1.append(re.findall(findDepAirport, item)[0])
        ariAirport1.append(re.findall(findAriAirport, item)[0])
        tempAddDays = re.findall(findIfAddDays, item)
        if len(tempAddDays) == 0:
            ifAddDays1.append('n')
        else:
            ifAddDays1.append(tempAddDays[0])

木三136 2021-04-26 14:31
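Before scraping, first make sure that the page you download already contains all of the data, i.e. that none of it is loaded dynamically.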
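One rough way to check this, sketched here with only the standard library: count how many div tags carrying the target class appear in the HTML string the server actually returned, and compare that with what the browser's developer tools show. The helper name count_divs_with_class and the sample HTML (including the flight numbers) are made up for illustration; in the question's code, the string to check would be whatever askForUrl() returns.

```python
from html.parser import HTMLParser


class _DivClassCounter(HTMLParser):
    """Counts <div> start tags whose class attribute contains a given token."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        # The class attribute is a space-separated list of tokens; it may be absent.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_divs_with_class(html, class_name):
    """Return the number of <div> tags in html that carry class_name."""
    parser = _DivClassCounter(class_name)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical sample standing in for the downloaded page source.
sample_html = """
<div id="app">
  <div class="flight-item"><div class="flight-item-head">MU5100</div></div>
  <div class="flight-item"><div class="flight-item-head">CA1501</div></div>
</div>
"""
print(count_divs_with_class(sample_html, "flight-item"))
```

If the count for the downloaded HTML is lower than what the browser shows, the missing items are probably filled in by JavaScript after the page loads, and no change to the find_all() call will recover them.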

If some of the data is loaded dynamically, you also need to fetch the JSON data that the page requests.
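If the flights do come from a separate request, the usual approach is to find that request in the browser's Network panel and call it directly. The sketch below only assumes what that might look like: the real endpoint and response format of the site are not shown in this thread, so the payload keys (flights, flightNo, depTime), the sample values, and the helper names fetch_json and extract_flights are all invented for illustration.

```python
import json
import urllib.request


def fetch_json(url, headers=None):
    """Download a URL and decode the body as JSON (url is whatever the Network panel shows)."""
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_flights(payload):
    """Pull (flight number, departure time) pairs out of a decoded payload.

    The key names are assumptions; adjust them to the real response.
    """
    return [(f.get("flightNo"), f.get("depTime")) for f in payload.get("flights", [])]


# Hypothetical payload standing in for a real response.
sample_payload = json.loads('{"flights": [{"flightNo": "XX0001", "depTime": "08:00"}]}')
print(extract_flights(sample_payload))
```

Working from the JSON directly also removes the need for the regular expressions in the question, since the fields arrive already separated.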

This answer was accepted by the asker as the best answer.

Question tagged python, posted 2021-04-26 13:53.