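Problem description and background
While using Python to write a web crawler that downloads some resources, I found that on some sites the download address (URL) of an attachment cannot be parsed out of the page.
When the attachment element is parsed, it is displayed as follows:
Relevant code (please do not paste screenshots)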
import requests
from bs4 import BeautifulSoup

# Note: `headers` and `LogUtils` come from the poster's own project and are not shown here.
# The original `def` line was not posted; the function name below is a placeholder.
def get_attachments(headers):
    url = 'http://www.ccgp.gov.cn/cggg/dfgg/qtgg/202207/t20220715_18271681.htm'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    main_info = soup.find('div', class_='vF_detail_content')
    att_info = main_info.find_all('table')
    if not att_info:  # no attachment table: return an empty list
        return []
    att_file_list = []
    for tmp in att_info:
        detail_list = tmp.find_all('td')
        for detail in detail_list:
            if not detail:
                continue
            for con in detail.contents:
                # try/except sits inside the loop so one bad child does not skip the rest
                try:
                    ip = con.attrs['href']  # 'ip' here means the download link
                    attach_file_name = con.text.strip()
                    if ip and attach_file_name:
                        att_file_list.append({'ip': ip, 'filename': attach_file_name})
                    else:
                        LogUtils.error(f"Failed to find ip:{ip}, file_name:{attach_file_name}")
                except (IndexError, KeyError):
                    continue
                except Exception as err:
                    LogUtils.notset(f"{err}")
                    continue
    return att_file_list
Output and error messages
The attributes that are finally parsed out are shown below; href is empty:
<class 'dict'>: {'class': ['bizDownload'], 'href': '', 'id': 'E8D516E2245A268E4CFCF6190FB44A', 'title': '点击下载'}
Desired result
How can I parse out the correct address, i.e. the one below?
href="//download.ccgp.gov.cn/oss/download?uuid=E8D516E2245A268E4CFCF6190FB44A"
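Comparing the parsed attributes with the desired href, the `uuid` value in the desired link is identical to the tag's `id` attribute. This suggests (but the post does not confirm) that the page fills in the href with JavaScript after loading, which `requests` does not execute. A minimal sketch under that assumption rebuilds the link from `id`; the helper name `resolve_href` and the URL template are hypothetical, inferred only from the single example above, and may not hold for other pages.

```python
from urllib.parse import urljoin

# Assumed pattern, inferred only from the one example in this post.
DOWNLOAD_TEMPLATE = "//download.ccgp.gov.cn/oss/download?uuid={uuid}"


def resolve_href(attrs, page_url):
    """Return an absolute download URL for a tag's attrs dict, or '' if none can be built."""
    href = (attrs.get("href") or "").strip()
    # Empty href on a bizDownload tag: rebuild the link from the id attribute.
    if not href and "bizDownload" in attrs.get("class", []):
        uuid = attrs.get("id", "")
        if not uuid:
            return ""
        href = DOWNLOAD_TEMPLATE.format(uuid=uuid)
    if not href:
        return ""
    # urljoin also resolves protocol-relative ("//host/path") and relative links.
    return urljoin(page_url, href)


page_url = "http://www.ccgp.gov.cn/cggg/dfgg/qtgg/202207/t20220715_18271681.htm"
attrs = {
    "class": ["bizDownload"],
    "href": "",
    "id": "E8D516E2245A268E4CFCF6190FB44A",
    "title": "点击下载",
}
print(resolve_href(attrs, page_url))
```

In the loop from the question, this would replace the direct lookup, for example `ip = resolve_href(con.attrs, url)`. If the assumed pattern turns out to be wrong for some attachments, the alternative is to render the page in a real browser engine so that the script-generated href is present before parsing.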