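Solved: the data-original attribute is only a URL, so the image content has to be fetched with a second request, requests.get(data).content. An image is binary data, so the file should be opened in wb mode when writing it.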
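For anyone hitting the same problem, here is a minimal sketch of that fix. The helper name save_image, its get parameter (only there so the function can be tried without network access) and the 10-second timeout are my own assumptions, not part of the original script.

```python
import requests


def save_image(data, filename, get=requests.get):
    """Download the image at URL `data` and write its raw bytes to `filename`.

    `get` defaults to requests.get; it is a hypothetical hook for testing.
    """
    res = get(data, timeout=10)
    # Stop on a 4xx/5xx status instead of saving an error page as an image
    res.raise_for_status()
    # res.content is bytes, so the file has to be opened in binary mode
    with open(filename, 'wb') as fp:
        fp.write(res.content)
    return len(res.content)
```

In the script from the question, the text-mode open(filename,'w') block that writes data would be replaced by a call such as save_image(data, filename).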
Python crawler downloads images from Doutula (斗图啦), but the downloaded images show an error when opened
import os
import re

import requests
from bs4 import BeautifulSoup


def get_url(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
        'Referer': url,
    }
    res = requests.get(url, headers=headers)
    text = res.text
    soup = BeautifulSoup(text, 'lxml')
    divs = soup.find('div', class_='page-content text-center')
    a_s = divs.find_all('a', attrs={'class': 'col-xs-6 col-sm-3'})
    for a in a_s:
        # print(a)
        href = a['href']
        img = a.find('img')
        print(img)
        # The innermost tag can be reached as an attribute: a.img
        if a.img['class'] == ['gif']:
            pass
        else:
            alt = a.img['alt']
            # Strip punctuation that should not end up in the file name
            alt = re.sub(r'[,@??!!:。]', '', alt)
            # print(alt)
            data = a.img['data-original']
            print(data)
            datastr = '.' + data.split('.')[-1]
            filename = alt + datastr
            # print(filename)
            # print(os.getcwd())
            if os.path.exists(os.path.join(os.getcwd(), '斗图啦', filename)):
                print('File already exists')
            else:
                filename = os.path.join(os.getcwd(), '斗图啦', filename)
                print(filename)
                # Cause of the broken images: data is only the image URL,
                # and it is written to the file as text
                with open(filename, 'w') as fp:
                    fp.write(data)


def main():
    if os.path.exists(os.path.join(os.getcwd(), '斗图啦')):
        print('Folder already exists')
    else:
        os.mkdir(os.path.join(os.getcwd(), '斗图啦'))
    # for x in range(1, 101):
    #     url = 'http://www.doutula.com/photo/list/?page=%d' % x
    #     get_url(url)
    url = 'http://www.doutula.com/photo/list/?page=1'
    get_url(url)


if __name__ == '__main__':
    main()