daobalong 2021-01-07 00:12

Python image scraping: can someone help me find the problem?

I've tried many times. It scrapes a bit more than two pages, then, before finishing the requested pages, prints the message from the except block and stops, without any visible error, so I don't know how to debug it. (There are a lot of files to download — do I need to use multiple threads? I tried methods I found online, without success.) I also need to add a piece of information from the detail page, plus the URL, into each image's properties, and I have no idea where to start. I'm a beginner learning bit by bit from search results; please bear with me and point me in the right direction.

import traceback

from bs4 import BeautifulSoup
import requests
import os
import lxml
import json
import time
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}


# Scrape the whole image set from a thumbnail page
def getPic(url):
    print("download pic url +==="+url)
    result = requests.get(url, headers=headers)
    result.encoding = 'utf-8'
    soup = BeautifulSoup(result.content, 'lxml')
    json_data = soup.find('div', attrs={'id': 'gallery-items'})
    # soup.find('a') can return a tag with no string (or nothing at all),
    # which would crash here, so fall back to a placeholder name
    name_tag = soup.find('a')
    name = name_tag.string if name_tag and name_tag.string else 'untitled'

    simpleName = re.sub(r'[/:*?"<>|\\\\]+', '-', name)
    print(simpleName)
    path = 'f:/CodeWar/spider/Archdaily/'

    newPath = os.path.join(path, simpleName)
    os.makedirs(newPath, exist_ok=True)
    os.chdir(newPath)

    # print(newPath)
    figures = json.loads(json_data.get('data-images'))

    i = 1
    for figure in figures:
        # print(figure['url_large'])
        try:
            print('downloading number:' + str(i)+"====>>"+figure['url_large'])
            image = requests.get(url=figure['url_large'], headers=headers)
            if image.status_code == 200:
                # with open(simpleName + str(i) + '.jpg', 'wb') as f:
                with open(str(i) + '.jpg', 'wb') as f:
                    f.write(image.content)
            i += 1
        except Exception:
            traceback.print_exc()  # show the real error instead of swallowing it
            print("figure=======>>ZZzzzz...")
            time.sleep(5)
            continue

# Collect each project page's URL from a listing page
def get_url(page):
    pageResult = requests.get(page, headers=headers)
    pageSoup = BeautifulSoup(pageResult.content, 'lxml')

    for collection in pageSoup.find_all('a', class_='afd-title--black-link'):
        if 'href' in collection.attrs:
            sonLink = 'https://www.archdaily.com' + collection.attrs['href']
            sonResponde = requests.get(sonLink, headers=headers)
            sonResponde.encoding = 'utf-8'
            sonSoup = BeautifulSoup(sonResponde.content, 'lxml')
            thumb = sonSoup.find('a', class_='gallery-thumbs-link')
            if thumb:
                thumbLink = 'https://www.archdaily.com' + thumb.attrs['href']
                # print(thumbLink)
                try:
                    getPic(thumbLink)
                except Exception:
                    traceback.print_exc()  # show why this gallery failed
                    print("ZZzzzz...")
                    time.sleep(5)
                    continue
            # print(url_collections)
        print('---------- create next folder ----------')

motherWeb = 'https://www.archdaily.com/page/'
n = 0
# number of pages to scrape
wanna_page = 10
while n < wanna_page:
    n += 1
    sourceWeb = motherWeb + str(n)
    try:
        get_url(sourceWeb)
        print('this is page ' + str(n))
    except Exception:
        traceback.print_exc()  # print the real reason before retrying
        print("Connection refused by the server..")
        print("Let me sleep for 5 seconds")
        time.sleep(5)
        print("Was a nice sleep, now let me continue...")
        continue
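A likely reason the script seems to stop without an error: requests.get is called with no timeout, so a throttled or stalled response can hang or fail silently inside the bare except blocks. A minimal sketch of a retry wrapper with a fixed delay between attempts (fetch_with_retry and its parameter names are my own, not part of the original code):

```python
import time

def fetch_with_retry(fetch, retries=3, delay=5):
    """Call fetch() up to `retries` times, sleeping `delay` seconds
    between attempts; return the first successful result, or re-raise
    the last error if every attempt fails."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch()
        except Exception as e:  # in real use, catch requests.RequestException
            last_error = e
            time.sleep(delay)
    raise last_error

# In the scraper this would wrap each request, e.g.:
# image = fetch_with_retry(lambda: requests.get(figure['url_large'],
#                                               headers=headers, timeout=10))
```

The `timeout=10` shown in the comment is the important part: without it, a single stalled connection can block the whole loop indefinitely.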

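On the multithreading question: for I/O-bound downloads like these, the standard library's ThreadPoolExecutor is usually enough, with no manual thread management. A minimal sketch, assuming a worker callable that downloads one item (download_all and worker are hypothetical names, not from the original code):

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(figures, worker, max_workers=4):
    """Run worker(figure) for every figure using a small thread pool.
    pool.map preserves input order in the returned results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, figures))

# In the scraper, worker would be a function that fetches
# figure['url_large'] and writes the bytes to a numbered file.
```

Keep max_workers small (4-8): hammering the site with many parallel requests makes throttling or blocking more likely, which may be what is happening already.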

1 answer

  • bj_0163_bj 2021-01-07 10:10

    You mean putting it into the image's detailed properties, right? You can do that by modifying the image's EXIF data. As a demo, this puts your User-Agent string into the Artist field:

    from PIL import Image
    import piexif

    im = Image.open('4.jpg')
    # piexif.load raises KeyError when the file has no EXIF block,
    # so fall back to an empty skeleton in that case
    if "exif" in im.info:
        exif_dict = piexif.load(im.info["exif"])
    else:
        exif_dict = {"0th": {}, "Exif": {}, "GPS": {}, "1st": {}, "thumbnail": None}
    exif_dict["0th"][piexif.ImageIFD.Artist] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36".encode()
    exif_bytes = piexif.dump(exif_dict)
    im.save("4.jpg", exif=exif_bytes)
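    Following the same piexif approach, the detail-page text plus the source URL can be written into the EXIF ImageDescription field. A sketch, assuming Pillow and piexif are installed; build_description and tag_image are my own helper names, not from the code above:

```python
def build_description(title, url):
    # Combine the detail-page text and the source URL into one
    # UTF-8 byte string suitable for the EXIF ImageDescription field.
    return (title + " | " + url).encode("utf-8")

def tag_image(path, title, url):
    # Imported here so build_description stays dependency-free.
    from PIL import Image
    import piexif

    im = Image.open(path)
    # Freshly downloaded JPEGs often have no EXIF block at all,
    # so start from an empty skeleton when one is missing.
    if "exif" in im.info:
        exif_dict = piexif.load(im.info["exif"])
    else:
        exif_dict = {"0th": {}, "Exif": {}, "GPS": {}, "1st": {}, "thumbnail": None}
    exif_dict["0th"][piexif.ImageIFD.ImageDescription] = build_description(title, url)
    im.save(path, exif=piexif.dump(exif_dict))
```

    In the scraper, this would be called right after each file is written, passing the detail-page text and the page URL the questioner wants to preserve.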
