LGDDDDDD 2018-12-01 11:21 Acceptance rate: 20%

Scraping Douban movies into a database fails with TypeError: %d format: a number is required, not str

import requests
from lxml import etree
import pymysql
import re
import time
conn=pymysql.connect(host='localhost',user='root',passwd='123456',db='mydb',port='3306',charset='utf8')
cursor=conn.cursor()  # connect to the database and create a cursor
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
def get_movie_url(url):
    html=requests.get(url,headers=headers)
    selector=etree.HTML(html.text)
    movie_hrefs=selector.xpath('//div[@class="hd"]/a/@href')
    for movie_href in movie_hrefs:
        get_movie_info(movie_href)

def get_movie_info(url):
    html = requests.get(url, headers=headers)
    selector = etree.HTML(html.text)
    try:
        name=selector.xpath('//div[@id="content"]/h1/span/text()')[0]
        director=selector.xpath('//div[@id="info"]/span[1]/span[2]/a/text()')[0]
        actors=selector.xpath('//div[@id="info"]/span[3]/span[2]')[0]  # select the element, not its text node
        actor=actors.xpath('string(.)')  # concatenate all text inside the element
        style=re.findall('<span property="v:genre">(.*?)</span>',html.text,re.S)[0]
        country=re.findall('<span class="pl">制片国家/地区:</span>(.*?)<br>',html.text,re.S)[0]
        release_time=re.findall('上映日期:</span>.*?>(.*?)</span>',html.text,re.S)[0]
        runtime=re.findall('片长:</span>.*?>(.*?)</span>',html.text,re.S)[0]  # named runtime to avoid shadowing the time module
        score=selector.xpath('//*[@id="interest_sectl"]/div[1]/div[2]/strong/text()')[0]
        cursor.execute(
            "insert into doubanmovie (name,director,actor,style,country,release_time,time,score) values(%s,%s,%s,%s,%s,%s,%s,%s)",
            (str(name),str(director),str(actor),str(style),str(country),str(release_time),str(runtime),str(score)))


    except IndexError:
        pass

if __name__=='__main__':
    urls=['https://movie.douban.com/top250?start={}'.format(str(i)) for i in range(0,250,25)]
    for url in urls:
        get_movie_url(url)
        time.sleep(2)
    conn.commit()
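
To see what the `re.findall` patterns in `get_movie_info` capture, here is a minimal sketch run against a made-up fragment. `SAMPLE_HTML` is a hypothetical snippet written for illustration, not Douban's actual markup, so the real page may need different patterns.

```python
import re

# Hypothetical fragment for illustration only; not copied from Douban.
SAMPLE_HTML = (
    '<span class="pl">类型:</span> <span property="v:genre">剧情</span><br>\n'
    '<span class="pl">片长:</span> <span property="v:runtime">142分钟</span><br>\n'
)

# Same patterns as in the question's get_movie_info.
genre = re.findall('<span property="v:genre">(.*?)</span>', SAMPLE_HTML, re.S)
runtime = re.findall('片长:</span>.*?>(.*?)</span>', SAMPLE_HTML, re.S)

# The sample has no release-date field, so this list comes back empty.
missing = re.findall('上映日期:</span>.*?>(.*?)</span>', SAMPLE_HTML, re.S)
```

`re.findall` returns an empty list when nothing matches, so indexing it with `[0]` raises `IndexError`. In the code above that exception is caught and ignored, which means a movie with any missing field is skipped without a message.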




3 answers

  • threenewbee 2018-12-01 16:06

    urls=['https://movie.douban.com/top250?start={}'.format(str(i)) for i in range(0,250,25)]
    ->
    urls=['https://movie.douban.com/top250?start={}'.format(i) for i in range(0,250,25)]

    If this solves your problem, please click "Accept" at the top left of my answer. Thanks.
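
Note that `'{}'.format(str(i))` and `'{}'.format(i)` build the same URL, so that change alone may not remove the error. A `%d format` TypeError is what Python raises when a string is passed to a `%d` placeholder, and one likely candidate in the question's code is `port='3306'`, since PyMySQL expects an integer port. The sketch below reproduces the error with plain string formatting and shows the connection arguments with an integer port. `format_host_info` and `DB_CONFIG` are hypothetical names used only for illustration, and the exact check inside PyMySQL may differ between versions.

```python
def format_host_info(host, port):
    # Hypothetical helper: formats the port with %d, which only accepts numbers.
    return "socket %s:%d" % (host, port)

# Both spellings from the answer produce identical URLs.
urls_str = ['https://movie.douban.com/top250?start={}'.format(str(i)) for i in range(0, 250, 25)]
urls_int = ['https://movie.douban.com/top250?start={}'.format(i) for i in range(0, 250, 25)]
assert urls_str == urls_int

try:
    format_host_info('localhost', '3306')  # string port, as in the question
    string_port_failed = False
except TypeError:
    string_port_failed = True

# Connection arguments from the question, with the port as an integer.
DB_CONFIG = dict(host='localhost', user='root', passwd='123456',
                 db='mydb', port=3306, charset='utf8')
# conn = pymysql.connect(**DB_CONFIG)  # needs PyMySQL and a running MySQL server
```

If the error persists with an integer port, the full traceback should show which line performs the `%d` formatting.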
