LGDDDDDD 2018-12-01 11:21 Acceptance rate: 20%
Views: 3153

Scraping Douban movies into a database fails with TypeError: %d format: a number is required, not str

import requests
from lxml import etree
import pymysql
import re
import time
conn=pymysql.connect(host='localhost',user='root',passwd='123456',db='mydb',port=3306,charset='utf8')  # port must be an int, not the string '3306'
cursor=conn.cursor()  # open the database connection and get a cursor
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
def get_movie_url(url):
    html=requests.get(url,headers=headers)
    selector=etree.HTML(html.text)
    movie_hrefs=selector.xpath('//div[@class="hd"]/a/@href')
    for movie_href in movie_hrefs:
        get_movie_info(movie_href)

def get_movie_info(url):
    html = requests.get(url, headers=headers)
    selector = etree.HTML(html.text)
    try:
        name=selector.xpath('//div[@id="content"]/h1/span/text()')[0]
        director=selector.xpath('//div[@id="info"]/span[1]/span[2]/a/text()')[0]
        # Select the element (not its text node) so that string(.) can collect all nested text
        actors=selector.xpath('//div[@id="info"]/span[3]/span[2]')[0]
        actor=actors.xpath('string(.)')
        style=re.findall('<span property="v:genre">(.*?)</span>',html.text,re.S)[0]
        country=re.findall('<span class="pl">制片国家/地区:</span>(.*?)<br>',html.text,re.S)[0]
        release_time=re.findall('上映日期:</span>.*?>(.*?)</span>',html.text,re.S)[0]
        runtime=re.findall('片长:</span>.*?>(.*?)</span>',html.text,re.S)[0]  # named runtime to avoid shadowing the time module
        score=selector.xpath('//*[@id="interest_sectl"]/div[1]/div[2]/strong/text()')[0]
        cursor.execute(
            "insert into doubanmovie (name,director,actor,style,country,release_time,time,score) values(%s,%s,%s,%s,%s,%s,%s,%s)",
            (str(name),str(director),str(actor),str(style),str(country),str(release_time),str(runtime),str(score)))


    except IndexError:
        pass

if __name__=='__main__':
    urls=['https://movie.douban.com/top250?start={}'.format(str(i)) for i in range(0,250,25)]
    for url in urls:
        get_movie_url(url)
        time.sleep(2)
    conn.commit()




3 answers

  • threenewbee 2018-12-01 16:06

    urls=['https://movie.douban.com/top250?start={}'.format(str(i)) for i in range(0,250,25)]
    ->
    urls=['https://movie.douban.com/top250?start={}'.format(i) for i in range(0,250,25)]

    If this solves your problem, please accept my answer (the button at the top-left corner of the answer). Thanks.
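
A note on the suggested change: `'...{}'.format(str(i))` and `'...{}'.format(i)` build the same URL, so that edit on its own is unlikely to make the error go away. The `%d` in the message suggests that a string was formatted where an integer was expected, and the one candidate in the question's code is `port='3306'` in `pymysql.connect`. The sketch below reproduces that kind of failure without a database. `describe_endpoint` and `try_describe` are hypothetical names, standing in for whatever the driver does internally with the port; this is an assumption, not PyMySQL's actual code.

```python
def describe_endpoint(host, port):
    """Hypothetical stand-in for driver code that formats the port with %d."""
    return "socket %s:%d" % (host, port)


def try_describe(host, port):
    """Return the formatted endpoint, or the TypeError text if formatting fails."""
    try:
        return describe_endpoint(host, port)
    except TypeError as exc:
        return "TypeError: {}".format(exc)


# An int port formats cleanly.
print(try_describe("localhost", 3306))

# A str port triggers a "%d format" TypeError like the one in the question
# (the exact wording differs slightly between Python versions).
print(try_describe("localhost", "3306"))

# str.format gives the same result with or without str(), so the URL list
# in the question is not the source of the error.
same = "start={}".format(str(25)) == "start={}".format(25)
print(same)
```

If this is indeed the cause, passing `port=3306` (an int) to `pymysql.connect` should be enough to get past the error.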

