niexiaosu8167 2017-10-22 23:37 · Acceptance rate: 0%
2618 views

Selenium scrape of a dynamically loaded site stops partway: after the 10th listing it cannot continue

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
import time

driver = webdriver.Chrome()
driver.maximize_window()

def crawlHouseDetailForInvoke():
    # Expand the price/tax history section on the detail page and print the table
    driver.find_element_by_class_name("collapsible-header").click()
    time.sleep(5)
    table = driver.find_element_by_xpath('//div[@id = "wrapper"]/div/div/div/div/div[@id = "detail-container-column"]/div/div/main/div/div/div/div/div/section[3]/div/div/div/table')
    print(table.text)

def crawlRegion(url):
    # Open the search-results page and collect the links to each listing
    driver.get(url)
    div_parent = driver.find_element_by_id('list-results')
    a_link = div_parent.find_elements_by_xpath('//div[@id = "search-results"]/ul/li/article/div/a')
    print("information in this page:%d" % len(a_link))
    for i in range(len(a_link)):
        try:
            print(i)
            print(a_link[i].get_attribute("href"))
            a_link[i].click()
            time.sleep(8)
            crawlHouseDetailForInvoke()
        except Exception as e:
            continue
        finally:
            # Return to the results page before moving on to the next listing
            driver.back()

if __name__ == "__main__":
    regionUrl = "https://www.zillow.com/homes/recently_sold/Culver-City-CA/house,condo,apartment_duplex,townhouse_type/51617_rid/12m_days/globalrelevanceex_sort/34.05529,-118.33211,33.956531,-118.485919_rect/12_zm/"
    print("crawler is started...")
    crawlRegion(regionUrl)
    driver.close()
    driver.quit()

1 answer

  • wujianqinjian 2018-11-24 08:09

    I don't see you using any proxy IPs. Almost every site has anti-scraping measures, so if you keep crawling from the same IP the server is bound to start refusing you. Start by learning how to use a proxy in your code; if you really can't get the proxy working, add me on QQ and describe the specific problem: 775662401!
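    As a rough sketch of the proxy suggestion above (the proxy address is a placeholder and the idea of rotating through a pool is an assumption, not something from the original post), Chrome can be pointed at a proxy through ChromeOptions like this:

    from selenium import webdriver

    # Placeholder proxy address -- substitute a working proxy, or rotate through a pool of them
    PROXY = "203.0.113.10:8080"

    options = webdriver.ChromeOptions()
    options.add_argument("--proxy-server=http://%s" % PROXY)

    # Older Selenium 3 releases use the keyword chrome_options= instead of options=
    driver = webdriver.Chrome(options=options)

    # Quick sanity check: the page should report the proxy's IP, not your own
    driver.get("https://httpbin.org/ip")
    print(driver.page_source)
    driver.quit()

    Combining a proxy pool with longer, randomized delays between requests is the usual way to keep a scraper from being cut off by a single-IP rate limit.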

