采用scrapy框架爬取二手房数据，显示没有爬取到页面和项目，不清楚问题原因

1.item

import scrapy
class LianjiaItem(scrapy.Item):
    # define the fields for your item here like:
    # 房屋名称
    name = scrapy.Field()
    # 房屋户型
    type = scrapy.Field()
    # 建筑面积
    area = scrapy.Field()
    # 房屋朝向
    direction = scrapy.Field()
    # 装修情况
    fitment = scrapy.Field()
    # 有无电梯
    elevator = scrapy.Field()
    # 房屋总价
    total_price = scrapy.Field()
    # 房屋单价
    unit_price = scrapy.Field()
    # 房屋产权
    property = scrapy.Field()

2.settings

    BOT_NAME = 'lianjia'
    SPIDER_MODULES = ['lianjia.spiders']
    NEWSPIDER_MODULE = 'lianjia.spiders'
    USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
    ROBOTSTXT_OBEY = False
    ITEM_PIPELINES = {
   'lianjia.pipelines.FilterPipeline': 100,
   'lianjia.pipelines.CSVPipeline': 200,
}

3.pipelines

import re
from scrapy.exceptions import DropItem
class FilterPipeline(object):
    def process_item(self,item,spider):
        item['area'] = re.findall(r"\d+\.?\d*",item["area"])[0]
        if item["direction"] == '暂无数据':
            raise DropItem("房屋朝向无数据，抛弃此项目：%s"%item)
        return item
class CSVPipeline(object):
    index = 0
    file = None
    def open_spider(self,spider):
        self.file = open("home.csv","a")
    def process_item(self, item, spider):
        if self.index == 0:
            column_name = "name,type,area,direction,fitment,elevator,total_price,unit_price,property\n"
            self.file.write(column_name)
            self.index = 1
        home_str = item['name']+","+item['type']+","+item['area']+","+item['direction']+","+item['fitment']+","+item['elevator']+","+item['total_price']+","+item['unit_price']+","+item['property']+"\n"
        self.file.write(home_str)
        return item
    def close_spider(self,spider):
        self.file.close()

4.lianjia_spider

import scrapy
from scrapy import Request
from lianjia.items import LianjiaItem

class LianjiaSpiderSpider(scrapy.Spider):
    name = 'lianjia_spider'
    # 获取初始请求
    def start_requests(self):
        # 生成请求对象
        url = 'https://bj.lianjia.com/ershoufang/'
        yield Request(url)
    # 实现主页面解析函数
    def parse(self, response):
        # 使用xpath定位到二手房信息的div元素,保存到列表中
        list_selector = response.xpath("//li/div[@class = 'info clear']")
        # 依次遍历每个选择器,获取二手房的名称,户型,面积,朝向等信息
        for one_selector in list_selector:
            try:
                name = one_selector.xpath("div[@class = 'title']/a/text()").extract_first()
                other = one_selector.xpath("div[@class = 'address']/div[@class = 'houseInfo']/text()").extract_first()
                other_list = other.split("|")
                type = other_list[0].strip(" ")
                area = other_list[1].strip(" ")
                direction = other_list[2].strip(" ")
                fitment = other_list[3].strip(" ")
                total_price = one_selector.xpath("//div[@class = 'totalPrice']/span/text()").extract_first()
                unit_price = one_selector.xpath("//div[@class = 'unitPrice']/@data-price").extract_first()
                url = one_selector.xpath("div[@class = 'title']/a/@href").extract_first()
                yield Request(url,meta={"name":name,"type":type,"area":area,"direction":direction,"fitment":fitment,"total_price":total_price,"unit_price":unit_price},callback=self.otherinformation)
            except:
                pass
        current_page = response.xpath("//div[@class = 'page-box house-lst-page-box']/@page-data").extract_first().split(',')[1].split(':')[1]
        current_page = current_page.replace("}", "")
        current_page = int(current_page)
        if current_page < 100:
            current_page += 1
            next_url = "https://bj.lianjia.com/ershoufang/pg%d/" %(current_page)
            yield Request(next_url,callback=self.otherinformation)
    def otherinformation(self,response):
        elevator = response.xpath("//div[@class = 'base']/div[@class = 'content']/ul/li[12]/text()").extract_first()
        property = response.xpath("//div[@class = 'transaction']/div[@class = 'content']/ul/li[5]/span[2]/text()").extract_first()
        item = LianjiaItem()
        item["name"] = response.meta['name']
        item["type"] = response.meta['type']
        item["area"] = response.meta['area']
        item["direction"] = response.meta['direction']
        item["fitment"] = response.meta['fitment']
        item["total_price"] = response.meta['total_price']
        item["unit_price"] = response.meta['unit_price']
        item["property"] = property
        item["elevator"] = elevator
        yield item

提示错误：

de - interpreting them as being unequal
  if item["direction"] == '鏆傛棤鏁版嵁':

2019-11-25 10:53:35 [scrapy.core.scraper] ERROR: Error processing {'area': u'75.6',
 'direction': u'\u897f\u5357',
 'elevator': u'\u6709',
 'fitment': u'\u7b80\u88c5',
 'name': u'\u6b64\u6237\u578b\u517113\u5957 \u89c6\u91ce\u91c7\u5149\u597d \u65e0\u786c\u4f24 \u4e1a\u4e3b\u8bda\u610f\u51fa\u552e',
 'property': u'\u6ee1\u4e94\u5e74',
 'total_price': None,
 'type': u'2\u5ba41\u5385',
 'unit_price': None}
Traceback (most recent call last):
  File "f:\python_3.6\venv\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "F:\python_3.6\lianjia\lianjia\pipelines.py", line 25, in process_item
    home_str = item['name']+","+item['type']+","+item['area']+","+item['direction']+","+item['fitment']+","+item['elevator']+","+item['total_price']+","+item['unit_price']+
","+item['property']+"\n"
TypeError: coercing to Unicode: need string or buffer, NoneType found

写回答
好问题 0 提建议
追加酬金
关注问题
分享
邀请回答
编辑收藏删除结题
收藏举报

2条回答

关注

码龄粉丝数原力等级 --

被采纳

被点赞

采纳率
蔡能教授，网站特聘专家 2019-11-25 12:48
关注
https://blog.csdn.net/weixin_41931602/article/details/80200695

解决无用
评论打赏
分享
举报

评论

按下Enter换行，Ctrl+Enter发表内容

报告相同问题？

关注问题

scrapy-爬取京东笔记本电脑信息问题 chrome python selenium 开发语言
2020-09-01 19:12

回答 2 已采纳 ``` browser.quit() return HtmlResponse(url=request.url, body=browser.page_source, re
scrapy爬取图片，爬取不到 python 有问必答
2021-05-23 20:32

回答 2 已采纳你已经爬到图片连接了，这个看到的管道文件的代码怎样写，要对图片链接发送请求访问，然后保存才行
利用scrapy如何爬取图表中的数据 python 其他有问必答
2021-07-07 23:48

回答 1 已采纳 scrapy得出的响应内容是在network的doc里面，如图如果对你有帮助，可以点击我这个回答右上方的【采纳】按钮，给我个采纳吗，谢谢
基于Python Scrapy爬虫框架实现的链家二手房数据爬取系统的设计与实现毕业设计论文答辩用 1万+字共41页.pdf
2022-02-13 17:31

本系统采用Scrapy爬虫框架来开发，使用Xpath网页提取技术对下载网页进行内容解析，使用Redis做分布式，使用MongoDB对提取的数据进行存储，使用Django开发可视化界面对爬取的结果进行友好展示，设计并实现了针对链家...
scrapy框架+formdata+ajax爬取及翻页问题 python 数据挖掘测试用例
2020-03-25 14:18

回答 1 已采纳 def parse(self, response): result = eval(response.body.decode('utf-8')) 兄弟，你打印一下resu
在以瀑布流方式翻页的网站,使用scrapy网络爬虫,但是只爬取了第一页数据,没有爬取第二页. python 爬虫
2021-09-05 19:18

回答 2 已采纳那叫ajax，
请问Python爬虫如何把爬取数据存入csv文件中 python 开发语言有问必答爬虫
2021-11-21 21:19

回答 1 已采纳你用open打开csv文件，然后以字符串格式写入就行了，每个数据之间用英文逗号隔开即可
为什么我的scrapy爬不到数据了 python
2020-09-05 13:48

回答 1 已采纳 small_link = 'http:'+li.xpath('./@href').extract_first() 这里错了 response.urljoin(li.xpath('./@href')
Scrapy框架时爬取网页时报错 python 有问必答
2021-05-26 16:56

回答 2 已采纳你的数据清洗方法用错了，参考一下：https://blog.csdn.net/qq_43004728/article/details/84586628，如有帮助，望采纳
关于Scrapy 框架运行不出结果的问题，好像没有报错 python 正则表达式
2020-05-09 18:12

回答 3 已采纳如图：如果你完整的看完scrapy的日志（第一张图），根本原因：你设置了robotstxt服从为真，直接原因：目标网站的robot限制了你的访问
python Scrapy框架爬取58二手房信息（有遗留问题，有懂行大佬欢迎给建议）
2023-12-26 15:39

一个爱学习的菜鸡的博客爬取58信息，并进行入库操作。
如何利用scrapy爬取带标签的网页内容并保存到自己的服务器上？ mysql python sql
2018-02-09 09:34

回答 3 已采纳 1. 把整个爬取到的网页内容直接存储到数据库肯定是可以的，你之所以没有成功，应该是因为你的数据库中的相应字段错了，整个网页内容都比较长，一般都是要用text字段，甚至是LongText)（最大长度42
Python爬虫框架Scrapy入门（三）爬虫实战：爬取链家二手房多页数据使用Item Pipeline处理数据
2020-12-22 10:41

当Spider将收集到的数据封装为Item后，将会被传递到Item Pipeline（项目管道）组件中等待进一步处理。Scrapy犹如一个爬虫流水线，Item Pipeline是流水线的最后一道工序，但它是可选的，默认关闭，使用时需要将它激活...
python爬取二手房信息_使用Scrapy爬取链家二手房信息
2020-12-10 05:50

weixin_39669701的博客 Mysql数据库项目说明：本项目基于Python Scrapy爬虫框架对lianjia房产交易网站二手房小区、小区在售房屋数据进行爬取。数据爬取为三级页面递归爬取，整个爬取流程如下：1. 搜索小区名，在结果页面中找到小区名，...
没有解决我的问题, 去提问

悬赏问题

¥15 聚类分析或者python进行数据分析
¥15 如何用visual studio code实现html页面
¥15 逻辑谓词和消解原理的运用
¥15 三菱伺服电机按启动按钮有使能但不动作
¥15 js，页面2返回页面1时定位进入的设备
¥50 导入文件到网吧的电脑并且在重启之后不会被恢复
¥15 （希望可以解决问题）ma和mb文件无法正常打开，打开后是空白，但是有正常内存占用，但可以在打开Maya应用程序后打开场景ma和mb格式。
¥20 ML307A在使用AT命令连接EMQX平台的MQTT时被拒绝
¥20 腾讯企业邮箱邮件可以恢复么
¥15 有人知道怎么将自己的迁移策略布到edgecloudsim上使用吗？