九十辰 2021-12-29 20:56 Acceptance rate: 100%
Views: 65
Closed

XPath error when scraping a website

Symptoms and background

The XPath expression on line 26 is invalid.

Relevant code (please do not paste screenshots)

from lxml import etree

import requests

if __name__ == '__main__':
    url = 'https://m.58.com/bj/ershoufang/?reform=pcfront'
    # Spoof the User-Agent
    head = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Mobile Safari/537.36'
    }
    # universal crawler
    page_text = requests.get(url=url, headers=head).text
    # xpath
    parser = etree.HTMLParser(encoding='utf-8')
    tree = etree.HTML(page_text, parser=parser)
    print(tree)
    li_list = tree.xpath('//ul[@class="list"]/li[@class="item-wrap"]')
    print(li_list)
    with open(r'../gotpages/58secondhand_houses.txt', 'w', encoding='utf-8') as stream:
        for li in li_list:
            house_name = li.xpath('./span[@class="content-title"]/text()]')
            #print(house_name)
            stream.write(house_name)
            print(house_name)



Output and error message
F:\pythonfiles\PycharmProjects\CRAWLER\venv\Scripts\python.exe "F:/pythonfiles/PycharmProjects/CRAWLER/focused crawler-Data analysis/crawler_58com realization in xpath.py"
Traceback (most recent call last):
  File "F:\pythonfiles\PycharmProjects\CRAWLER\focused crawler-Data analysis\crawler_58com realization in xpath.py", line 26, in <module>
    house_name = li.xpath('./span[@class="content-title"]/text()]')
  File "src\lxml\etree.pyx", line 1597, in lxml.etree._Element.xpath
  File "src\lxml\xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
  File "src\lxml\xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
lxml.etree.XPathEvalError: Invalid expression

Process finished with exit code 1
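
The "Invalid expression" error above comes from the XPath parser, before any element is matched. One way to catch this kind of typo early is to compile the expression on its own with lxml's `etree.XPath` class, which rejects malformed expressions when it is constructed. This is a minimal sketch; `is_valid_xpath` is a hypothetical helper name, not part of lxml or of the original script.

```python
from lxml import etree


def is_valid_xpath(expression):
    """Return True if lxml can compile the expression, False otherwise."""
    try:
        etree.XPath(expression)
    except etree.XPathError:
        # Base class of lxml's XPath syntax and evaluation errors.
        return False
    return True


# The expression from the traceback, with its stray "]" at the end.
print(is_valid_xpath('./span[@class="content-title"]/text()]'))
# The same expression without the stray bracket.
print(is_valid_xpath('./span[@class="content-title"]/text()'))
```

A check like this only confirms that the syntax is valid. It says nothing about whether the expression matches anything on the page.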




4 answers

  • CSDN专家-showbo 2021-12-29 21:04

    There is an extra closing square bracket ] at the end of the expression; delete it. The XPath itself also has a problem.

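    Changing the line as follows fixes it: house_name = li.xpath('.//span[@class="content-title"]/text()')[0]
    Here .// searches all descendants of the li instead of only its direct children, and [0] takes the first match, because xpath() returns a list while stream.write() needs a string.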
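
To make the two changes in this answer concrete, here is a minimal sketch run against an invented HTML fragment. The real 58.com markup is not shown in this thread, so the nesting in `SAMPLE_HTML` (a `<span>` wrapped in a `<div>` inside each `<li>`) is an assumption made only for illustration.

```python
from lxml import etree

# Hypothetical markup: the actual page structure is not shown in the thread.
SAMPLE_HTML = """
<ul class="list">
  <li class="item-wrap">
    <div class="content">
      <span class="content-title">Sample listing A</span>
    </div>
  </li>
</ul>
"""

tree = etree.HTML(SAMPLE_HTML)
li = tree.xpath('//ul[@class="list"]/li[@class="item-wrap"]')[0]

# 1) The stray "]" makes the expression syntactically invalid.
try:
    li.xpath('./span[@class="content-title"]/text()]')
    bad_expr_error = None
except etree.XPathError as exc:  # XPathEvalError is a subclass of XPathError
    bad_expr_error = exc

# 2) "./span" only looks at direct children of <li>; the span here is nested deeper.
direct_children = li.xpath('./span[@class="content-title"]/text()')

# 3) ".//span" looks at every descendant of <li>.
descendants = li.xpath('.//span[@class="content-title"]/text()')

print(type(bad_expr_error).__name__)
print(direct_children)
print(descendants)
```

If the span on the real page were a direct child of the `<li>`, `./span` would also match. The `.//` form simply does not depend on the nesting depth.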

    import requests
    from lxml import etree
    if __name__ == '__main__':
        url = 'https://m.58.com/bj/ershoufang/?reform=pcfront'
        # Spoof the User-Agent
        head = {
            'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Mobile Safari/537.36'
        }
        # universal crawler
        page_text = requests.get(url=url, headers=head).text
        # xpath
        parser = etree.HTMLParser(encoding='utf-8')
        tree = etree.HTML(page_text, parser=parser)
        print(tree)
        li_list = tree.xpath('//ul[@class="list"]/li[@class="item-wrap"]')
        print(li_list)
        with open(r'gotpages/58secondhand_houses.txt', 'w', encoding='utf-8') as stream:
            for li in li_list:
                house_name = li.xpath('.//span[@class="content-title"]/text()')[0]
                #print(house_name)
                stream.write(house_name)
                print(house_name)
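
Two details of the loop above may be worth hardening. The `[0]` index raises `IndexError` for any `<li>` with no matching `<span>`, and `write()` adds no line break, so the titles run together in the file. The variant below is a hedged sketch, not the answerer's code. It runs on invented sample HTML and writes to an in-memory buffer rather than the file path used in the thread, and `extract_titles` is a hypothetical helper name.

```python
import io

from lxml import etree

# Hypothetical markup: the second <li> deliberately has no title span.
SAMPLE_HTML = """
<ul class="list">
  <li class="item-wrap"><div><span class="content-title"> Sample listing A </span></div></li>
  <li class="item-wrap"><div><span class="price">100</span></div></li>
  <li class="item-wrap"><div><span class="content-title">Sample listing C</span></div></li>
</ul>
"""


def extract_titles(page_text):
    """Return the stripped title of every list item that has one."""
    tree = etree.HTML(page_text)
    titles = []
    for li in tree.xpath('//ul[@class="list"]/li[@class="item-wrap"]'):
        matches = li.xpath('.//span[@class="content-title"]/text()')
        if not matches:
            # Skip items without a title instead of failing on matches[0].
            continue
        titles.append(matches[0].strip())
    return titles


buffer = io.StringIO()
for title in extract_titles(SAMPLE_HTML):
    buffer.write(title + '\n')  # one title per line

print(buffer.getvalue(), end='')
```

Replacing the `io.StringIO()` buffer with the `open(...)` call from the original script gives the same behavior on disk.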
     
     
    
    


    This answer was selected by the asker as the best answer.

Question events

  • Closed by the system on February 14
  • Answer accepted on February 6
  • Question created on December 29
