wggglggg 2021-03-27 21:13 Acceptance rate: 100%
Views: 40
Accepted

Some of the source HTML has four <a> tags and some only two or three. How can I capture them with a regex?

html = '''<dl class="bigtr cl">
  <dt class="li01 ta_c"><b class="nob1">3</b></dt>
  <dt class="li02"><samp class="holdPIC"></samp></dt>
  <dt class="li03 oh"><a href="https://www.1905.com/vod/play/516604.shtml" target="_blank" title="杨贵妃" class=" pl28">杨贵妃</a></dt>
  <dt class="li04 oh"><span><a href="https://www.1905.com/mdb/star/2996732/" target="_blank" title="周洁">周洁</a>/<a href="https://www.1905.com/mdb/star/1973493/" target="_blank" title="刘文治">刘文治</a>/<a href="https://www.1905.com/mdb/star/1403/" target="_blank" title="濮存昕">濮存昕</a>/<a href="https://www.1905.com/mdb/star/2998726/" target="_blank" title="程文宽">程文宽</a></span></dt>
  <dt class="li05 ta_c"><span>39,189</span></dt>
</dl>


                              <dl class="cl">
  <dt class="li01 ta_c"><b class="ptnob">5</b></dt>
  <dt class="li02"><samp class="holdPIC"></samp></dt>
  <dt class="li03 oh"><a href="https://www.1905.com/vod/play/85426.shtml" target="_blank" title="神话" class=" pl28">神话</a></dt>
  <dt class="li04 oh"><span><a href="https://www.1905.com/mdb/star/242/" target="_blank" title="成龙">成龙</a>/<a href="https://www.1905.com/mdb/star/596/" target="_blank" title="金喜善">金喜善</a>/<a href="https://www.1905.com/mdb/star/1297/" target="_blank" title="梁家辉">梁家辉</a>/<a href="https://www.1905.com/mdb/star/1935/" target="_blank" title="于荣光">于荣光</a></span></dt>
  <dt class="li05 ta_c"><span>34,348</span></dt>
</dl>
                              <dl class="cl">
  <dt class="li01 ta_c"><b class="ptnob">6</b></dt>
  <dt class="li02"><samp class="holdPIC"></samp></dt>
  <dt class="li03 oh"><a href="https://www.1905.com/vod/play/85340.shtml" target="_blank" title="我和姐姐" class=" pl28">我和姐姐</a></dt>
  <dt class="li04 oh"><span><a href="https://www.1905.com/mdb/star/3065837/" target="_blank" title="张梦露">张梦露</a>/<a href="https://www.1905.com/mdb/star/3406/" target="_blank" title="刘洋">刘洋</a>/<a href="https://www.1905.com/mdb/star/3065838/" target="_blank" title="易含">易含</a></span></dt>
  <dt class="li05 ta_c"><span>30,709</span></dt>
  </dl>
  '''

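As the title says: when every entry has four actors, I can scrape it fine. When an entry has fewer or more actors, how do I capture them with a regex? I wrote one very long regex, and it only handles exactly four actors.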

import requests, re




def one_page(url):
    response = requests.get(url)
    if response.status_code == 200:
        #         response.encoding = 'utf8'
        return response.text
    return None


def parse_one_page(html):
    partter = re.compile(
        '<dl.*?"li01 ta_c".*?".*?">(.*?)</b>.*?"li03.*?href="(.*?)".*?pl28">(.*?)</a>.*?"li04.*?<a.*?>(.*?)</a>.*?<a.*?>(.*?)</a>.*?<a.*?>(.*?)</a>.*?<a.*?>(.*?)</a>.*?"li05.*?<span>(.*?)</span></dt>.*?</dl>',
        re.S)
    items = re.findall(partter, html)
    print(items)


def main():
    url = 'https://www.1905.com/vod/rank/tao1.shtml'
    html = one_page(url)

    parse_one_page(html)


if __name__ == '__main__':
    main()

As a result, many entries are missed. I'm a beginner who has just started learning, so any guidance would be appreciated.
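
One likely reason entries go missing: with `re.S`, the lazy `.*?` can cross tag and line boundaries, so when an entry has only three `<a>` tags, the fourth `<a.*?>(.*?)</a>` group reaches into the next entry and consumes it. The sketch below reproduces this on a hypothetical, trimmed-down sample (the `A1`, `B1`, `C1` names and `href="#"` values are placeholders, not data from the real page), using only the actor and `li05` part of the pattern above.

```python
import re

# Hypothetical, trimmed-down markup: entries with four, three, and four actors.
sample = '''
<dl><dt class="li04 oh"><span><a href="#">A1</a>/<a href="#">A2</a>/<a href="#">A3</a>/<a href="#">A4</a></span></dt>
<dt class="li05 ta_c"><span>100</span></dt></dl>
<dl><dt class="li04 oh"><span><a href="#">B1</a>/<a href="#">B2</a>/<a href="#">B3</a></span></dt>
<dt class="li05 ta_c"><span>200</span></dt></dl>
<dl><dt class="li04 oh"><span><a href="#">C1</a>/<a href="#">C2</a>/<a href="#">C3</a>/<a href="#">C4</a></span></dt>
<dt class="li05 ta_c"><span>300</span></dt></dl>
'''

# Same shape as the question's pattern: exactly four <a> groups, then the li05 value.
fixed_four = re.compile(
    r'"li04.*?<a.*?>(.*?)</a>.*?<a.*?>(.*?)</a>.*?<a.*?>(.*?)</a>.*?<a.*?>(.*?)</a>'
    r'.*?"li05.*?<span>(.*?)</span>',
    re.S)

items = fixed_four.findall(sample)
for item in items:
    print(item)
# ('A1', 'A2', 'A3', 'A4', '100')
# ('B1', 'B2', 'B3', 'C1', '300')   <- second and third entries merged into one
```

Three entries go in and only two tuples come out: the three-actor entry borrows `C1` and the value `300` from the entry after it, and that entry is never reported on its own.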

3 answers

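    import re

    import requests


    def one_page(url):
        """Return the page HTML, or None if the request did not succeed."""
        response = requests.get(url)
        if response.status_code == 200:
            # response.encoding = 'utf8'
            return response.text
        return None


    def parse_one_page(html):
        # Capture the whole actor <span> as one group (group 4)
        # instead of writing one group per <a> tag.
        partter = re.compile(
            r'<dl.*?"li01 ta_c".*?".*?">(.*?)</b>.*?"li03.*?href="(.*?)".*?pl28">(.*?)</a>.*?"li04.*?<span>(.*?)</span>.*?"li05.*?<span>(.*?)</span></dt>.*?</dl>',
            re.S)
        items = re.findall(partter, html)
        for i, item in enumerate(items):
            r = list(item)
            # Second pass: pull every actor name out of the span,
            # however many <a> tags it contains.
            r[3] = re.findall(r'>(.*?)</a>', r[3])
            items[i] = r
        print(items)


    def main():
        url = 'https://www.1905.com/vod/rank/tao1.shtml'
        html = one_page(url)
        # one_page returns None on a non-200 response; skip parsing in that case.
        if html is not None:
            parse_one_page(html)


    if __name__ == '__main__':
        main()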
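
The approach in the code above can be tried without a network request. The following is a minimal sketch that runs the same two-pass idea on two entries trimmed from the HTML in the question (one with four actors, one with three). The function name `extract_entries` and the dictionary keys, including `count` for the `li05` value, are illustrative assumptions rather than part of the original answer.

```python
import re

# Two entries copied from the question's sample (attributes shortened in the actor links).
SAMPLE = '''<dl class="cl">
  <dt class="li01 ta_c"><b class="ptnob">5</b></dt>
  <dt class="li02"><samp class="holdPIC"></samp></dt>
  <dt class="li03 oh"><a href="https://www.1905.com/vod/play/85426.shtml" target="_blank" title="神话" class=" pl28">神话</a></dt>
  <dt class="li04 oh"><span><a href="https://www.1905.com/mdb/star/242/" title="成龙">成龙</a>/<a href="https://www.1905.com/mdb/star/596/" title="金喜善">金喜善</a>/<a href="https://www.1905.com/mdb/star/1297/" title="梁家辉">梁家辉</a>/<a href="https://www.1905.com/mdb/star/1935/" title="于荣光">于荣光</a></span></dt>
  <dt class="li05 ta_c"><span>34,348</span></dt>
</dl>
<dl class="cl">
  <dt class="li01 ta_c"><b class="ptnob">6</b></dt>
  <dt class="li02"><samp class="holdPIC"></samp></dt>
  <dt class="li03 oh"><a href="https://www.1905.com/vod/play/85340.shtml" target="_blank" title="我和姐姐" class=" pl28">我和姐姐</a></dt>
  <dt class="li04 oh"><span><a href="https://www.1905.com/mdb/star/3065837/" title="张梦露">张梦露</a>/<a href="https://www.1905.com/mdb/star/3406/" title="刘洋">刘洋</a>/<a href="https://www.1905.com/mdb/star/3065838/" title="易含">易含</a></span></dt>
  <dt class="li05 ta_c"><span>30,709</span></dt>
</dl>
'''

# Pass 1: one match per <dl>; the actor cell is captured whole as group 4.
ROW = re.compile(
    r'<dl.*?"li01 ta_c".*?".*?">(.*?)</b>.*?"li03.*?href="(.*?)".*?pl28">(.*?)</a>'
    r'.*?"li04.*?<span>(.*?)</span>.*?"li05.*?<span>(.*?)</span></dt>.*?</dl>',
    re.S)
# Pass 2: every link text inside the captured actor cell.
ACTOR = re.compile(r'>(.*?)</a>')


def extract_entries(html):
    """Hypothetical helper: return one dict per entry, with a variable-length actor list."""
    entries = []
    for rank, url, title, actor_html, count in ROW.findall(html):
        entries.append({
            'rank': rank,
            'url': url,
            'title': title,
            'actors': ACTOR.findall(actor_html),
            'count': count,  # assumption: the meaning of the li05 number is not stated in the source
        })
    return entries


entries = extract_entries(SAMPLE)
for entry in entries:
    print(entry['title'], entry['actors'])
# 神话 ['成龙', '金喜善', '梁家辉', '于荣光']
# 我和姐姐 ['张梦露', '刘洋', '易含']
```

Because the first pattern stops at the closing `</span>` of the actor cell, it no longer depends on how many `<a>` tags the cell holds, and the second `findall` returns as many names as are present.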
    
    This answer was selected as the best answer by the asker.