Getting BeautifulSoup to parse PHP tags correctly or ignore them

I currently need to parse a lot of .phtml files, find specific HTML tags and add a custom data attribute to them. I'm using Python's BeautifulSoup to parse the entire document and add the attributes, and this part works just fine.

The problem is that the view files (.phtml) also contain PHP tags, and these get parsed too. Below is an example of the input and the resulting output.

INPUT

<?php

$stars = $this->getData('sideBarCoStars', []);

if (!$stars) return;

$sideBarCoStarsCount = $this->getData('sideBarCoStarsCount');
$title = $this->getData('sideBarCoStarsTitle');
$viewAllUrl = $this->getData('sideBarCoStarsViewAllUrl');
$isDomain = $this->getData('isDomain');
$lazy_load = $lazy_load ?? 0;
$imageSrc = $this->getData('emptyImageData');
?>
<header>
    <h3>
        <a href="<?php echo $viewAllUrl; ?>" class="noContentLink white">
        <?php echo "{$title} ({$sideBarCoStarsCount})"; ?>
        </a>
    </h3>

OUTPUT

<?php
$stars = $this->
getData('sideBarCoStars', []);

if (!$stars) return;

$sideBarCoStarsCount = $this-&gt;getData('sideBarCoStarsCount');
$title = $this-&gt;getData('sideBarCoStarsTitle');
$viewAllUrl = $this-&gt;getData('sideBarCoStarsViewAllUrl');
$isDomain = $this-&gt;getData('isDomain');
$lazy_load = $lazy_load ?? 0;
$imageSrc = $this-&gt;getData('emptyImageData');
?&gt;
<header>
 <h3>
  <a class="noContentLink white" href="&lt;?php echo $viewAllUrl; ?&gt;">
   <?php echo "{$title} ({$sideBarCoStarsCount})"; ?>
  </a>
 </h3>

I tried different approaches, but didn't succeed in making BeautifulSoup ignore the PHP tags. Is it possible to give html.parser custom rules to ignore PHP tags, or to configure BeautifulSoup to do so? Thanks!

1 answer

关注

码龄粉丝数原力等级 --

被采纳

被点赞

采纳率
doudouwd2017 2019-04-26 11:45
Your best bet is to remove all of the PHP elements before giving the document to BeautifulSoup to parse. This can be done using a regular expression to spot all PHP sections and replace them with safe placeholder text.
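For context on why the PHP needs to come out first, here is a minimal sketch using only the standard library's html.parser, which is the parser BeautifulSoup is driving in this question. The PICollector class is a hypothetical helper written for this illustration. It records what the parser reports as a processing instruction, and shows that the instruction is treated as ending at the first '>' rather than at the closing '?>', which would explain why the output in the question breaks right after `$this->`.

```python
from html.parser import HTMLParser


class PICollector(HTMLParser):
    """Hypothetical helper: records every processing instruction reported."""

    def __init__(self):
        super().__init__()
        self.pis = []

    def handle_pi(self, data):
        # html.parser calls this for anything that starts with '<?'
        self.pis.append(data)


collector = PICollector()
collector.feed("<?php $stars = $this->getData('sideBarCoStars', []); ?>")
collector.close()

# The recorded instruction stops at the '>' of '->'; the rest of the PHP
# statement is handed to the parser as ordinary text.
print(collector.pis)
```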

After carrying out all of your modifications using BeautifulSoup, the PHP expressions can then be put back.

As the PHP can be anywhere, e.g. inside a quoted attribute value, it is best to use a simple unique string as the placeholder rather than trying to wrap the PHP in an HTML comment (see php_sig).

re.sub() can be given a function as its replacement argument. Each time a substitution is made, the original PHP code is appended to a list (php_elements). Afterwards the reverse is done: every instance of php_sig is replaced with the next element from php_elements. If all goes well, php_elements should be empty at the end; if it is not, your modifications have removed a placeholder.

from bs4 import BeautifulSoup
import re

html = """<html>
<body>
<?php
$stars = $this->getData('sideBarCoStars', []);

if (!$stars) return;

$sideBarCoStarsCount = $this->getData('sideBarCoStarsCount');
$title = $this->getData('sideBarCoStarsTitle');
$viewAllUrl = $this->getData('sideBarCoStarsViewAllUrl');
$isDomain = $this->getData('isDomain');
$lazy_load = $lazy_load ?? 0;
$imageSrc = $this->getData('emptyImageData');
?>
<header>
    <h3>
        <a href="<?php echo $viewAllUrl; ?>" class="noContentLink white">
        <?php echo "{$title} ({$sideBarCoStarsCount})"; ?>
        </a>
    </h3>
</body>"""

php_sig = '!!!PHP!!!'
php_elements = []

def php_remove(m):
    php_elements.append(m.group())
    return php_sig

def php_add(m):
    return php_elements.pop(0)

# Pre-parse HTML to remove all PHP elements
html = re.sub(r'<\?php.*?\?>', php_remove, html, flags=re.S+re.M)

soup = BeautifulSoup(html, "html.parser")

# Make modifications to the soup
# Do not remove any elements containing PHP elements

# Post-parse HTML to replace the PHP elements
html = re.sub(php_sig, php_add, soup.prettify())

print(html)
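One possible variation on the code above, offered as a sketch rather than a definitive implementation: give each placeholder its own index, so that the PHP blocks are restored correctly even if the modifications reorder elements, and raise an error when a placeholder has gone missing. The names protect_php, restore_php, the PHPBLOCK<n>X placeholder format and the data-tracked attribute are all hypothetical and not part of the original answer. The pattern also matches the `<?=` short echo tag, which the original regular expression does not cover. It assumes the source never contains text that already looks like a placeholder.

```python
import re

from bs4 import BeautifulSoup

# Matches <?php ... ?> and <?= ... ?> blocks, including multi-line ones.
PHP_BLOCK = re.compile(r"<\?(?:php|=).*?\?>", re.S)
# Hypothetical placeholder format; the trailing X keeps the index unambiguous.
PLACEHOLDER = re.compile(r"PHPBLOCK(\d+)X")


def protect_php(source):
    """Replace each PHP block with an indexed placeholder."""
    blocks = []

    def stash(match):
        blocks.append(match.group())
        return f"PHPBLOCK{len(blocks) - 1}X"

    return PHP_BLOCK.sub(stash, source), blocks


def restore_php(markup, blocks):
    """Put the PHP blocks back and fail loudly if any placeholder was lost."""
    restored = set()

    def unstash(match):
        index = int(match.group(1))
        restored.add(index)
        return blocks[index]

    result = PLACEHOLDER.sub(unstash, markup)
    missing = set(range(len(blocks))) - restored
    if missing:
        raise ValueError(f"placeholders lost during editing: {sorted(missing)}")
    return result


source = (
    '<h3><a href="<?php echo $viewAllUrl; ?>" class="noContentLink white">'
    "<?php echo $title; ?></a></h3>"
)

safe, blocks = protect_php(source)
soup = BeautifulSoup(safe, "html.parser")

# Example modification: add a custom data attribute to every link.
for link in soup.find_all("a"):
    link["data-tracked"] = "1"

result = restore_php(str(soup), blocks)
print(result)
```

Using str(soup) instead of soup.prettify() here keeps the original whitespace as far as the parser allows; either works with the placeholder approach.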