dtwncxs3547 2019-04-24 15:55
Getting BeautifulSoup to parse PHP tags correctly or ignore them

I need to parse a lot of .phtml files, find specific HTML tags, and add a custom data attribute to them. I'm using Python's BeautifulSoup to parse the entire document and add the attributes, and this part works just fine.

The problem is that the view files (.phtml) contain PHP tags, and those get parsed too. Below is an example of input and output.

INPUT

<?php

$stars = $this->getData('sideBarCoStars', []);

if (!$stars) return;

$sideBarCoStarsCount = $this->getData('sideBarCoStarsCount');
$title = $this->getData('sideBarCoStarsTitle');
$viewAllUrl = $this->getData('sideBarCoStarsViewAllUrl');
$isDomain = $this->getData('isDomain');
$lazy_load = $lazy_load ?? 0;
$imageSrc = $this->getData('emptyImageData');
?>
<header>
    <h3>
        <a href="<?php echo $viewAllUrl; ?>" class="noContentLink white">
        <?php echo "{$title} ({$sideBarCoStarsCount})"; ?>
        </a>
    </h3>

OUTPUT

<?php
$stars = $this->
getData('sideBarCoStars', []);

if (!$stars) return;

$sideBarCoStarsCount = $this-&gt;getData('sideBarCoStarsCount');
$title = $this-&gt;getData('sideBarCoStarsTitle');
$viewAllUrl = $this-&gt;getData('sideBarCoStarsViewAllUrl');
$isDomain = $this-&gt;getData('isDomain');
$lazy_load = $lazy_load ?? 0;
$imageSrc = $this-&gt;getData('emptyImageData');
?&gt;
<header>
 <h3>
  <a class="noContentLink white" href="&lt;?php echo $viewAllUrl; ?&gt;">
   <?php echo "{$title} ({$sideBarCoStarsCount})"; ?>
  </a>
 </h3>

I tried different approaches but could not make BeautifulSoup ignore the PHP tags. Is it possible to give html.parser custom rules to ignore <?php ... ?> blocks, or to configure BeautifulSoup to do so? Thanks!

1 answer

  • doudouwd2017 2019-04-26 11:45

    Your best bet is to remove all of the PHP elements before passing the document to BeautifulSoup. This can be done with a regular expression that finds every PHP section and replaces it with safe placeholder text.

    After carrying out all of your modifications using BeautifulSoup, the original PHP expressions can be put back in place of the placeholders.

    As the PHP can be anywhere, i.e. also within a quoted string, it is best to use a simple unique string placeholder rather than trying to wrap it in an HTML comment (see php_sig).

    re.sub() can be given a function instead of a replacement string. Each time a substitution is made, the original PHP code is stored in a list (php_elements). The reverse is done afterwards: search for all instances of php_sig and replace each one with the next element from php_elements. If all goes well, php_elements should be empty at the end. If it is not, your modifications have removed a placeholder.

    from bs4 import BeautifulSoup
    import re
    
    html = """<html>
    <body>
    
    <?php 
    $stars = $this->getData('sideBarCoStars', []);
    
    if (!$stars) return;
    
    $sideBarCoStarsCount = $this->getData('sideBarCoStarsCount');
    $title = $this->getData('sideBarCoStarsTitle');
    $viewAllUrl = $this->getData('sideBarCoStarsViewAllUrl');
    $isDomain = $this->getData('isDomain');
    $lazy_load = $lazy_load ?? 0;
    $imageSrc = $this->getData('emptyImageData');
    ?>
    
    <header>
        <h3>
            <a href="<?php echo $viewAllUrl; ?>" class="noContentLink white">
            <?php echo "{$title} ({$sideBarCoStarsCount})"; ?>
            </a>
        </h3>
    
    </body>"""
    
    php_sig = '!!!PHP!!!'
    php_elements = []
    
    def php_remove(m):
        php_elements.append(m.group())
        return php_sig
    
    def php_add(m):
        return php_elements.pop(0)
    
    # Pre-parse HTML to remove all PHP elements
    html = re.sub(r'<\?php.*?\?>', php_remove, html, flags=re.S)
    
    soup = BeautifulSoup(html, "html.parser")
    
    # Make modifications to the soup
    # Do not remove any elements containing PHP elements
    
    # Post-parse HTML to replace the PHP elements
    html = re.sub(re.escape(php_sig), php_add, soup.prettify())
    
    print(html)
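
One possible variation on the approach above, offered as a sketch rather than a drop-in replacement: give each PHP block its own numbered placeholder, so restoration does not depend on the order in which placeholders appear and a dropped placeholder can be reported by index. The names protect_php and restore_php and the PHPBLOCK<n>X placeholder format are illustrative assumptions, not part of the original answer or of any library. The pattern also covers the short echo tag <?= and a final block with no closing ?>, both of which PHP allows. It is a regular expression, not a PHP lexer, so a ?> inside a PHP string literal would still end the match early.

```python
import re

# Matches <?php ... ?> and the short echo form <?= ... ?>. A final block with
# no closing ?> (legal at the end of a PHP file) is matched up to end of input.
PHP_BLOCK = re.compile(r"<\?(?:php\b|=).*?(?:\?>|\Z)", re.S)

# Hypothetical placeholder format; choose anything that cannot occur in your files.
PLACEHOLDER = "PHPBLOCK{}X"
PLACEHOLDER_RE = re.compile(r"PHPBLOCK(\d+)X")


def protect_php(source):
    """Replace each PHP block with a numbered placeholder.

    Returns the protected text and the list of original blocks.
    """
    blocks = []

    def stash(match):
        blocks.append(match.group())
        return PLACEHOLDER.format(len(blocks) - 1)

    return PHP_BLOCK.sub(stash, source), blocks


def restore_php(text, blocks):
    """Put the original PHP blocks back and list the indexes that went missing."""
    seen = set()

    def unstash(match):
        index = int(match.group(1))
        seen.add(index)
        return blocks[index]

    restored = PLACEHOLDER_RE.sub(unstash, text)
    missing = sorted(set(range(len(blocks))) - seen)
    return restored, missing


if __name__ == "__main__":
    sample = '<a href="<?php echo $url; ?>"><?= $title ?></a>'
    protected, blocks = protect_php(sample)
    print(protected)
    # Parse `protected` with BeautifulSoup here, edit the soup, then pass
    # str(soup) or soup.prettify() to restore_php().
    restored, missing = restore_php(protected, blocks)
    print(restored == sample, missing)
```

If restore_php returns a non-empty missing list, one of the edits to the soup removed an element that contained PHP, which is the same failure the leftover entries in php_elements signal in the original code.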
    