dongshuo8756
2017-04-27 11:44

Exclude only the contents of script tags when retrieving the text of the body element with XPath

▼index.html

<body>

  I want to acquire only "text excluding HTML tag" included in this part.

  <script language="JavaScript" type="text/javascript">
      var foo = 42;
  </script>

</body>

I wrote the following code with DomCrawler, but because the result includes the contents of the script tag, I could not get the intended result.

<?php

$crawler->filterXPath('//body')->each(function (DomCrawler $node) use ($url) {
    $result = trim($node->text());
});


2 answers

  • dongpengyu1363 2017-04-27 11:54
    Accepted answer

    I suggest using DOMXPath, which lets you filter the content with a query. I am not sure how to do this with DomCrawler.

    <?php
    // To retrieve selected HTML data, try these DOMXPath examples:
    
    // Assumes test.html sits directly under the web server's document root.
    $file = $_SERVER['DOCUMENT_ROOT'] . '/test.html';
    $doc = new DOMDocument();
    $doc->loadHTMLFile($file);
    
    $xpath = new DOMXPath($doc);
    
    // Example 1: every element that has an id attribute
    //$elements = $xpath->query("//*[@id]");
    
    // Example 2: script elements selected by an absolute path
    //$elements = $xpath->query("/html/body/script");
    
    // Example 3: same as above, with a wildcard for the parent element
    $elements = $xpath->query("/html/*/script");
    
    // query() returns false (not null) when the expression is invalid
    if ($elements !== false) {
      foreach ($elements as $element) {
        echo "<br/>[" . $element->nodeName . "]";
    
        // Print the value of each child node of the matched element
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
          echo $node->nodeValue . "\n";
        }
      }
    }
    ?>
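
The examples above select the script elements themselves. To get the body text without them, one possible approach is to invert the query and select only the text nodes that have no script ancestor. The following is a minimal sketch under that assumption; the helper name `bodyTextWithoutScripts` and the sample HTML string are illustrative and do not come from the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): returns the text of
// <body> while leaving out the contents of every <script> element.
function bodyTextWithoutScripts(string $html): string
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Select every text node under <body> that has no <script> ancestor.
    $textNodes = $xpath->query('//body//text()[not(ancestor::script)]');

    $parts = [];
    foreach ($textNodes as $textNode) {
        $parts[] = $textNode->nodeValue;
    }

    // Join the pieces and collapse the whitespace left behind by the markup.
    return trim(preg_replace('/\s+/', ' ', implode(' ', $parts)));
}

// Sample input, assumed for illustration only.
$html = '<body>Some text <script>var foo = 42;</script> and more</body>';
echo bodyTextWithoutScripts($html), "\n"; // prints "Some text and more"
```

Because the filtering happens in the XPath expression, the document itself is left unmodified, which may matter if the same DOM is queried again later.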
    
  • doufu8588 2017-04-27 11:58

    Give this a try:

    <?php
    
    $x = '<body>
    
      I want to acquire only "text excluding HTML tag" included in this part.
    
      <script language="JavaScript" type="text/javascript">
          var foo = 42;
      </script>
    
    </body>';
    
    $dom = new DOMDocument();
    $dom->loadHTML($x);
    $script = $dom->getElementsByTagName('script')->item(0);
    $script->parentNode->removeChild($script);
    $body = $dom->getElementsByTagName('body')->item(0);
    echo $body->nodeValue;
    

    Working example: https://3v4l.org/n2UQT
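
The snippet above removes only the first script element (`item(0)`). If the page may contain several, the same idea can be extended to remove all of them before reading the body text. This is a sketch under that assumption; the helper name `bodyTextAfterRemovingScripts` and the sample input are hypothetical.

```php
<?php
// Hypothetical helper (not from the original answer): strips every <script>
// element from the parsed document, then returns the text of <body>.
function bodyTextAfterRemovingScripts(string $html): string
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    // getElementsByTagName() returns a live list, so copy it to an array
    // before removing nodes; otherwise elements get skipped while iterating.
    foreach (iterator_to_array($dom->getElementsByTagName('script')) as $script) {
        $script->parentNode->removeChild($script);
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    return $body === null ? '' : trim($body->textContent);
}

// Sample input, assumed for illustration only.
$html = '<body>Keep this. <script>var a = 1;</script><script>var b = 2;</script></body>';
echo bodyTextAfterRemovingScripts($html), "\n"; // prints "Keep this."
```

Unlike an XPath filter on text nodes, this variant modifies the DOM in place, so it fits best when the script elements are not needed afterwards.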
