Get the innerHTML of an element, but not the element itself

I am working on extracting the data from a two-column table. The first column is the variable name and the second column is the data for that variable.

I have this almost working, but some data may contain HTML and is often wrapped in a DIV. I want to get the HTML inside the DIV, but not the DIV itself. I know regex might be a solution, but I'd like to better understand DOMDocument.

This is the code I have so far:

private function readHtml()
{

    $url = "https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml";

    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    $htmlData = curl_exec($curl);
    curl_close($curl);

    $dom        = new \DOMDocument();
    $html       = $dom->loadHTML($htmlData);
    $dom->preserveWhiteSpace = false;

    $tables     = $dom->getElementsByTagName('table');
    $rows       = $tables->item(0)->getElementsByTagName('tr');
    $cols       = $rows->item(1)->getElementsByTagName('td');

    $table = [];
    $key = null;
    $value = null;

    foreach ($rows as $i => $row){

        //skip the heading columns
        if($i <= 1 ) continue;

        $cols = $row->getElementsByTagName('td');

        foreach ($cols as $count => $node) {

            if($count == 0) {

                $key = strtolower(str_replace(' ', '_',$node->textContent));

            } else {

               $htmlNode = $node->getElementsByTagName('div');

                if($htmlNode->length >=1) {

                    $innerHTML= '';

                    foreach ($htmlNode as $innerNode) {

                        $innerHTML .= $innerNode->ownerDocument->saveHTML( $innerNode );
                    }

                    $value = $innerHTML;

                } else {

                    $value = $node->textContent;
                }
            }
        }

        $table[$key] = $value;
    }

    return $table;
}

My output is correct, but I'd like to not include the wrapper DIV of the data that contains HTML:

    Array
    (
        [type] => raw
        [direction] => north
        [intro] => Welcome to the test. 
        [html_body] => <div class="softmerge-inner" style="width: 5653px; left: -1px;">Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut <span style="font-weight:bold;">aliquip</span> ex ea commodo consequat. Duis aute irure dolor in <span style="text-decoration:underline;">reprehenderit</span> in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, <span style="font-style:italic;">sunt in</span> culpa qui officia deserunt mollit anim id est laborum.</div>
        [count] => 1003
    )

UPDATE

Based on some feedback and ideas in the answers, this is the current iteration of the function, which is slimmer and returns the desired output. I don't feel too good about the double regex, but it's working.

private function readHtml()
{

    # the url given in your example
    $url = "https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml";

    $dom = new \DOMDocument();
    $dom->loadHTMLFile($url);
    $dom->preserveWhiteSpace = false;

    $tables     = $dom->getElementsByTagName('table');
    $rows       = $tables->item(0)->getElementsByTagName('tr');
    $cols       = $rows->item(1)->getElementsByTagName('td');

    $table = [];
    $key = null;
    $value = null;

    foreach ($rows as $i => $row){

        //skip the heading columns
        if($i <= 1 ) continue;

        $cols = $row->getElementsByTagName('td');

        foreach ($cols as $count => $node) {

            if($count == 0) {

                $key = strtolower(str_replace(' ', '_',$node->textContent));

            } else {

                $value = $node->ownerDocument->saveHTML( $node );

                $value = preg_replace('/(<div.*?>|<\/div>)/','',$value);
                $value = preg_replace('/(<td.*?>|<\/td>)/','',$value);
            }
        }

        $table[$key] = $value;
    }

    return $table;
}

douyijin7741 2016-04-27 19:16
Use preg_replace! Like this:

$table['html_body']=preg_replace('/(<div.*?>|<\/div>)/','',$table['html_body']);
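
To see what that pattern does, here is a minimal sketch run against a shortened, hypothetical copy of the `html_body` cell from the question (the `$cell` and `$inner` names are only for illustration). Note that the pattern removes every `<div ...>` and `</div>` tag in the string, including nested ones, not just the outer wrapper.

```php
<?php
// Hypothetical sample cell, shortened from the html_body value in the question.
$cell = '<div class="softmerge-inner" style="width: 5653px; left: -1px;">'
      . 'Lorem <span style="font-weight:bold;">aliquip</span> ex</div>';

// Strip every opening <div ...> tag and every closing </div> tag.
// The lazy .*? stops at the first ">" after "<div".
$inner = preg_replace('/(<div.*?>|<\/div>)/', '', $cell);

echo $inner, PHP_EOL;
// Lorem <span style="font-weight:bold;">aliquip</span> ex
```

The `<span>` markup inside the cell is left untouched because the pattern only targets `div` tags.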

See here for preg_replace. See here for regex usage.

OR! You could use simple_html_dom.php like this:

<?php
include 'simple_html_dom.php'; // <--- Must download to current directory

$url = 'https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml';
$html = file_get_html( $url );

foreach ( $html->find( "div[class=softmerge-inner]" ) as $element ) {
    echo $element->innertext; // See http://simplehtmldom.sourceforge.net/manual.htm for usage
}
?>
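
Since the question asks specifically about DOMDocument, a third option that needs neither regex nor an extra library is to serialize each child of the DIV instead of the DIV itself. The sketch below is only an illustration: `innerHTML()` is a hypothetical helper, not a built-in DOMDocument method, and the sample markup is a shortened stand-in for one table cell from the spreadsheet.

```php
<?php
// Hypothetical helper (not part of DOMDocument): serialize only the
// children of a node, which leaves out the node's own opening and closing tags.
function innerHTML(\DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument serializes that node and its subtree.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Shortened stand-in for one cell of the published spreadsheet.
$dom = new \DOMDocument();
$dom->loadHTML(
    '<table><tr><td><div class="softmerge-inner">'
    . 'Lorem <span style="font-weight:bold;">aliquip</span> ex'
    . '</div></td></tr></table>'
);

$div = $dom->getElementsByTagName('div')->item(0);
echo innerHTML($div), PHP_EOL;
```

In the question's loop, the same idea would mean calling the helper on each `$innerNode` instead of passing `$innerNode` itself to `saveHTML()`.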
This answer was selected as the best answer by the asker.