doubinchou4219 2016-04-27 18:37

Get the innerHTML of an element, but not the element itself

I am working on extracting the data from a 2 column table. The first column is the variable name and the second column is the data for that variable.

I have this almost working, but some data may contain HTML and is often wrapped in a DIV. I want to get the HTML inside the DIV, but not the DIV itself. I know regex might be a solution, but I'd like to better understand DOMDocument.

This is the code I have so far:

private function readHtml()
{

    $url = "https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml";

    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    $htmlData = curl_exec($curl);
    curl_close($curl);

    $dom        = new \DOMDocument();
    // Set before loading; assigning it after the document is parsed cannot affect parsing.
    $dom->preserveWhiteSpace = false;
    $dom->loadHTML($htmlData);

    $tables     = $dom->getElementsByTagName('table');
    $rows       = $tables->item(0)->getElementsByTagName('tr');

    $table = [];
    $key = null;
    $value = null;

    foreach ($rows as $i => $row){

        // skip the first two rows (indexes 0 and 1), which hold the headings
        if($i <= 1 ) continue;

        $cols = $row->getElementsByTagName('td');

        foreach ($cols as $count => $node) {

            if($count == 0) {

                $key = strtolower(str_replace(' ', '_',$node->textContent));

            } else {

                $htmlNode = $node->getElementsByTagName('div');

                if($htmlNode->length >=1) {

                    $innerHTML= '';

                    foreach ($htmlNode as $innerNode) {

                        $innerHTML .= $innerNode->ownerDocument->saveHTML( $innerNode );
                    }

                    $value = $innerHTML;

                } else {

                    $value = $node->textContent;
                }
            }
        }

        $table[$key] = $value;
    }

    return $table;
}

My output is correct, but I'd like to not include the wrapper DIV of the data that contains HTML:

    Array
    (
        [type] => raw
        [direction] => north
        [intro] => Welcome to the test. 
        [html_body] => <div class="softmerge-inner" style="width: 5653px; left: -1px;">Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut <span style="font-weight:bold;">aliquip</span> ex ea commodo consequat. Duis aute irure dolor in <span style="text-decoration:underline;">reprehenderit</span> in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, <span style="font-style:italic;">sunt in</span> culpa qui officia deserunt mollit anim id est laborum.</div>
        [count] => 1003
    )

UPDATE

Based on feedback and ideas in the answers, this is the current iteration of the function. It is slimmer and returns the desired output. I don't feel great about the double regex, but it's working.

private function readHtml()
{

    # the url given in your example
    $url = "https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml";

    $dom = new \DOMDocument();
    // Set before loading; assigning it after the document is parsed cannot affect parsing.
    $dom->preserveWhiteSpace = false;
    $dom->loadHTMLFile($url);

    $tables     = $dom->getElementsByTagName('table');
    $rows       = $tables->item(0)->getElementsByTagName('tr');

    $table = [];
    $key = null;
    $value = null;

    foreach ($rows as $i => $row){

        // skip the first two rows (indexes 0 and 1), which hold the headings
        if($i <= 1 ) continue;

        $cols = $row->getElementsByTagName('td');

        foreach ($cols as $count => $node) {

            if($count == 0) {

                $key = strtolower(str_replace(' ', '_',$node->textContent));

            } else {

                $value = $node->ownerDocument->saveHTML( $node );

                $value = preg_replace('/(<div.*?>|<\/div>)/','',$value);
                $value = preg_replace('/(<td.*?>|<\/td>)/','',$value);
            }
        }

        $table[$key] = $value;
    }

    return $table;
}
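
One regex-free way to get this with DOMDocument alone is to serialize each child of the wrapper element instead of the element itself. The sketch below is a minimal illustration, not a drop-in fix: `innerHtml()` is a hypothetical helper name, and the inline HTML string is a shortened stand-in for one cell of the published sheet so that it runs without a network call.

```php
<?php
// Hypothetical helper (not from the original post): serialize only the
// children of a node, which is what innerHTML means in the browser DOM.
function innerHtml(\DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML($child) serializes that child together with its subtree.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Assumed sample markup standing in for one table cell of the sheet.
$dom = new \DOMDocument();
$dom->loadHTML('<table><tr><td><div class="softmerge-inner">Lorem <span style="font-weight:bold;">ipsum</span> dolor</div></td></tr></table>');

// Take the wrapper DIV and print its contents without the DIV tags.
$div = $dom->getElementsByTagName('div')->item(0);
echo innerHtml($div), PHP_EOL;
```

Inside the loop above, the same idea would be `$value = innerHtml($node->getElementsByTagName('div')->item(0));` when the cell contains a DIV, assuming only the first wrapper DIV matters.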

2 answers

  • douyijin7741 2016-04-27 19:16

    Use preg_replace! Like this:

    $table['html_body']=preg_replace('/(<div.*?>|<\/div>)/','',$table['html_body']);
    

    See here for preg_replace. See here for regex usage.
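
To make the effect of that pattern concrete, here is a small sketch run against a shortened sample string shaped like the `html_body` value in the question (the sample text is an assumption, not the real sheet content). Note that the pattern is purely textual: it removes every opening and closing DIV tag it finds, nested ones included, and leaves all other markup alone.

```php
<?php
// Assumed sample value, shortened from the html_body entry in the question.
$value = '<div class="softmerge-inner" style="width: 5653px; left: -1px;">Lorem <span style="font-weight:bold;">ipsum</span></div>';

// Remove each opening <div ...> tag and each closing </div> tag; keep the rest.
$stripped = preg_replace('/(<div.*?>|<\/div>)/', '', $value);

echo $stripped, PHP_EOL; // prints: Lorem <span style="font-weight:bold;">ipsum</span>
```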


    OR! You could use simple_html_dom.php like this:

    <?php
    include 'simple_html_dom.php';//<--- Must download to current directory
    $url = 'https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml';
    $html = file_get_html( $url );
    foreach ( $html->find( "div[class=softmerge-inner]" ) as $element ) {
        echo $element->innertext;
        //See http://simplehtmldom.sourceforge.net/manual.htm for usage
    }
    ?>
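
If an extra library is not wanted, a roughly equivalent selection can be sketched with PHP's built-in DOMXPath. This is only an illustration under assumptions: the inline markup is made-up sample data, and the XPath expression matches DIVs whose `class` attribute is exactly `softmerge-inner`.

```php
<?php
// Assumed sample markup: one matching DIV and one that should be ignored.
$dom = new \DOMDocument();
$dom->loadHTML('<table><tr><td><div class="softmerge-inner">A <span>b</span></div></td><td><div class="other">skip</div></td></tr></table>');

$xpath   = new \DOMXPath($dom);
$results = [];

// Select every DIV whose class attribute equals "softmerge-inner".
foreach ($xpath->query('//div[@class="softmerge-inner"]') as $div) {
    $inner = '';
    // Serialize the children only, so the wrapper DIV itself is left out.
    foreach ($div->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
    $results[] = $inner;
}

print_r($results);
```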
    
    This answer was selected by the asker as the best answer.