doubinchou4219 2016-04-27 18:37
浏览 51
已采纳

获取元素的innerHTML,但不是元素本身

I am working on extracting the data from a 2 column table. The first column is the variable name and the second column is the data for that variable.

I have this almost working, but some data may contain HTML and is often wrapped in a DIV. I want to get the HTML inside the DIV, but not the DIV itself. I know regex might be an solution, but I'd like to better understand DOMDocument.

This is the code I have so far:

private function readHtml()
{

    $url = "https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml";

    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    $htmlData = curl_exec($curl);
    curl_close($curl);

    $dom        = new \DOMDocument();
    $html       = $dom->loadHTML($htmlData);
    $dom->preserveWhiteSpace = false;

    $tables     = $dom->getElementsByTagName('table');
    $rows       = $tables->item(0)->getElementsByTagName('tr');
    $cols       = $rows->item(1)->getElementsByTagName('td');

    $table = [];
    $key = null;
    $value = null;

    foreach ($rows as $i => $row){

        //skip the heading columns
        if($i <= 1 ) continue;

        $cols = $row->getElementsByTagName('td');

        foreach ($cols as $count => $node) {

            if($count == 0) {

                $key = strtolower(str_replace(' ', '_',$node->textContent));

            } else {

               $htmlNode = $node->getElementsByTagName('div');

                if($htmlNode->length >=1) {

                    $innerHTML= '';

                    foreach ($htmlNode as $innerNode) {

                        $innerHTML .= $innerNode->ownerDocument->saveHTML( $innerNode );
                    }

                    $value = $innerHTML;

                } else {

                    $value = $node->textContent;
                }
            }
        }

        $table[$key] = $value;
    }

    return $table;
}

My output is correct, but I'd like to not include the wrapper DIV of the data that contains HTML:

    Array
    (
        [type] => raw
        [direction] => north
        [intro] => Welcome to the test. 
        [html_body] => <div class="softmerge-inner" style="width: 5653px; left: -1px;">Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut <span style="font-weight:bold;">aliquip</span> ex ea commodo consequat. Duis aute irure dolor in <span style="text-decoration:underline;">reprehenderit</span> in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, <span style="font-style:italic;">sunt in</span> culpa qui officia deserunt mollit anim id est laborum.</div>
        [count] => 1003
    )

UPDATE

Based on some feedback and ideas in the answers this is the current iteration of the function, which is slimmer and is returning the desired output. I don't feel too good about the double regex but its working.

private function readHtml()
{

    # the url given in your example
    $url = "https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml";

    $dom = new \DOMDocument();
    $dom->loadHTMLFile($url);
    $dom->preserveWhiteSpace = false;

    $tables     = $dom->getElementsByTagName('table');
    $rows       = $tables->item(0)->getElementsByTagName('tr');
    $cols       = $rows->item(1)->getElementsByTagName('td');

    $table = [];
    $key = null;
    $value = null;

    foreach ($rows as $i => $row){

        //skip the heading columns
        if($i <= 1 ) continue;

        $cols = $row->getElementsByTagName('td');

        foreach ($cols as $count => $node) {

            if($count == 0) {

                $key = strtolower(str_replace(' ', '_',$node->textContent));

            } else {

                $value = $node->ownerDocument->saveHTML( $node );

                $value = preg_replace('/(<div.*?>|<\/div>)/','',$value);
                $value = preg_replace('/(<td.*?>|<\/td>)/','',$value);
            }
        }

        $table[$key] = $value;
    }

    return $table;
}
  • 写回答

2条回答 默认 最新

  • douyijin7741 2016-04-27 19:16
    关注

    Use preg_replace! Like this:

    $table['html_body']=preg_replace('/(<div.*?>|<\/div>)/','',$table['html_body']);
    

    See here for preg_replace. See here for regex usage.


    OR! You could use simple_html_dom.php like this:

    <?php
    include 'simple_html_dom.php';//<--- Must download to current directory
    $url = 'https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml';
    $html = file_get_html( $url );
    foreach ( $html->find( "div[class=softmerge-inner]" ) as $element ) {
        echo $element->innertext;
        //See http://simplehtmldom.sourceforge.net/manual.htm for usage
    }
    ?>
    
    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论
查看更多回答(1条)

报告相同问题?

悬赏问题

  • ¥15 聚类分析或者python进行数据分析
  • ¥15 逻辑谓词和消解原理的运用
  • ¥15 三菱伺服电机按启动按钮有使能但不动作
  • ¥15 js,页面2返回页面1时定位进入的设备
  • ¥50 导入文件到网吧的电脑并且在重启之后不会被恢复
  • ¥15 (希望可以解决问题)ma和mb文件无法正常打开,打开后是空白,但是有正常内存占用,但可以在打开Maya应用程序后打开场景ma和mb格式。
  • ¥20 ML307A在使用AT命令连接EMQX平台的MQTT时被拒绝
  • ¥20 腾讯企业邮箱邮件可以恢复么
  • ¥15 有人知道怎么将自己的迁移策略布到edgecloudsim上使用吗?
  • ¥15 错误 LNK2001 无法解析的外部符号