drb56625 2014-06-11 12:12

Parsing nested HTML for links and information

I am trying to parse a website (files.minecraftforge.net) and grab the download links along with information such as the version and build time for each one. I am using the Simple HTML DOM Parser and it's been working great so far, but even after going through the documentation I can't fully work out how to do this.

Each table row has 5 TDs. I need to grab the data from the first 4 (Promotion, Version, Minecraft, Time) in addition to the data I am already collecting from the URLs. The following code grabs the URL and title (innertext) of each link, but how do I also grab the td information for the same row?

I think the best approach would be to use foreach() to loop over the rows, then a nested foreach over each td inside that tr. Unfortunately, I can't figure out how to run a foreach on what is returned from $html->find().

foreach($html->find('table#promotions_table a') as $e)
{
    echo $e->innertext . '<br>';
    echo $e->href . '<br>';
}

A snippet of the HTML that I am trying to parse looks like this:

  <table border="0" id="promotions_table">
    <tr>
      <th>Promotion</th>
      <th>Version</th>
      <th>Minecraft</th>
      <th>Time</th>
      <th>Downloads</th>
    </tr>
    <tr>
      <td>1.6.4-Latest</td>
      <td>9.11.1.965</td>
      <td>1.6.4</td>
      <td>11/21/2013 02:31:00 PM</td>
       <td>
      (<a href="http://files.minecraftforge.net/maven/net/minecraftforge/forge/1.6.4-9.11.1.965/forge-1.6.4-9.11.1.965-changelog.txt">Changelog</a>)
      (<a href="http://files.minecraftforge.net/maven/net/minecraftforge/forge/1.6.4-9.11.1.965/forge-1.6.4-9.11.1.965-installer.jar">Installer</a>)
      <a href="http://files.minecraftforge.net/maven/net/minecraftforge/forge/1.6.4-9.11.1.965/forge-1.6.4-9.11.1.965-installer.jar">*</a>
      (<a href="http://files.minecraftforge.net/maven/net/minecraftforge/forge/1.6.4-9.11.1.965/forge-1.6.4-9.11.1.965-javadoc.zip">Javadoc</a>)
      <a href="http://files.minecraftforge.net/maven/net/minecraftforge/forge/1.6.4-9.11.1.965/forge-1.6.4-9.11.1.965-javadoc.zip">*</a>
      (<a href="http://files.minecraftforge.net/maven/net/minecraftforge/forge/1.6.4-9.11.1.965/forge-1.6.4-9.11.1.965-src.zip">Src</a>)
      <a href="http://files.minecraftforge.net/maven/net/minecraftforge/forge/1.6.4-9.11.1.965/forge-1.6.4-9.11.1.965-src.zip">*</a>
      (<a href="http://files.minecraftforge.net/maven/net/minecraftforge/forge/1.6.4-9.11.1.965/forge-1.6.4-9.11.1.965-universal.jar">Universal</a>)
      <a href="http://files.minecraftforge.net/maven/net/minecraftforge/forge/1.6.4-9.11.1.965/forge-1.6.4-9.11.1.965-universal.jar">*</a>
      </td>
    </tr>
    <tr>
      <td>1.6.4-Recommended</td>
      <td>9.11.1.965</td>
      <td>1.6.4</td>
      <td>11/21/2013 02:31:00 PM</td>
       <td>
      (<a href="http://files.minecraftforge.net/maven/net/minecraftforge/forge/1.6.4-9.11.1.965/forge-1.6.4-9.11.1.965-changelog.txt">Changelog</a>)
      (<a href="http://files.minecraftforge.net/maven/net/minecraftforge/forge/1.6.4-9.11.1.965/forge-1.6.4-9.11.1.965-installer.jar">Installer</a>)
      <a href="http://files.minecraftforge.net/maven/net/minecraftforge/forge/1.6.4-9.11.1.965/forge-1.6.4-9.11.1.965-installer.jar">*</a>
      (<a href="http://files.minecraftforge.net/maven/net/minecraftforge/forge/1.6.4-9.11.1.965/forge-1.6.4-9.11.1.965-javadoc.zip">Javadoc</a>)
      <a href="http://files.minecraftforge.net/maven/net/minecraftforge/forge/1.6.4-9.11.1.965/forge-1.6.4-9.11.1.965-javadoc.zip">*</a>
      (<a href="http://files.minecraftforge.net/maven/net/minecraftforge/forge/1.6.4-9.11.1.965/forge-1.6.4-9.11.1.965-src.zip">Src</a>)
      <a href="http://files.minecraftforge.net/maven/net/minecraftforge/forge/1.6.4-9.11.1.965/forge-1.6.4-9.11.1.965-src.zip">*</a>
      (<a href="http://files.minecraftforge.net/maven/net/minecraftforge/forge/1.6.4-9.11.1.965/forge-1.6.4-9.11.1.965-universal.jar">Universal</a>)
      <a href="http://files.minecraftforge.net/maven/net/minecraftforge/forge/1.6.4-9.11.1.965/forge-1.6.4-9.11.1.965-universal.jar">*</a>
      </td>
    </tr>

1 answer

  • dongpiao0731 2014-06-11 12:40

    I figured out how to do this with some further experimenting. Here is how I did it, in case anyone else runs into the same problem.

    foreach($html->find('table#promotions_table tr') as $tr)
    {
        $details = array();
        $count = 0;
    
        foreach ($tr->find('td') as $td)
        {
            switch ($count)
            {
                case 0:
                {
                    $details['title'] = $td->innertext;
                    echo "TITLE: " . $details['title'] . "<br>";
                    break;
                }
                case 1:
                {
                    $details['build'] = $td->innertext;
                    echo "BUILD: " . $details['build'] . "<br>";
                    break;
                }
                case 2:
                {
                    $details['version'] = $td->innertext;
                    echo "VERSION: " . $details['version'] . "<br>";
                    break;
                }
                case 3:
                {
                    $details['time'] = $td->innertext;
                    echo "TIME: " . $details['time'] . "<br>";
                    break;
                }
                case 4:
                {
                    foreach ($td->find('a') as $a)
                    {
                        if ($a->innertext == "Installer")
                        {
                            $url = $a->href;
    
                            // Strip the "adf.ly" redirect prefix from the URL, if one is present
                            if (preg_match("#https?://(www\.)?adf\.ly/\d+/(.*)#i", $url, $matches))
                            {
                                $url = $matches[2];
                            }
                            echo "URL: " . $url . "<br>";
    
                            // Store the cleaned URL rather than the raw href
                            $details['url'] = $url;
                        }
                    }
                    break;
                }
            }
    
            $count++;
        }
    }
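
For comparison, the same row-then-cell idea can be sketched with PHP's bundled DOM extension (DOMDocument and DOMXPath) instead of the Simple HTML DOM Parser used above. This is only a minimal sketch under a few assumptions: the function name parsePromotions() and the array keys are made up for illustration, the sample markup is a shortened copy of the snippet from the question, and rows with fewer than 5 td cells (such as the header row, which only has th cells) are simply skipped.

```php
<?php
// Sketch only: uses PHP's built-in DOM extension, not Simple HTML DOM.
// parsePromotions() and the array keys are hypothetical names.

function parsePromotions(string $html): array
{
    // Collect libxml warnings internally instead of printing them,
    // since real-world markup is rarely perfectly valid.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows = array();

    // Outer loop: every row of the promotions table.
    foreach ($xpath->query('//table[@id="promotions_table"]//tr') as $tr) {
        // Direct <td> children of this row; the header row has none.
        $cells = $xpath->query('./td', $tr);
        if ($cells->length < 5) {
            continue;
        }

        $row = array(
            'promotion' => trim($cells->item(0)->textContent),
            'version'   => trim($cells->item(1)->textContent),
            'minecraft' => trim($cells->item(2)->textContent),
            'time'      => trim($cells->item(3)->textContent),
            'links'     => array(),
        );

        // Inner loop: links in the fifth cell, keyed by their label.
        // The "*" anchors repeat the preceding link, so they are ignored.
        foreach ($xpath->query('.//a', $cells->item(4)) as $a) {
            $label = trim($a->textContent);
            if ($label !== '*') {
                $row['links'][$label] = $a->getAttribute('href');
            }
        }

        $rows[] = $row;
    }

    return $rows;
}

// Shortened copy of the markup from the question.
$sample = '<table border="0" id="promotions_table">
<tr><th>Promotion</th><th>Version</th><th>Minecraft</th><th>Time</th><th>Downloads</th></tr>
<tr><td>1.6.4-Latest</td><td>9.11.1.965</td><td>1.6.4</td><td>11/21/2013 02:31:00 PM</td>
<td>(<a href="http://files.minecraftforge.net/maven/net/minecraftforge/forge/1.6.4-9.11.1.965/forge-1.6.4-9.11.1.965-installer.jar">Installer</a>)
<a href="http://files.minecraftforge.net/maven/net/minecraftforge/forge/1.6.4-9.11.1.965/forge-1.6.4-9.11.1.965-installer.jar">*</a></td></tr>
</table>';

print_r(parsePromotions($sample));
```

Indexing the cells by position replaces the switch on a counter, and keying the links by label means the installer URL can be read as $row['links']['Installer'] without comparing innertext inside the loop. If the live page wraps links in an adf.ly redirect, the preg_match() step from the answer above would still be needed on each href.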
    
    Accepted by the asker as the best answer.
