doutan3192 2018-01-21 17:55

preg_match with different tags [closed]

I'm in need of some assistance. I'm trying to scrape some specific data from a website.

<tbody>
    <tr style="mso-yfti-irow: 1;">
        <td style="width: 184.4pt; border: none; border-left: solid windowtext 1.5pt; padding: 0cm 5.4pt 0cm 5.4pt;" valign="top" width="307">
            <p class="MsoNormal" style="margin-bottom: .0001pt; line-height: normal;">Certifikat springer 1000m</p>
        </td>

        <td style="width: 44.7pt; border: none; border-right: solid windowtext 1.5pt; padding: 0cm 5.4pt 0cm 5.4pt;" valign="top" width="75">
            <p class="MsoNormal" style="margin-bottom: .0001pt; text-align: right; line-height: normal;" align="right">90,-</p>
        </td>
    </tr>

    <tr style="mso-yfti-irow: 2;">
        <td style="width: 184.4pt; border: none; border-left: solid windowtext 1.5pt; padding: 0cm 5.4pt 0cm 5.4pt;" valign="top" width="307">
            <p class="MsoNormal" style="margin-bottom: .0001pt; line-height: normal;">Certifikat springer 1200m</p>
        </td>

        <td style="width: 44.7pt; border: none; border-right: solid windowtext 1.5pt; padding: 0cm 5.4pt 0cm 5.4pt;" valign="top" width="75">
            <p class="MsoNormal" style="margin-bottom: .0001pt; text-align: right; line-height: normal;" align="right">100,-</p>
        </td>
    </tr>   
</tbody>

What I want is to get "Certifikat springer 1000m" from the row with mso-yfti-irow: 1 and the "90,-" from the next td, but I don't want the data from the mso-yfti-irow: 2 row in this output.

I want to build something where people can compare the prices of some activities in our sports group across different clubs. I'm not really sure how to go about it.

This is what I have so far, but I can't get it to work:

    <?php 

    $file_string = file_get_contents('http://www.mfkviborg.dk/index.php?option=com_content&view=article&id=21&Itemid=151');

    preg_match_all('/<p class="MsoNormal" style="margin-bottom: .0001pt;(.*)">(.*)<\/p>/i', $file_string, $links);

    ?>

    <p><strong>Links:</strong> <em>(Name - Link)</em><br />
    <?php
    echo '<ol>';
    for($i = 0; $i < count($links[1]); $i++) {
        echo '<li>' . $links[2][$i] . ' - ' . $links[1][$i] . '</li>';
    }
    echo '</ol>';
    ?>
    </p>

Any clues?


  • douzhi6160 2018-01-21 18:09

    A few issues:

    • The . does not match newlines unless you add the s modifier at the end of your regex, so add it.

    • The .* is greedy, so it matches as much as possible, including any intermediate </p>. To stop at the first possible match, make it lazy by adding a ? after it (in both places).

    Less of a problem, but still worth changing:

    • The first capture group probably does not give you useful information, so remove the parentheses there.

    • The . in .0001 matches any character, so you should escape it. One way is to write it as [.]

    This gives you this line of code:

    preg_match_all('/<p class="MsoNormal" style="margin-bottom: [.]0001pt;.*?">(.*?)<\/p>/is', 
                 $file_string, $links);
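
To see the effect of the two main fixes, here is a small self-contained sketch. The $html sample is made up for illustration (shortened from the markup in the question, with a line break placed inside the last paragraph). It runs the original pattern and the corrected one side by side; on this sample the original pattern should swallow the first paragraph into its first group and miss the multi-line one, while the corrected pattern should pick up all three paragraphs.

```php
<?php
// Hypothetical sample, shortened from the question's markup.
// The third paragraph contains a line break on purpose.
$html = '<p class="MsoNormal" style="margin-bottom: .0001pt; line-height: normal;">Certifikat springer 1000m</p>'
      . '<p class="MsoNormal" style="margin-bottom: .0001pt; text-align: right;" align="right">90,-</p>' . "\n"
      . '<p class="MsoNormal" style="margin-bottom: .0001pt; line-height: normal;">Certifikat' . "\n"
      . 'springer 1200m</p>';

// Original pattern from the question: greedy .* and no s modifier.
preg_match_all('/<p class="MsoNormal" style="margin-bottom: .0001pt;(.*)">(.*)<\/p>/i', $html, $greedy);

// Corrected pattern: lazy .*?, escaped dot, s modifier.
preg_match_all('/<p class="MsoNormal" style="margin-bottom: [.]0001pt;.*?">(.*?)<\/p>/is', $html, $lazy);

print_r($greedy[2]); // paragraph texts found by the original pattern
print_r($lazy[1]);   // paragraph texts found by the corrected pattern
```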
    

    Use a DOM parser

    Note that if the source HTML changes even slightly (extra spacing, single instead of double quotes, attributes in a different order, ...), the regular expression will stop matching and you will have to adapt the code.

    It is much better to use the DOMDocument interface together with a DOMXPath query. Here is how that could work:

    $doc = new DOMDocument();
    // Collect libxml parse errors internally instead of emitting PHP warnings
    libxml_use_internal_errors(true);
    $doc->loadHTML($file_string, LIBXML_NOCDATA | LIBXML_NOWARNING | LIBXML_NOERROR);
    libxml_clear_errors();
    libxml_use_internal_errors(false);
    $xpath = new DOMXPath($doc);
    // Select the <p> elements by class and by a fragment of the inline style
    $nodes = $xpath->query("//p[contains(@class, 'MsoNormal') and contains(@style, 'margin-bottom: .0001pt')]");
    foreach ($nodes as $node) {
        echo $node->textContent . "\n";
    }
    

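    Instead of fetching the page yourself and calling the loadHTML method, you can also call the loadHTMLFile method and pass the URL as its first argument (this requires allow_url_fopen to be enabled). The plain load method expects well-formed XML, so it is not a good fit for HTML pages.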
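
When loading straight from a file or URL, DOMDocument's HTML-aware loader is loadHTMLFile. A minimal sketch follows; it uses a temporary local file as a hypothetical stand-in for the remote page so that it runs without network access, and the one-paragraph document is made up for illustration.

```php
<?php
// Hypothetical stand-in for the remote page: a temporary local file.
// With allow_url_fopen enabled, a URL could be passed to loadHTMLFile instead.
$path = tempnam(sys_get_temp_dir(), 'mso');
file_put_contents(
    $path,
    '<html><body><p class="MsoNormal" style="margin-bottom: .0001pt;">Certifikat springer 1000m</p></body></html>'
);

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true); // keep parse warnings out of the output
$loaded = $doc->loadHTMLFile($path);
libxml_clear_errors();
libxml_use_internal_errors($previous);
unlink($path);

// Collect the text of every matching paragraph.
$xpath = new DOMXPath($doc);
$texts = [];
foreach ($xpath->query("//p[contains(@class, 'MsoNormal')]") as $node) {
    $texts[] = trim($node->textContent);
}
print_r($texts);
```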

    Follow-up

    You asked in the comments how to restrict the output further to paragraphs inside a tr that has mso-yfti-irow in its style attribute:

    $nodes = $xpath->query("//tr[contains(@style, 'mso-yfti-irow')]//p[contains(@class, 'MsoNormal') and contains(@style, 'margin-bottom: .0001pt')]");
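
The query above matches every row that carries mso-yfti-irow. To pick out only the first row and pair the name with its price, as the question asks, one possible sketch is shown below. It assumes the style attribute contains the exact text "mso-yfti-irow: 1;" (spacing and semicolon as in the question's markup); the $html sample is shortened from the question, and the variable names are only illustrative.

```php
<?php
// Sample rows shortened from the question's markup (inline styles trimmed).
$html = <<<'HTML'
<table><tbody>
<tr style="mso-yfti-irow: 1;">
<td><p class="MsoNormal" style="margin-bottom: .0001pt;">Certifikat springer 1000m</p></td>
<td><p class="MsoNormal" style="margin-bottom: .0001pt; text-align: right;">90,-</p></td>
</tr>
<tr style="mso-yfti-irow: 2;">
<td><p class="MsoNormal" style="margin-bottom: .0001pt;">Certifikat springer 1200m</p></td>
<td><p class="MsoNormal" style="margin-bottom: .0001pt; text-align: right;">100,-</p></td>
</tr>
</tbody></table>
HTML;

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
// The trailing semicolon keeps "irow: 1;" from also matching rows 10, 11, ...
$cells = $xpath->query("//tr[contains(@style, 'mso-yfti-irow: 1;')]/td");

// One entry per cell: first the activity name, then the price.
$row = [];
foreach ($cells as $cell) {
    $row[] = trim($cell->textContent);
}

[$name, $price] = $row;
echo $name . ' - ' . $price . "\n";
```

Matching on a substring of the style attribute is still fragile if the page's spacing changes; selecting the row by position, for example (//tr)[1], would be an alternative if the table layout is stable.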
    
    This answer was accepted by the asker as the best answer.