preg_match with different tags [closed]

I'm in need of some assistance. I'm trying to scrape some specific data from a website.

<tbody>
    <tr style="mso-yfti-irow: 1;">
        <td style="width: 184.4pt; border: none; border-left: solid windowtext 1.5pt; padding: 0cm 5.4pt 0cm 5.4pt;" valign="top" width="307">
            <p class="MsoNormal" style="margin-bottom: .0001pt; line-height: normal;">Certifikat springer 1000m</p>
        </td>

        <td style="width: 44.7pt; border: none; border-right: solid windowtext 1.5pt; padding: 0cm 5.4pt 0cm 5.4pt;" valign="top" width="75">
            <p class="MsoNormal" style="margin-bottom: .0001pt; text-align: right; line-height: normal;" align="right">90,-</p>
        </td>
    </tr>

    <tr style="mso-yfti-irow: 2;">
        <td style="width: 184.4pt; border: none; border-left: solid windowtext 1.5pt; padding: 0cm 5.4pt 0cm 5.4pt;" valign="top" width="307">
            <p class="MsoNormal" style="margin-bottom: .0001pt; line-height: normal;">Certifikat springer 1200m</p>
        </td>

        <td style="width: 44.7pt; border: none; border-right: solid windowtext 1.5pt; padding: 0cm 5.4pt 0cm 5.4pt;" valign="top" width="75">
            <p class="MsoNormal" style="margin-bottom: .0001pt; text-align: right; line-height: normal;" align="right">100,-</p>
        </td>
    </tr>   
</tbody>

What I want is to get "Certifikat springer 1000m" from the row with mso-yfti-irow: 1 and the "90,-" from the next td, but I don't want the data from the mso-yfti-irow: 2 row in this output.

I want to build something where people can compare prices for some activities in our sports group across different clubs, but I'm not really sure how to.

This is what I have for now, but I can't really get it to work:

    <?php 

    $file_string = file_get_contents('http://www.mfkviborg.dk/index.php?option=com_content&view=article&id=21&Itemid=151');

    preg_match_all('/<p class="MsoNormal" style="margin-bottom: .0001pt;(.*)">(.*)<\/p>/i', $file_string, $links);

    ?>

    <p><strong>Links:</strong> <em>(Name - Link)</em><br />
    <?php
    echo '<ol>';
    for($i = 0; $i < count($links[1]); $i++) {
        echo '<li>' . $links[2][$i] . ' - ' . $links[1][$i] . '</li>';
    }
    echo '</ol>';
    ?>
    </p>

Any clues?

douzhi6160 2018-01-21 18:09
A few issues:

The . does not match newlines unless you add the s modifier at the end of your regex, so that should be added.

The .* is greedy, so it matches as much as possible, including any intermediate </p>. It should not do that, so add a ? after it (in both cases).

Less of a problem, but still worth changing:

The first capture group probably does not give you useful information, so remove the parentheses there.

The . in .0001 is taken as "any character", so you should escape it. One way is to write it as [.]

This gives you this line of code:

    preg_match_all('/<p class="MsoNormal" style="margin-bottom: [.]0001pt;.*?">(.*?)<\/p>/is', $file_string, $links);
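
To make the greedy versus lazy difference concrete, here is a minimal sketch that runs the question's pattern and the corrected pattern on the same input. The `$html` string is a shortened, made-up sample modelled on the question's table with both cells on one line; it is not the live page.

```php
<?php
// Shortened sample modelled on the question's table (an assumption, not the live page).
// Both cells sit on one line so the greedy behaviour is visible even without the s modifier.
$html = '<td><p class="MsoNormal" style="margin-bottom: .0001pt; line-height: normal;">Certifikat springer 1000m</p></td>'
      . '<td><p class="MsoNormal" style="margin-bottom: .0001pt; text-align: right;" align="right">90,-</p></td>';

// Pattern from the question: each greedy .* runs to the last "> and the last </p>,
// so both paragraphs are swallowed by a single match.
$greedyCount = preg_match_all(
    '/<p class="MsoNormal" style="margin-bottom: .0001pt;(.*)">(.*)<\/p>/i',
    $html,
    $greedy
);

// Corrected pattern: lazy quantifiers, escaped dot, s modifier, one capture group.
$lazyCount = preg_match_all(
    '/<p class="MsoNormal" style="margin-bottom: [.]0001pt;.*?">(.*?)<\/p>/is',
    $html,
    $lazy
);

echo "greedy: $greedyCount match(es)\n";
print_r($greedy[2]);
echo "lazy: $lazyCount match(es)\n";
print_r($lazy[1]);
```

With this sample the greedy pattern finds one match whose second group is only the price, while the lazy pattern finds both paragraphs separately.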

Use DOM parser

Note that if your source HTML changes even slightly (extra spacing, double quotes changed to single quotes, attributes swapped around, ...), the regex will stop matching and you will have to adapt the code.

It is much better to use the DOMDocument interface together with a DOMXPath query. Here is how that could work:

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($file_string, LIBXML_NOCDATA | LIBXML_NOWARNING | LIBXML_NOERROR);
    libxml_use_internal_errors(false);
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query("//p[contains(@class, 'MsoNormal') and contains(@style, 'margin-bottom: .0001pt')]");
    foreach ($nodes as $node) {
        echo $node->textContent . " ";
    }
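
As a rough illustration of why the DOM route is more tolerant, the sketch below feeds both approaches a paragraph whose attributes are swapped and single-quoted. The `$html` string is a made-up variation of the question's markup, used here only as an assumption about how the page might change.

```php
<?php
// Made-up variation of the question's markup: single quotes, style before class.
$html = "<p style='margin-bottom: .0001pt; line-height: normal;' class='MsoNormal'>Certifikat springer 1000m</p>";

// The regex expects class="..." first and double quotes, so it finds nothing here.
$regexCount = preg_match_all(
    '/<p class="MsoNormal" style="margin-bottom: [.]0001pt;.*?">(.*?)<\/p>/is',
    $html,
    $m
);

// The DOM parser normalises quoting and attribute order before the query runs.
$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);

$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//p[contains(@class, 'MsoNormal') and contains(@style, 'margin-bottom: .0001pt')]");

$texts = [];
foreach ($nodes as $node) {
    $texts[] = $node->textContent;
}

echo "regex matches: $regexCount\n";
print_r($texts);
```

The XPath query still depends on the exact text inside the style attribute, so a change such as extra spaces inside the style value would need a looser test.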

Instead of fetching the page yourself and calling the loadHTML method, you can also use the loadHTMLFile method and pass the URL as the first argument.

Follow-up

You asked in comments to further filter the output by tr with mso-yfti-irow in the style attribute:

    $nodes = $xpath->query("//tr[contains(@style, 'mso-yfti-irow')]//p[contains(@class, 'MsoNormal') and contains(@style, 'margin-bottom: .0001pt')]");
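
The query above still matches every row, since each tr carries mso-yfti-irow in its style. To get only the first row, as the question asks, one option is to include the row number in the contains() test and read the td cells of that row. A minimal sketch, assuming the shortened sample table below stands in for the live page:

```php
<?php
// Shortened sample of the question's table (an assumption; the live page may differ).
$file_string = <<<'HTML'
<table><tbody>
<tr style="mso-yfti-irow: 1;">
  <td><p class="MsoNormal" style="margin-bottom: .0001pt;">Certifikat springer 1000m</p></td>
  <td><p class="MsoNormal" style="margin-bottom: .0001pt; text-align: right;" align="right">90,-</p></td>
</tr>
<tr style="mso-yfti-irow: 2;">
  <td><p class="MsoNormal" style="margin-bottom: .0001pt;">Certifikat springer 1200m</p></td>
  <td><p class="MsoNormal" style="margin-bottom: .0001pt; text-align: right;" align="right">100,-</p></td>
</tr>
</tbody></table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($file_string);
libxml_clear_errors();
libxml_use_internal_errors(false);

$xpath = new DOMXPath($doc);

// The trailing semicolon keeps "mso-yfti-irow: 10;", "11;" and so on from matching.
$cells = $xpath->query("//tr[contains(@style, 'mso-yfti-irow: 1;')]/td");

$row = [];
foreach ($cells as $cell) {
    $row[] = trim($cell->textContent);   // first cell is the name, second the price
}

echo implode(' - ', $row), "\n";
```

This relies on the style attribute being written exactly as "mso-yfti-irow: 1;". If the spacing varies on the real page, selecting the row by position, for example (//tr)[1], may be a sturdier choice.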

This answer was selected as the best answer by the asker.
