Need help with regular expressions in PHP

I am trying to index some content from a series of .html files that share the same format.

So I get a lot of lines like this: <a href="meh">[18] blah blah blah < a...

The idea is to extract the number (18) and the text next to it (blah...). Furthermore, I know that every qualifying line will start with "> and end with either <a or </p. The issue stems from the need to keep all other HTML tags as part of the text (<i>, <u>, etc.).

So then I have something like this:

$docString = file_get_contents("http://whatever.com/some.htm");
$regex="/\">\ [(.*?)\ ] (<\/a>)(.) *?(<)/";
preg_match_all($regex,$docString,$match);

Let's look at $regex for a sec. Ignore its spaces; I only put them there because otherwise some characters disappear. I specify that it will start with ">. Then I capture the number inside the []. Then I single out the </a>. So far so good.

At the end, I do a (.)*?(<). This is the turning point. By leaving the last bit, (<), like that, the text gets cut off as soon as an underline or italics tag is found. However, if I put (<a|</p) the resulting array ends up empty. I've tried changing that to only (<a), but it seems that 2 characters mess up the whole thing.

What can I do? I've been struggling with this all day.

douxu4610 2010-11-10 19:13
As you've found, using a regex to parse HTML is not very easy. This is because HTML is not particularly regular.

I suggest using an XML parser such as PHP's DOMDocument.

Create an object, then use the loadHTMLFile method to open the file. Extract your a tags with getElementsByTagName, and then read each one's content from its nodeValue property.

It might look like:

// Create a DOMDocument object
$html = new DOMDocument();

// Load the URL's contents into the DOM
$html->loadHTMLFile("http://whatever.com/some.htm");

// Make an array to hold the text
$anchors = array();

// Loop through the a tags and store them in an array
foreach ($html->getElementsByTagName('a') as $link) {
    $anchors[] = $link->nodeValue;
}
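Note that nodeValue returns plain text, so it would not by itself keep the <i> and <u> tags the question asks about, and the wanted text sits after each a tag rather than inside it. A minimal sketch of one way around this, assuming the markup looks like the hypothetical $sample string below (the real page may differ): for each numbered anchor, walk its following siblings up to the next a tag and serialize them with saveHTML.

```php
<?php
// Hypothetical sample standing in for the real page; its structure is an assumption.
$sample = '<p><a href="meh">[18]</a> blah <i>blah</i> blah '
        . '<a href="x">[19]</a> more <u>text</u></p>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($sample);
libxml_clear_errors();

$entries = array();
foreach ($doc->getElementsByTagName('a') as $link) {
    // Only handle anchors whose text is a bracketed number, e.g. "[18]".
    if (!preg_match('/^\[(\d+)\]$/', trim($link->nodeValue), $m)) {
        continue;
    }

    $html = '';
    // Walk the following siblings until the next <a> or the end of the parent.
    for ($node = $link->nextSibling; $node !== null; $node = $node->nextSibling) {
        if ($node->nodeName === 'a') {
            break;
        }
        // saveHTML($node) serializes a single node, keeping inline tags.
        $html .= $doc->saveHTML($node);
    }

    $entries[$m[1]] = trim($html);
}

print_r($entries);
```

The result is keyed by the bracketed number, with the inline markup left in place in each value.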

One alternative to this style of XML/HTML parser is phpQuery. The documentation on its page should do a good job of explaining how to extract the tags. If you know jQuery, the interface may seem more natural.
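If the format really is as fixed as the question describes, the original pattern can probably be repaired as well. Two likely causes of the empty result: with / as the delimiter, the unescaped / in </p ends the pattern early, so preg_match_all fails with a warning; and without the s modifier, . does not match newlines. The sketch below is an assumption-laden illustration, not a general HTML parser: it uses ~ as the delimiter and a lookahead for the terminator, and the $sample string is hypothetical.

```php
<?php
// Hypothetical sample standing in for the real page; its structure is an assumption.
$sample = '<p><a href="meh">[18]</a> blah <i>blah</i> blah '
        . '<a href="x">[19]</a> more <u>text</u></p>';

// "~" is the delimiter, so the "/" in "</a>" and "</p" needs no escaping.
// (\d+)           captures the number inside the brackets
// (.*?)           lazily captures the text, inline tags included
// (?=<a\b|</p)    stops before the next <a or </p without consuming it
// s modifier      lets "." match newlines, for entries spanning several lines
$regex = '~">\[(\d+)\]</a>(.*?)(?=<a\b|</p)~s';

preg_match_all($regex, $sample, $matches, PREG_SET_ORDER);

$entries = array();
foreach ($matches as $m) {
    $entries[$m[1]] = trim($m[2]);
}

print_r($entries);
```

Because the terminator is a lookahead rather than a consumed group, the next entry's opening tag is still available for the following match.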