douaonong7807 2010-11-10 19:10

Need help with a regular expression in PHP

I am trying to index some content from a series of .html files that share the same format.

So I get a lot of lines like this: <a href="meh">[18] blah blah blah < a...

The idea is to extract the number (18) and the text next to it (blah...). I also know that every qualifying line starts with "> and ends with either <a or </p. The difficulty is that all other HTML tags (<i>, <u>, etc.) must be kept as part of the text.

So then I have something like this:

$docString = file_get_contents("http://whatever.com/some.htm");
$regex="/\">\ [(.*?)\ ] (<\/a>)(.) *?(<)/";
preg_match_all($regex,$docString,$match);

Let's look at $regex for a sec. Ignore its spaces; I only put them there because otherwise some characters disappear. I specify that it will start with ">. Then I capture the number inside the []. Then I single out the </a>. So far so good.

At the end, I do a (.)*?(<). This is the turning point. If I leave the last bit, (<), like that, the text is cut off as soon as an underline or italics tag is found. However, if I put (<a|</p) the resulting array ends up empty. I've tried changing that to only (<a), but it seems that 2 characters mess up the whole thing.

What can I do? I've been struggling with this all day.


3 answers

  • douxu4610 2010-11-10 19:13

    As you've found, using a regex to parse HTML is not very easy. This is because HTML is not particularly regular.

    I suggest using a DOM parser such as PHP's DOMDocument instead.

    Create an object, then use the loadHTMLFile method to open the file. Extract your <a> tags with getElementsByTagName, and then read each tag's text content from its nodeValue property.

    It might look like

    // Create a DOMDocument object
    $html = new DOMDocument();

    // Collect parser warnings internally; real-world HTML is rarely valid
    libxml_use_internal_errors(true);

    // Load the URL's contents into the DOM
    $html->loadHTMLFile("http://whatever.com/some.htm");

    // Make an array to hold the text
    $anchors = array();

    // Loop through the <a> tags and store each one's text content in the array
    foreach ($html->getElementsByTagName('a') as $link) {
        $anchors[] = $link->nodeValue;
    }
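
Note that nodeValue only returns the text inside the link itself and drops any tags. To get the number together with the text that follows the link, inline tags included, one option is to walk the link's sibling nodes and serialize them with saveHTML. The sketch below is only an illustration: the helper name extractNumberedEntries and the sample markup are made up, and it assumes each entry looks like <a href="...">[18]</a> followed by its text.

```php
<?php
// Hypothetical helper (not from the original post): for every <a> whose text
// looks like "[18]", collect the sibling nodes that follow it, up to the next
// <a> or the end of the parent element, keeping inline tags such as <i>.
function extractNumberedEntries(string $htmlSource): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world HTML is rarely valid
    $doc->loadHTML($htmlSource);
    libxml_clear_errors();

    $entries = [];
    foreach ($doc->getElementsByTagName('a') as $link) {
        // Only anchors whose text is a bracketed number qualify.
        if (!preg_match('/^\[(\d+)\]$/', trim($link->nodeValue), $m)) {
            continue;
        }
        $text = '';
        for ($node = $link->nextSibling; $node !== null; $node = $node->nextSibling) {
            if ($node->nodeName === 'a') {
                break;                      // the next link ends this entry
            }
            $text .= $doc->saveHTML($node); // serializes text and inline tags
        }
        $entries[(int) $m[1]] = trim($text);
    }
    return $entries;
}

// Assumed sample markup, modeled on the line quoted in the question.
$sample = '<p><a href="meh">[18]</a> blah <i>blah</i> blah <a href="x">next</a></p>';
print_r(extractNumberedEntries($sample));
```

The result is an array keyed by the number, so $entries[18] holds the text for entry 18. Non-ASCII characters may come back as entities from saveHTML, depending on how the document's encoding was detected.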
    

    One alternative to this style of XML/HTML parser is phpQuery. The documentation on its page should do a good job of explaining how to extract the tags. If you know jQuery, the interface may seem more natural.
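
If you would rather stay with the regex from the question, one likely reason that (<a|</p) produces an empty array is the unescaped / in </p: with / as the pattern delimiter, the pattern ends at that slash and preg_match_all fails. Below is a minimal sketch that uses # as the delimiter and a lookahead for the end of the text. The sample string is made up, and the pattern captures the number with \d+ rather than the original (.*?), so treat it as a starting point rather than a drop-in fix.

```php
<?php
// Sketch only: assumes each entry looks like
//   <a href="...">[18]</a> text with <i>inline</i> tags <a ...   (or </p)
// "#" is the delimiter, so the "/" in "</a>" and "</p" needs no escaping.
// The lookahead stops the lazy match at "<a" or "</p" without consuming it;
// the "s" modifier lets "." match newlines inside an entry.
$regex = '#">\[(\d+)\]</a>(.*?)(?=<a\b|</p)#s';

// Assumed sample input, modeled on the line quoted in the question.
$docString = '<p><a href="meh">[18]</a> blah <i>blah</i> blah '
           . '<a href="x">[19]</a> more <u>text</u></p>';

preg_match_all($regex, $docString, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] is the number, $m[2] the text with inline tags kept.
    echo $m[1], ' => ', trim($m[2]), "\n";
}
```

With PREG_SET_ORDER each element of $matches is one entry, which keeps the number and its text together.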
