douke3442 2014-09-22 12:03

Regular expression or web scraping in PHP to handle {*} and {%} placeholders

I am trying to build a web scraping script similar to feed43.com. I have HTML like the following:

<div id="latest_header" onclick="getNews('79');">
                <img src="home_images/arrow.gif">&nbsp;2 DAY SEMINAR <br> <label id="news_pagedesp"><img src="home_images/li_desp.gif">NATIONAL SEMINAR..</label><label id="date_label">13th August 2014</label></div>
<div id="latest_header" onclick="getNews('78');">
                <img src="home_images/arrow.gif">&nbsp;2 DAYS WORKSHOP <br> <label id="news_pagedesp"><img src="home_images/li_desp.gif">INTERNATIONAL WOR..</label><label id="date_label">8th August 2014</label></div>

I want to write a pattern like the following:

<div id="latest_header"{*}getNews('{%}'){*}&nbsp;{%}<br>{*}.gif">{%}..</label>

The pattern should be interpreted according to these rules:

{*} - ignore everything
{%} - use this as the value of a variable

That is, the result should contain every occurrence of the given pattern. In the case above:

{%1} - 79, {%2} - 2 DAY SEMINAR, {%3} - NATIONAL SEMINAR

{%1} - 78, {%2} - 2 DAYS WORKSHOP, {%3} - INTERNATIONAL WOR
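
For illustration, the pattern above can be translated by hand into a PCRE expression and run with preg_match_all. This is only a minimal sketch: it assumes {*} means "skip as little as possible" and {%} means "capture as little as possible", and the variable names ($html, $regex, $rows) are made up for the example.

```php
<?php
// Sample input copied from the question (two news entries).
$html = <<<'HTML'
<div id="latest_header" onclick="getNews('79');">
                <img src="home_images/arrow.gif">&nbsp;2 DAY SEMINAR <br> <label id="news_pagedesp"><img src="home_images/li_desp.gif">NATIONAL SEMINAR..</label><label id="date_label">13th August 2014</label></div>
<div id="latest_header" onclick="getNews('78');">
                <img src="home_images/arrow.gif">&nbsp;2 DAYS WORKSHOP <br> <label id="news_pagedesp"><img src="home_images/li_desp.gif">INTERNATIONAL WOR..</label><label id="date_label">8th August 2014</label></div>
HTML;

// Hand translation of the template (an assumption about its meaning):
//   {*} -> .*?    skipped, non-greedy
//   {%} -> (.*?)  captured, non-greedy
// The "s" modifier lets "." match line breaks; "~" is the delimiter.
$regex = '~<div id="latest_header".*?getNews\(\'(.*?)\'\).*?&nbsp;(.*?)<br>.*?\.gif">(.*?)\.\.</label>~s';

// PREG_SET_ORDER groups the captures per occurrence of the pattern.
preg_match_all($regex, $html, $matches, PREG_SET_ORDER);

$rows = [];
foreach ($matches as $m) {
    // Drop the full match at index 0 and trim stray whitespace.
    $rows[] = array_map('trim', array_slice($m, 1));
}

print_r($rows);
```

On the sample input this should give one row per div, each holding the three values listed above. Markup that differs from the sample (extra attributes, a different tag order) would need a different pattern, which is the usual weakness of regex-based scraping.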

I wasn't able to implement this with regular expressions, and I have read in many places that regular expressions are not a practical way to traverse HTML pages. I moved to simple_html_dom, but had no luck getting this done as easily. At least, I could not reproduce the behavior described above with it.

The placeholders {*} and {%} are the ones feed43.com uses to define an extraction pattern when creating a feed for a website.

2 answers

  • doufendi9063 2014-10-22 06:02

    This may be only loosely related, but the following open source project does what I wanted:

    hFeeds

    All I actually wanted was to be able to create RSS feeds for any web page, as Feed43.com does. hFeeds works just like Feed43.com and is as easy to use. The only difference is that it uses {h} in place of {%} and {i} in place of {*}. As far as I can see, it generates a regular expression from the pattern.
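
    To show roughly what such a translation could look like, here is a minimal sketch that turns a {*} / {%} template into a regular expression. It is not taken from hFeeds or Feed43; the function name templateToRegex and the choice of non-greedy matching are my own assumptions.

    ```php
    <?php
    // Hypothetical helper (not hFeeds or Feed43 code): converts a template
    // that uses {*} and {%} into a PCRE pattern.
    function templateToRegex(string $template): string
    {
        // Split on the two placeholders while keeping them in the result.
        $parts = preg_split('/(\{\*\}|\{%\})/', $template, -1, PREG_SPLIT_DELIM_CAPTURE);

        $regex = '';
        foreach ($parts as $part) {
            if ($part === '{*}') {
                $regex .= '.*?';                  // ignore, non-greedy
            } elseif ($part === '{%}') {
                $regex .= '(.*?)';                // capture, non-greedy
            } else {
                $regex .= preg_quote($part, '~'); // literal text, escaped
            }
        }

        // "s" lets "." match line breaks, since HTML often spans lines.
        return '~' . $regex . '~s';
    }

    $template = '<div id="latest_header"{*}getNews(\'{%}\'){*}&nbsp;{%}<br>{*}.gif">{%}..</label>';
    $pattern  = templateToRegex($template);

    // Shortened version of the first entry from the question.
    $html = '<div id="latest_header" onclick="getNews(\'79\');">' . "\n"
          . '<img src="home_images/arrow.gif">&nbsp;2 DAY SEMINAR <br> <label id="news_pagedesp">'
          . '<img src="home_images/li_desp.gif">NATIONAL SEMINAR..</label></div>';

    preg_match($pattern, $html, $m);

    // Index 0 is the full match; the rest are the {%} values in order.
    $values = array_map('trim', array_slice($m, 1));
    print_r($values);
    ```

    With {h} and {i} instead of {%} and {*}, only the two placeholder strings in the split pattern and the comparisons would change. Swapping preg_match for preg_match_all with PREG_SET_ORDER would collect every occurrence on the page rather than just the first.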

    Thanks, everyone, for your answers.

    This answer was selected by the asker as the best answer.