I am trying to build a web scraping script that works like feed43.com. Details: I have HTML like the following.
<div id="latest_header" onclick="getNews('79');">
<img src="home_images/arrow.gif"> 2 DAY SEMINAR <br> <label id="news_pagedesp"><img src="home_images/li_desp.gif">NATIONAL SEMINAR..</label><label id="date_label">13th August 2014</label></div>
<div id="latest_header" onclick="getNews('78');">
<img src="home_images/arrow.gif"> 2 DAYS WORKSHOP <br> <label id="news_pagedesp"><img src="home_images/li_desp.gif">INTERNATIONAL WOR..</label><label id="date_label">8th August 2014</label></div>
I write a pattern like the following:
<div id="latest_header"{*}getNews('{%}'){*} {%}<br>{*}.gif">{%}..</label>
The result should follow these rules:
{*} - ignore everything
{%} - use this as a value for a variable
That is, the result should be every occurrence of the given pattern. In the above case:
{%1} - 79 {%2} - 2 DAY SEMINAR {%3} - NATIONAL SEMINAR
{%1} - 78 {%2} - 2 DAYS WORKSHOP {%3} - INTERNATIONAL WOR
I wasn't able to implement this with regular expressions, and I have read in many places that it is not feasible to parse HTML pages with them. I moved to simple_html_dom, but had no luck getting the above done in such a simple way. At least, I wasn't able to reproduce the above behaviour with it.
The placeholders {*} and {%} are the ones used to build a pattern when creating a feed for a website on feed43.com.
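For illustration, the matching rules above could be sketched roughly as follows. This is a minimal sketch in Python, not a description of how feed43.com works internally: it assumes each {*} and {%} can be treated as a non-greedy wildcard, and the function names pattern_to_regex and extract are hypothetical. The same regular expression syntax is available in PHP through preg_match_all. Note that the pattern is tightened slightly to {*}.gif">{%}<br>, because with plain non-greedy matching the original {*} {%}<br> would start capturing at the first space, which is inside the <img> tag.

```python
import re


def pattern_to_regex(pattern):
    """Translate a feed43-style pattern into a compiled regular expression."""
    # Split on the two placeholders, keeping them in the resulting list.
    parts = re.split(r"(\{\*\}|\{%\})", pattern)
    pieces = []
    for part in parts:
        if part == "{*}":
            pieces.append(".*?")            # skip anything, as little as possible
        elif part == "{%}":
            pieces.append("(.*?)")          # capture anything, as little as possible
        else:
            pieces.append(re.escape(part))  # literal text must match exactly
    # DOTALL lets the placeholders span line breaks in the HTML.
    return re.compile("".join(pieces), re.DOTALL)


def extract(pattern, html):
    """Return one tuple of captured values per occurrence of the pattern."""
    regex = pattern_to_regex(pattern)
    # Surrounding whitespace is trimmed from every captured value.
    return [tuple(v.strip() for v in m.groups()) for m in regex.finditer(html)]


html = """<div id="latest_header" onclick="getNews('79');">
<img src="home_images/arrow.gif"> 2 DAY SEMINAR <br> <label id="news_pagedesp"><img src="home_images/li_desp.gif">NATIONAL SEMINAR..</label><label id="date_label">13th August 2014</label></div>
<div id="latest_header" onclick="getNews('78');">
<img src="home_images/arrow.gif"> 2 DAYS WORKSHOP <br> <label id="news_pagedesp"><img src="home_images/li_desp.gif">INTERNATIONAL WOR..</label><label id="date_label">8th August 2014</label></div>
"""

# Tightened variant of the pattern from the question (an assumption, see above).
pattern = """<div id="latest_header"{*}getNews('{%}'){*}.gif">{%}<br>{*}.gif">{%}..</label>"""

for row in extract(pattern, html):
    print(row)
```

On the sample HTML this prints ('79', '2 DAY SEMINAR', 'NATIONAL SEMINAR') and ('78', '2 DAYS WORKSHOP', 'INTERNATIONAL WOR'). The approach depends on the literal text between placeholders appearing exactly as written in the page, so it is brittle against markup changes.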