I want to parse various web pages so that I can build an inverted index. I only want to read the page text, not the <a> elements, menus, etc. Is it possible to do this? Here is what I have so far:
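<?php
// Fetch the page; CURLOPT_RETURNTRANSFER makes curl_exec() return the body as a string
$ch = curl_init("http://en.wikipedia.org/wiki/Agile_software_development");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow the http -> https redirect
$html = curl_exec($ch);

$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings about malformed HTML
$bodies = $dom->getElementsByTagName("body");
echo "<br>";
foreach ($bodies as $body) {
    // The <a> elements inside the body, i.e. the part I do NOT want
    $anchors = $body->getElementsByTagName("a");
    $anchorCount = $anchors->length;
    // Prints all text in the body, still including link and menu text
    echo $body->nodeValue;
    echo "<br>";
}
?>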
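
One possible direction, as a minimal sketch rather than a definitive solution: remove the unwanted elements from the DOM first, then read the text that is left. The function name extractVisibleText and the list of tags to strip (a, nav, script, style) are assumptions made for illustration; which elements count as "menus" depends on the markup of each site.

```php
<?php
// Hypothetical helper: returns the body text of an HTML string with
// <a>, <nav>, <script> and <style> elements removed.
function extractVisibleText(string $html): string
{
    // Collect libxml parse warnings internally instead of emitting them
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The XPath result is a static list, so removing nodes while looping is safe
    $xpath = new DOMXPath($dom);
    $unwanted = $xpath->query('//a | //nav | //script | //style');
    foreach (iterator_to_array($unwanted) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Collapse runs of whitespace so the text is easier to tokenize later
    return trim(preg_replace('/\s+/u', ' ', $body->textContent));
}

$sample = '<html><body><nav><a href="/">Home</a></nav>'
        . '<p>Agile methods <a href="/x">link</a> matter.</p>'
        . '<script>var x = 1;</script></body></html>';
echo extractVisibleText($sample); // prints "Agile methods matter."
```

Note that removing <a> elements also drops the link text itself, which on a page like Wikipedia is a large share of the article wording. If that text should be indexed, leave //a out of the XPath expression and strip only the menu containers.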