I've developed a web scraper on one server, where it works and does what I want it to do. Now I have to implement it in another environment, and I've run into an issue that I did not have during development and that I am having a hard time identifying.
The only real error I have to go on is (from JS console):
POST http://my.cool.page/pro/company/scrape 502 (Bad Gateway)
The development server (where it works) is using PHP 5.4.16; the implementation server is on PHP 5.4.45. I am using the same versions of external code on both servers.
The circumstances for launching the scraper are a bit different in implementation: it is now loaded through Ajax rather than as its own page.
The ajax call:
$("#showScraperButton").click(function(){
    $.post('/pro/company/scrape',
        {
            'url': url
        },
        function(result){
            //code...
        }
    );
});
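To get more than the bare status code in the console, a failure callback could be attached to the same request. This is only a minimal sketch: `describeAjaxFailure` is a hypothetical helper that is not part of my code, and it assumes a jQuery version where `$.post` returns a jqXHR object with `.fail()` (1.5 or later). Since a 502 usually comes from the gateway or proxy in front of PHP, the logged body may well be the gateway's error page rather than a PHP error message.

```javascript
// Hypothetical helper (not in the original code): turns a failed
// jqXHR-like object into one readable line for the console.
function describeAjaxFailure(jqXHR, maxLength) {
    var limit = maxLength || 500; // cap the logged body length
    var body = String(jqXHR.responseText || '');
    if (body.length > limit) {
        body = body.slice(0, limit) + '...';
    }
    return jqXHR.status + ' ' + jqXHR.statusText + ': ' + body;
}

// Variant of the click handler with a failure callback.
// Only wired up where jQuery is actually loaded.
if (typeof $ !== 'undefined') {
    $("#showScraperButton").click(function () {
        $.post('/pro/company/scrape', { 'url': url }, function (result) {
            // code...
        }).fail(function (jqXHR) {
            // Log status, status text and (truncated) response body.
            console.log(describeAjaxFailure(jqXHR));
        });
    });
}
```

With this in place, the console would show the status line followed by whatever body the server or gateway returned, which may or may not contain a hint about the underlying failure.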
Function + case for scraping anchor tags, using Fabpot/Goutte:
function _getTagContent($crawler = '', $toScrape = '', $contentPatterns = '')
{
    $tagContent = array();
    ChromePhp::log("Hello _getTagContent");
    foreach($toScrape as $tag) {
        $i = 0;
        switch ($tag) {
            case 'a':
                $n = $i;
                $crawler->filter($tag)->each(
                    function ($node) use(&$tagContent, &$n, &$tag, &$crawler)
                    {
                        $nodeText = trim($node->text());
                        $tagContent[$tag][$n]['value'] = $nodeText;
                        $linksCrawler = $crawler->selectLink($nodeText);
                        try {
                            $link = $linksCrawler->link();
                            $magicDidHappen = true;
                        }
                        catch(Exception $e) {
                            $magicDidHappen = false;
                        }
                        if ($magicDidHappen) {
                            $uri = $link->getUri();
                        }
                        else {
                            $uri = $node->attr('href');
                        }
                        $tagContent[$tag][$n]['uri'] = $uri;
                        $n++;
                    });
                break;
            default:
                break;
        }
    }
    return $tagContent;
}
This results in the error described above.
By commenting out each line in the case, I found that the error does not appear until $n++; is included. If $n++; is NOT included, the final a element is indeed present in $tagContent (without the increment, every iteration writes to the same index).
This led me to believe that incrementing the counter is the problem in this case, and that the code otherwise does not throw errors. I then tried a different HTML tag, using similar syntax:
case 'h3':
    $n = $i;
    $crawler->filter($tag)->each(
        function ($node) use(&$tagContent, &$n, &$tag)
        {
            $tagContent[$tag][$n] = trim($node->text());
            $n++;
        });
    break;
However, this works as intended, giving me all 40 instances of h3 on the page I'm scraping.
From this I have some questions: Could it be related to the PHP versions? Is there a way to print the "standard" PHP errors when doing Ajax calls (instead of, or in addition to, HTTP response codes)? I'm sure there is a hint to be found there as to what is failing. Thanks for any help!