I have configured allow_url_fopen=0 to prevent scraping tools. The setting is applied globally, and I do not allow it to be overridden in a local php.ini file. However, I have noticed that the flag can be bypassed if the scraping tool uses cURL. With the page copier function below, I successfully copied a page from a server configured with allow_url_fopen=0.
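For reference, this is how the directive is set (a minimal sketch; the php.ini path varies by installation). Note that allow_url_fopen is a PHP_INI_SYSTEM directive, so it cannot be changed at runtime with ini_set() or via .htaccess:

```ini
; php.ini (global) -- disable URL-aware fopen wrappers
; (affects fopen(), file_get_contents(), include with URLs, etc.,
;  but NOT the cURL extension)
allow_url_fopen = 0
```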
public function handle()
{
    try {
        if (ini_get('allow_url_fopen')) {
            Log::info('Flag allow_url_fopen is enabled');
            $html = new Htmldom('page_url_here');
        } else {
            Log::info('Flag allow_url_fopen is disabled, trying with cURL');
            $webpage = EventCron::get_web_page('page_url_here');
            $html = new Htmldom($webpage['content']);
        }
        /* Doing some magical stuff with the site content */
        $agenda = $html->find('div.articles', 0);
        Log::info('success');
    } catch (\Exception $e) {
        Log::error('Event Cron Error ' . $e->getMessage());
    }
}
public static function get_web_page($url, $cookiesIn = '')
{
    $options = array(
        CURLOPT_RETURNTRANSFER => true,  // return the transfer as a string
        CURLOPT_HEADER         => true,  // include response headers in the output
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_ENCODING       => "",    // accept all supported encodings
        CURLOPT_AUTOREFERER    => true,  // set Referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120,
        CURLOPT_TIMEOUT        => 120,
        CURLOPT_MAXREDIRS      => 10,
        CURLINFO_HEADER_OUT    => true,
        CURLOPT_SSL_VERIFYPEER => true,
        CURLOPT_HTTP_VERSION   => CURL_HTTP_VERSION_1_1,
        CURLOPT_COOKIE         => $cookiesIn,
    );

    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $rough_content = curl_exec($ch);
    $err    = curl_errno($ch);
    $errmsg = curl_error($ch);
    $header = curl_getinfo($ch);
    curl_close($ch);

    // Split the raw response into headers and body.
    $header_content = substr($rough_content, 0, $header['header_size']);
    $body_content   = trim(str_replace($header_content, '', $rough_content));

    // Collect cookies from the response headers.
    $pattern = "#Set-Cookie:\\s+(?<cookie>[^=]+=[^;]+)#m";
    preg_match_all($pattern, $header_content, $matches);
    $cookiesOut = implode("; ", $matches['cookie']);

    $page['errno']   = $err;
    $page['errmsg']  = $errmsg;
    $page['headers'] = $header_content;
    $page['content'] = $body_content;
    $page['cookies'] = $cookiesOut;

    return $page;
}
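The cookie extraction in get_web_page() can be checked in isolation; here is a minimal sketch using the same regex, where the raw header string is made up purely for illustration:

```php
<?php
// Hypothetical raw response headers, for illustration only.
$header_content = "HTTP/1.1 200 OK\r\n"
    . "Set-Cookie: session=abc123; Path=/; HttpOnly\r\n"
    . "Set-Cookie: lang=en; Path=/\r\n"
    . "Content-Type: text/html\r\n\r\n";

// Same pattern as in get_web_page(): capture "name=value" after each Set-Cookie.
$pattern = "#Set-Cookie:\\s+(?<cookie>[^=]+=[^;]+)#m";
preg_match_all($pattern, $header_content, $matches);
$cookiesOut = implode("; ", $matches['cookie']);

echo $cookiesOut; // session=abc123; lang=en
```

One thing to be aware of: with CURLOPT_FOLLOWLOCATION enabled, the header block can contain the headers of every redirect hop, so cookies set by intermediate responses are collected too.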
Now the question is: how can I prevent the page from being scraped by such a cron job? If there is no way to do so, then arguably this is a security gap in PHP. One alternative I found is to disable the cURL extension, but that is not a proper solution: some of my hosted projects require cURL, as it is one of the most widely used and popular libraries among web developers.