dt20081409 2010-08-20 19:41
Viewed 114 times
Accepted

Detect spiders or browsers with cookies enabled

Lots of spiders/crawlers visit our news site. We depend on GeoIP services to identify each visitor's physical location and serve them related content, so we developed a module with a module_init() function that sends the visitor's IP to MaxMind and stores the returned location in a cookie. To avoid sending a request on every page view, we first check whether the cookie is set, and only if it isn't do we query for the information and set the cookie. This works fine with regular clients, but not when a spider crawls the site: since crawlers don't keep cookies, every page view triggers a query to MaxMind, which gets expensive. We are looking for a way to identify crawlers or, if that's easier, legitimate browsers with cookies enabled, and to query MaxMind only when it's useful.
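
For reference, the flow described above looks roughly like this (a minimal sketch; mymodule_init(), geoip_lookup_city(), and the cookie name are assumptions, not the actual module code):

```php
function mymodule_init() {
  // Only query MaxMind when no location cookie is present yet.
  if (isset($_COOKIE['geo_location'])) {
    return;
  }
  // geoip_lookup_city() stands in for the real MaxMind client call;
  // assume it returns a location string, or FALSE on failure.
  $location = geoip_lookup_city($_SERVER['REMOTE_ADDR']);
  if ($location !== FALSE) {
    // Cache the result client-side so later page views cost nothing.
    setcookie('geo_location', $location, time() + 7 * 24 * 3600, '/');
  }
}
```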


3 answers

  • dongzou3751 2010-08-20 20:29

    Well, there isn't just one thing to do, to be honest. I'd suggest what I've done in the past to combat this same issue: use a browser-detection script (there are a ton of classes out there for detecting browsers), then check the browser against a DB of known browsers. If the browser is in your list, allow the call to the paid service; if not, use a "best guess" script.

    By this I mean something like this:

    Generic ip lookup class

    So, in the event that a browser type is not in your list, it won't use your paid service's DB; instead it uses this class, which can get as close as possible. This way you get the best of both worlds: bots aren't racking up hits on your IP service, and if a user does slip past your browser check for some reason, they'll most likely still get a correct location and the site will appear normal to them. The decision looks roughly like the sketch below.
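
    A rough sketch of that decision, under assumed names: is_known_browser() stands in for your known-browser DB check, geoip_lookup_city() for the paid MaxMind call, and GenericIpLookup for the free fallback class.

    ```php
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

    if (is_known_browser($ua)) {
      // Known real browser: spend a paid MaxMind query (cached in a cookie).
      $location = geoip_lookup_city($_SERVER['REMOTE_ADDR']);
    } else {
      // Unknown UA (likely a bot): use the free "best guess" class instead,
      // so crawlers never rack up hits on the paid service.
      $lookup = new GenericIpLookup();
      $location = $lookup->locate($_SERVER['REMOTE_ADDR']);
    }
    ```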

    I know this is a little jumpy; I just hope you get what I'm trying to say here.

    The real answer is that there is no easy or 100% right answer to this issue. I have done many sites with the same situation and have gone insane trying to figure it out, and this is as close to perfect as I have come. 99% of legit crawlers will identify themselves with a user-agent value like so:

    ```php
    // Legitimate crawlers identify themselves in the user agent, e.g.:
    // 'Googlebot', 'Yammybot', 'Openbot', 'Yahoo', etc.
    $ua = $_SERVER['HTTP_USER_AGENT'];
    ```

    A simple user-agent check, something like the one sketched below, will catch those; it's the shady crawlers that report themselves as IE6 or the like that slip through.
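
    A simple check along those lines (a sketch; the bot list is illustrative, not exhaustive):

    ```php
    function is_known_crawler($ua) {
      // Illustrative list; real crawler lists are much longer.
      $bots = array('googlebot', 'yammybot', 'openbot', 'yahoo', 'bingbot');
      $ua = strtolower($ua);
      foreach ($bots as $bot) {
        // Substring match is enough for UAs like 'Googlebot/2.1 (+http://...)'.
        if (strpos($ua, $bot) !== FALSE) {
          return TRUE;
        }
      }
      return FALSE;
    }
    ```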

    I really hope this helps. Like I said, there is no real answer here, at least none that I have found to be 100%. It's kind of like detecting whether a user is on a handheld these days: you can get 99% of the way there but never 100%, and it always works out that the client lives in that 1% that doesn't work, lol.

    This answer was accepted by the asker.