dt20081409 2010-08-20 19:41
Viewed 114 times
Accepted

Detect spiders or browsers with cookies enabled

Lots of spiders/crawlers visit our news site. We depend on GeoIP services to identify each visitor's physical location and serve them related content, so we developed a module with a module_init() function that sends the visitor's IP to MaxMind and stores the returned location in a cookie. To avoid sending a request on every page view, we first check whether the cookie is set, and only if it isn't do we query for the information and set the cookie. This works fine with regular clients, but not when a spider crawls the site: since crawlers don't keep cookies, every page view triggers a query to MaxMind, which gets expensive. We are looking for a way to identify crawlers or, if that's easier, legitimate browsers with cookies enabled, and to query MaxMind only when it's useful.
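
For reference, the flow described above looks roughly like this (a minimal sketch; mymodule_init(), geoip_lookup_city(), and the cookie name are assumptions, not the actual module code):

```php
function mymodule_init() {
  // Only query MaxMind when no location cookie is present yet.
  if (isset($_COOKIE['geo_location'])) {
    return;
  }
  // geoip_lookup_city() stands in for the real MaxMind client call;
  // assume it returns a location string, or FALSE on failure.
  $location = geoip_lookup_city($_SERVER['REMOTE_ADDR']);
  if ($location !== FALSE) {
    // Cache the result client-side so later page views cost nothing.
    setcookie('geo_location', $location, time() + 7 * 24 * 3600, '/');
  }
}
```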


3 answers

  • dongzou3751 2010-08-20 20:29

    Well, there isn't just one thing to do, to be honest. I'd suggest what I've done in the past to combat this same issue: use a browser-detection script (there are a ton of classes out there for detecting browsers), then check the browser against a DB of known browsers. If the browser is in your list, allow the call to the paid service; if not, use a "best guess" script.

    By this I mean something like this:

    Generic ip lookup class

    So, in the event that a browser type is not in your list, it won't use your paid service's DB; instead it uses this class, which can get as close as possible. This way you get the best of both worlds: bots aren't racking up hits on your IP service, and if a user does slip past your browser check for some reason, they'll most likely still get a correct location and the site will appear normal to them. The decision looks roughly like the sketch below.
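
    A rough sketch of that decision, under assumed names: is_known_browser() stands in for your known-browser DB check, geoip_lookup_city() for the paid MaxMind call, and GenericIpLookup for the free fallback class.

    ```php
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

    if (is_known_browser($ua)) {
      // Known real browser: spend a paid MaxMind query (cached in a cookie).
      $location = geoip_lookup_city($_SERVER['REMOTE_ADDR']);
    } else {
      // Unknown UA (likely a bot): use the free "best guess" class instead,
      // so crawlers never rack up hits on the paid service.
      $lookup = new GenericIpLookup();
      $location = $lookup->locate($_SERVER['REMOTE_ADDR']);
    }
    ```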

    I know this is a little jumpy; I just hope you get what I'm trying to say here.

    The real answer is that there is no easy or 100% right answer to this issue. I have done many sites with the same situation and have gone insane trying to figure it out, and this is as close to perfect as I have come. 99% of legit crawlers will identify themselves with a user-agent value like so:

    ```php
    // Legitimate crawlers identify themselves in the user agent, e.g.:
    // 'Googlebot', 'Yammybot', 'Openbot', 'Yahoo', etc.
    $ua = $_SERVER['HTTP_USER_AGENT'];
    ```

    A simple user-agent check, something like the one sketched below, will catch those; it's the shady crawlers that report themselves as IE6 or the like that slip through.
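
    A simple check along those lines (a sketch; the bot list is illustrative, not exhaustive):

    ```php
    function is_known_crawler($ua) {
      // Illustrative list; real crawler lists are much longer.
      $bots = array('googlebot', 'yammybot', 'openbot', 'yahoo', 'bingbot');
      $ua = strtolower($ua);
      foreach ($bots as $bot) {
        // Substring match is enough for UAs like 'Googlebot/2.1 (+http://...)'.
        if (strpos($ua, $bot) !== FALSE) {
          return TRUE;
        }
      }
      return FALSE;
    }
    ```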

    I really hope this helps. Like I said, there is no real answer here, at least none that I have found to be 100%. It's kind of like detecting whether a user is on a handheld these days: you can get 99% of the way there but never 100%, and it always works out that the client lives in that 1% that doesn't work, lol.

    This answer was accepted by the asker.