doutan2111 2013-07-30 08:41
62 views

How do I allow only crawlers to access part of my website?

I've got an Ajax-rich website which has extensive _escaped_fragment_ portions for Ajax indexing. While all my _escaped_fragment_ URLs do 301 redirects to a special module which then outputs the HTML snapshots the crawlers need (i.e. mysite.com/#!/content maps to mysite.com/?_escaped_fragment_=/content, which in turn 301s to mysite.com/raw/content), I'm somewhat afraid of users stumbling on those "raw" URLs themselves and making them appear in search engines.
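To illustrate that second hop, a minimal front-controller sketch might look like the following (the paths are the ones from the example above; everything else here is just illustrative):

```php
<?php
// Illustrative only: turn ?_escaped_fragment_=/content into a 301
// to the /raw/content snapshot URL, as described above.
if (isset($_GET['_escaped_fragment_'])) {
    $fragment = $_GET['_escaped_fragment_'];   // e.g. "/content"
    header('Location: /raw' . $fragment, true, 301);
    exit;
}
```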

In PHP, how do I make sure only robots can access this part of the website? (Much like Stack Overflow disallows its sitemap to normal users and only lets robots access it.)


1 answer

  • dongyi2889 2013-07-30 08:55

    You can't, at least not reliably.

    robots.txt asks spiders to keep out of parts of a site, but there is no equivalent for regular user agents.
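    For illustration, this is the kind of rule robots.txt supports; it asks well-behaved crawlers to stay out of a path (using the /raw prefix from the question), and there is no counterpart that keeps ordinary browsers out:

    ```
    User-agent: *
    Disallow: /raw/
    ```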

    The closest you could come would be to try to keep a whitelist of acceptable IP addresses or user agents and serve different content based on that … but that risks false positives.
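    A rough PHP sketch of a user-agent whitelist, just to give the idea (the crawler tokens below are common ones, but the list is neither complete nor spoof-proof, and the path handling is simplified):

    ```php
    <?php
    // Serve the raw snapshot only when the User-Agent looks like a known
    // crawler; send ordinary visitors back to the #! version of the page.
    $ua       = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $crawlers = array('Googlebot', 'bingbot', 'Yandex', 'Baiduspider', 'DuckDuckBot');

    $isCrawler = false;
    foreach ($crawlers as $token) {
        if (stripos($ua, $token) !== false) {
            $isCrawler = true;
            break;
        }
    }

    if (!$isCrawler) {
        // e.g. /raw/content -> /#!/content
        $path = preg_replace('#^/raw#', '', $_SERVER['REQUEST_URI']);
        header('Location: /#!' . $path, true, 302);
        exit;
    }

    // ...otherwise build and output the HTML snapshot as usual.
    ```

    Checking IP addresses instead (for example, verifying Googlebot with a reverse DNS lookup) is harder to spoof, but it is more work to maintain and still not foolproof.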

    Personally, I'd stop catering to old IE, scrap the #! URIs and the _escaped_fragment_ hack, switch to using pushState and friends, and have the server build the initial view for any given page.

