dongxi5494 2014-01-26 19:05

Pretty URLs and robots.txt

Let's assume we are using pretty URLs with mod_rewrite or something similar and have the following two routes:

  • /page
  • /page-two

Now we want to prevent robots from crawling only the first route (/page).

# robots.txt
User-agent: *
Disallow: /page

Disallow (http://www.robotstxt.org/orig.html):

... For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.

So the above robots.txt example is disallowing /page-two too, correct?

What is the correct way to get this done?

Would the following work?

# robots.txt
User-agent: *
Disallow: /page/
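
One way to check the prefix behaviour quoted above is a minimal sketch with Python's standard urllib.robotparser, which applies plain prefix matching in rule order. The host example.com and the helper name crawlable are placeholders for illustration only.

```python
from urllib.robotparser import RobotFileParser


def crawlable(robots_lines, path):
    """Return True if a generic robot ("*") may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    # example.com is a placeholder host; only the path is matched.
    return parser.can_fetch("*", "http://example.com" + path)


without_slash = ["User-agent: *", "Disallow: /page"]
with_slash = ["User-agent: *", "Disallow: /page/"]

for path in ("/page", "/page-two", "/page/sub"):
    print(path, crawlable(without_slash, path), crawlable(with_slash, path))
```

Under this prefix matching, Disallow: /page blocks /page-two as well, while Disallow: /page/ leaves both /page-two and /page itself crawlable and blocks only paths below /page/.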

3 answers

  • dongyipa0028 2014-01-26 19:18

    From Google's robots.txt specifications:

    At a group-member level, in particular for allow and disallow directives, the most specific rule based on the length of the [path] entry will trump the less specific (shorter) rule. The order of precedence for rules with wildcards is undefined.

    This means that it doesn't matter in what order you define them. In your case this should work:

    User-agent: *
    Disallow: /page
    Allow: /page-
    

    To make this clearer: every URL is matched against all paths. /page will match /page/123, /page/subdirectory/123/whateverishere.html, /page-123 and /page. The directive with the longest matching path is used. If both /page and /page- match, the directive for /page- is used (Allow). If /page matches but /page- doesn't, the directive for /page is used (Disallow). If neither /page nor /page- matches, the default is assumed (Allow).
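
The longest-match rule described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name is_allowed and the (directive, prefix) tuple format are made up here, and wildcards, empty paths, and ties between equally long rules are not handled.

```python
def is_allowed(path, rules):
    """Apply the longest matching prefix rule to a URL path.

    rules is a list of (directive, prefix) tuples (a hypothetical format),
    for example ("disallow", "/page").
    """
    best = None
    for directive, prefix in rules:
        if path.startswith(prefix):
            # Keep the rule with the longest matching prefix seen so far.
            if best is None or len(prefix) > len(best[1]):
                best = (directive, prefix)
    # No matching rule means the default applies: allow.
    return best is None or best[0] == "allow"


rules = [("disallow", "/page"), ("allow", "/page-")]

for path in ("/page", "/page/123", "/page-two", "/other"):
    print(path, is_allowed(path, rules))
```

Because the decision depends only on prefix length, reversing the order of the two rules gives the same results. Note that this holds for crawlers that follow the longest-match rule; parsers that stop at the first matching rule may behave differently.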

    Accepted by the asker as the best answer.
