ggjge 2014-12-14 09:14 · acceptance rate: 0%
8495 views
Closed

How can nginx block crawlers (YisouSpider, Googlebot, etc.)? Reward offered

The site is getting crawled so hard it can't keep up, so I want to block all of these crawlers.

nginx serves multiple sites. nginx.conf itself has no server block; only the individual sites' .conf files do.

Following advice I found online, I added the following to a server block in nginx.conf (a server block I added myself):
if ($http_user_agent ~* "qihoobot|Baiduspider|Googlebot|Googlebot-Mobile|Googlebot-Image|Mediapartners-Google|Adsbot-Google|Feedfetcher-Google|Yahoo! Slurp|Yahoo! Slurp China|YoudaoBot|Sosospider|Sogou spider|Sogou web spider|MSNBot|ia_archiver|Tomato Bot") {
    return 403;
}

But curl -I -A "Googlebot" www.XXX.com still does not return 403.
Something is clearly off.
Could someone experienced point me in the right direction?

Also, robots.txt is no use here: it depends entirely on crawlers playing by the rules. I want something that actively blocks them, because some rogue crawlers obviously can't be handled that way.
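From the nginx docs, the server block that handles a request is chosen by matching the request's Host header against each server_name, so a rule in a hand-added server block in nginx.conf may never see requests that the per-site server blocks claim. A minimal self-contained sketch of what I am trying to achieve (the host name is a placeholder):

server {
    listen 80;
    server_name www.XXX.com;   # placeholder

    # ~* makes the regex match the User-Agent case-insensitively
    if ($http_user_agent ~* "YisouSpider|Googlebot|Baiduspider") {
        return 403;
    }
}

# after editing, reload the config:  nginx -s reload
# then test against this exact vhost:
# curl -I -A "Googlebot" -H "Host: www.XXX.com" http://127.0.0.1/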

nginx.conf is as follows:

#user nobody;
worker_processes 2;

#error_log logs/error.log;
#error_log logs/error.log notice;
#error_log logs/error.log info;

#location of the pid file
pid nginx.pid;

events {
    worker_connections 10240;
}

http {

include       mime.types;
default_type  application/octet-stream;

log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                  '$status $body_bytes_sent "$http_referer" '
                  '"$http_user_agent" "$http_x_forwarded_for"';

#access_log  logs/access.log  main;

sendfile        on;
#tcp_nopush     on;

#keepalive_timeout  0;
keepalive_timeout  65;

#gzip  on;

    #   open(OUTFILE, ">>/home/wamdm/perl_learn/a");
    #   print OUTFILE ($r->uri,"\n");
    #   close (OUTFILE);


perl_set $fix_upper_lower_case '
    use File::Basename;
    sub {
        my $r = shift;
        my $uri = $r->uri;
        my $filepath = $r->filename;
        my $uri_prefix = substr($uri, 0, rindex($uri, "/") + 1);
        my $dir = dirname($filepath);
        my $filename = basename($filepath);
        opendir(my $dh, $dir) || die ("~~fail to open dir $dir");
        # Match the requested name case-insensitively; \Q..\E escapes regex
        # metacharacters (e.g. the dot before the extension) and the anchors
        # require a whole-name match rather than a substring match.
        my @files = grep { /^\Q$filename\E$/i && -f "$dir/$_" } readdir($dh);
        closedir($dh);
        if (@files > 0) {
            return "$uri_prefix$files[0]";   # $files[0]: single element, not a slice
        }
        return $r->uri;
    }
';

server {

    if ($http_user_agent ~* "MJ12bot|qihoobot|Baiduspider|Googlebot|Googlebot-Mobile|YandexBot|Googlebot-Image|Mediapartners-Google|Adsbot-Google|Feedfetcher-Google|Yahoo! Slurp|Yahoo! Slurp China|YoudaoBot|Sosospider|Sogou spider|Sogou web spider|MSNBot|ia_archiver|Tomato Bot") {
        return 403;
    }

    listen 80;

    #server_name  localhost;

    #charset koi8-r;

    #access_log  logs/host.access.log  main;

    #location / {
    #    root   html;
    #    index  index.html index.htm;
    #}

    #error_page  404              /404.html;

    # redirect server error pages to the static page /50x.html
    #

    #error_page   500 502 503 504  /50x.html;
    #location = /50x.html {
    #    root   html;
    #}

    # proxy the PHP scripts to Apache listening on 127.0.0.1:80
    #
    #location ~ \.php$ {
    #    proxy_pass   http://127.0.0.1;
    #}

    # pass the PHP scripts to FastCGI server listening on 127.0.0.1:9000
    #
    #location ~ \.php$ {
    #    root           html;
    #    fastcgi_pass   127.0.0.1:9000;
    #    fastcgi_index  index.php;
    #    fastcgi_param  SCRIPT_FILENAME  /scripts$fastcgi_script_name;
    #    include        fastcgi_params;
    #}

    # deny access to .htaccess files, if Apache's document root
    # concurs with nginx's one
    #
    #location ~ /\.ht {
    #    deny  all;
    #}
}


# another virtual host using mix of IP-, name-, and port-based configuration
#
#server {
#    listen       8000;
#    listen       somename:8080;
#    server_name  somename  alias  another.alias;

#    location / {
#        root   html;
#        index  index.html index.htm;
#    }
#}


# HTTPS server
#
#server {
#    listen       443;
#    server_name  localhost;

#    ssl                  on;
#    ssl_certificate      cert.pem;
#    ssl_certificate_key  cert.key;

#    ssl_session_timeout  5m;

#    ssl_protocols  SSLv2 SSLv3 TLSv1;
#    ssl_ciphers  HIGH:!aNULL:!MD5;
#    ssl_prefer_server_ciphers   on;

#    location / {
#        root   html;
#        index  index.html index.htm;
#    }
#}

}

The per-site .conf file is as follows:
server {
    listen 80;
    server_name computer.cdblp.cn;
    access_log /home/wamdm/sites/logs/computer.access.log main;
    error_log /home/wamdm/sites/logs/computer.error.log error;

    root /home/wamdm/sites/searchscholar/computer;
    index index.php index.html index.htm;

    rewrite  "^/conference/([^/]+)$" /con_detail.php?con_title=$1 last;
    rewrite  "^/conference/([^/]+)/$" /con_detail.php?con_title=$1 last;

    if ($http_user_agent ~* "qihoobot|Baiduspider|Googlebot|Googlebot-Mobile|Googlebot-Image|Mediapartners-Google|Adsbot-Google|Feedfetcher-Google|Yahoo! Slurp|Yahoo! Slurp China|YoudaoBot|Sosospider|Sogou spider|Sogou web spider|MSNBot|ia_archiver|Tomato Bot") {
        return 403;
    }

    #upper/lower-case patch for sites migrated from Windows (case-insensitive)
    #to Ubuntu (case-sensitive)
    #does not work for requests that rely on URL rewriting
    #if ( !-e $request_filename ) {
    #    rewrite ^(.*)$ $fix_upper_lower_case last;
    #}

    #location / {
    #    include agent_deny.conf;
    #}

#don't log favicon.ico requests
location = /favicon.ico {
    log_not_found off;
    access_log off;
}

#deny access to hidden (dot) files
location ~ /\. {
    deny all;
    access_log off;
    log_not_found off;
}

#don't log requests for images, flash files, etc.
location ~ .*\.(gif|jpg|jpeg|png|bmp|swf)$ {
    expires      7d; #responses expire after 7 days
    access_log off;
}

#don't log requests for js and css files
location ~ .*\.(js|css)?$ {
    expires      1d; #responses expire after 1 day
    access_log off;
}


#php-cgi setup
location ~ [^/]\.php(/|$) {
    fastcgi_split_path_info ^(.+?\.php)(/.*)$;
    #intercept requests for php pages that don't exist
    if (!-f $document_root$fastcgi_script_name) {
        return 404;
    }
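    # NOTE: as posted, nothing in this block hands the request to PHP; a
    # typical FastCGI hookup (the 127.0.0.1:9000 address is the one from the
    # commented sample in nginx.conf above, so treat it as an assumption):
    #fastcgi_pass   127.0.0.1:9000;
    #fastcgi_index  index.php;
    #fastcgi_param  SCRIPT_FILENAME $document_root$fastcgi_script_name;
    #include        fastcgi_params;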

}

}
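I also experimented with keeping the ban list in a shared agent_deny.conf that each site's .conf pulls in (the commented-out include above), so the rule applies no matter which vhost catches the request. Roughly what that file would contain (the file name comes from my config; the contents here are just a sketch):

# agent_deny.conf -- shared ban list, included from each server block
if ($http_user_agent ~* "YisouSpider|MJ12bot|qihoobot|Sosospider|Tomato Bot") {
    return 403;
}
# optionally also reject clients that send no User-Agent at all
if ($http_user_agent = "") {
    return 403;
}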


7 answers

  • Go 旅城通票 2014-12-14 09:33

    Just configure robots.txt to tell crawlers to stay away. Rogue crawlers may ignore robots.txt, but the big ones like Google, Baidu, and Sogou mostly respect it.

    How to use robots.txt, explained in detail

     User-agent: *
     Disallow: /
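
    robots.txt can also target individual crawlers instead of all of them; for the bots named in the question that would look roughly like:

     User-agent: YisouSpider
     Disallow: /

     User-agent: Googlebot
     Disallow: /

    But again, only well-behaved bots will honor it.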
    
