dtkyayvldeaqhl7151 2019-02-19 19:21

How can I detect social-media bots and refine the user agent in PHP?

I am trying to build a script that captures the user agent of each visitor. That can easily be done using $_SERVER['HTTP_USER_AGENT'].
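
For reference, a minimal sketch of that capture step (the log-file name is illustrative, and note that the header is client-supplied, so it may be absent or spoofed):

<?php
// Append each visitor's user agent to a log file with a timestamp.
// HTTP_USER_AGENT comes from the client, so treat it as untrusted input.
$ua = $_SERVER['HTTP_USER_AGENT'] ?? 'unknown';
file_put_contents('agents.log', date('c') . ' ' . $ua . "\n", FILE_APPEND | LOCK_EX);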

Example: below are all the Twitter bots detected via $_SERVER['HTTP_USER_AGENT'].

I simply posted a link to the PHP script on Twitter and it detected the bots: here

Here are the bots captured via HTTP_USER_AGENT from the Twitter network.

1. Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.1.2) Gecko/20090729 Firefox/52.0
2. Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)
3. Mozilla/5.0 (compatible; AhrefsBot/6.1; News; +http://ahrefs.com/robot/)
4. Mozilla/5.0 (compatible; TrendsmapResolver/0.1)
5. Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36 (not sure whether this is a bot or a normal agent)
6. Twitterbot/1.0
7. Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1; +http://www.apple.com/go/applebot)

Now I want to refine/filter the bot names out of the detected HTTP_USER_AGENT strings; one way to do that is sketched after the example list below.

Example:

rv:1.9.1.2
Trident/4.0
(compatible; AhrefsBot/6.1; News; +http://ahrefs.com/robot/)
(compatible; TrendsmapResolver/0.1)
Twitterbot/1.0
(Applebot/0.1; +http://www.apple.com/go/applebot)
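
A minimal sketch of one way to extract such tokens (the pattern list is an assumption drawn from the agents above, not a complete catalogue of crawler signatures, and extractBotToken is a hypothetical helper name):

<?php
// Try a list of regex patterns against the user agent and return the first
// matching token, or null if nothing matches.
function extractBotToken(string $ua): ?string
{
    $patterns = [
        '~([A-Za-z]*[Bb]ot/[\d.]+)~',      // Twitterbot/1.0, AhrefsBot/6.1, Applebot/0.1
        '~(TrendsmapResolver/[\d.]+)~',    // Twitter link-preview resolver
        '~(Trident/[\d.]+)~',              // old IE layout-engine token
        '~(rv:[\d.]+)~',                   // Gecko revision token
    ];
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $ua, $matches)) {
            return $matches[1];
        }
    }
    return null;
}

echo extractBotToken('Twitterbot/1.0');    // prints "Twitterbot/1.0"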

What I have tried so far:

$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';

if (strpos($ua, 'Twitterbot/1.0') !== false ||
    strpos($ua, 'Applebot/0.1') !== false) {
    // Note: these are exact substring matches, so e.g. "Twitterbot/1.1"
    // would slip through.
    $file = fopen('crawl.txt', 'a');
    fwrite($file, "TW-bot detected.\n");
    fclose($file);
    echo "TW-bot detected.";
} else {
    $file = fopen('crawl.txt', 'a');
    fwrite($file, "Nothing found.\n");
    fclose($file);
    echo "Nothing";
}

But somehow the above code is not working: crawl.txt always shows "Nothing found." Let me know where I am going wrong and what the proper/better/best way to detect bots is; any direction or guidance is appreciated.


1 answer

  • dongtangxi1584 2019-02-19 19:38

    You might find that it's easy to spot the bots that capture simple website previews, but the user agents of bots that scrape for restricted content are a lot harder to identify.

    You'd have to do more than just parse the UA. Interrogating REMOTE_ADDR will also be necessary: you'd fire each request through something like http://ip-api.com to determine whether it's coming from a datacenter. Be careful of users behind proxies, as they will trigger false positives. You could go further and investigate browser capabilities with JavaScript, but be aware that this is a difficult problem and a constant arms race between a provider's detection tools and (usually) black-hat advertisers.
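
    A minimal sketch of that IP check (the field names follow ip-api.com's JSON API as I understand it, so verify them against the docs; the free endpoint is HTTP-only and rate-limited; looksLikeDatacenterIp is a hypothetical helper name):

    <?php
    // Ask ip-api.com whether an address belongs to a datacenter or known proxy.
    function looksLikeDatacenterIp(string $ip): bool
    {
        $url = 'http://ip-api.com/json/' . urlencode($ip) . '?fields=status,proxy,hosting';
        $json = @file_get_contents($url);   // suppress warnings; handle failure below
        if ($json === false) {
            return false;                   // lookup failed; don't block on missing data
        }
        $data = json_decode($json, true);
        if (!is_array($data) || ($data['status'] ?? '') !== 'success') {
            return false;
        }
        // "hosting" flags datacenter ranges; "proxy" flags known proxies/VPNs.
        // Remember that ordinary users behind proxies will trigger false positives.
        return !empty($data['hosting']) || !empty($data['proxy']);
    }

    if (looksLikeDatacenterIp($_SERVER['REMOTE_ADDR'] ?? '')) {
        echo "Request likely from a datacenter or proxy.\n";
    }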

