doupa2871 2013-06-12 00:11 Acceptance rate: 100%

Why can't some websites be scraped?

I have just started learning how to use regular expressions to extract data from websites. My first goal is to extract the title of a website. Here is my code:

<?php 
    $data = file_get_contents('http://bctia.org');
    $regex = '/<title>(.+?)<\/title>/';
    preg_match($regex,$data,$match);
    var_dump($match); 
?>

The result of var_dump is empty:

array(0) { }

At first I thought, "maybe bctia.org does not have a title"? However, this is not the case, as I have checked the source of bctia.org, and it does have content between <title> and </title>.

Then I thought, maybe my code does not work? However, this is not the case either: when I replace bctia.org with other websites, say bing.com or apple.com, both return correct results. For example, with apple.com I get

array(2) { [0]=> string(20) "<title>Apple</title>" [1]=> string(5) "Apple" }

So I have come to the conclusion that bctia.org is a very special website that prevents me from extracting its title...

I am wondering if that is actually the case? Or maybe my code has some problems that I have not identified?

Thank you in advance!


  • doupang4126 2013-06-12 00:56

    This specific website's server-side code assumes that the client sends a User-Agent header, and apparently your PHP installation is not configured to send one. The server therefore returns a 500 Internal Server Error, which causes file_get_contents to return false. preg_match then has nothing to match against, so $match stays empty.

    Source Error:
    Line 66: //LOAD: Compatibility Mode
    Line 67: //<meta http-equiv="X-UA-Compatible" content="IE=7,IE=9" />
    Line 68: string BrowserOS = Request.ServerVariables["HTTP_USER_AGENT"].ToString();
    Line 69: HtmlMeta compMode = new HtmlMeta();
    Line 70: compMode.Content = "IE=7,IE=9";
    
    
    Source File: c:\inetpub\wwwroot\BCTIA\Website\bctia\layouts\Main Layout.aspx.cs   
    Line: 68
    
    Stack Trace:
    [NullReferenceException: Object reference not set to an instance of an object.]
       Layouts.Main_Layout.Page_Load(Object sender, EventArgs e) in c:\inetpub\wwwroot\BCTIA\Website\bctia\layouts\Main Layout.aspx.cs:68
       System.Web.Util.CalliHelper.EventArgFunctionCaller(IntPtr fp, Object o, Object t, EventArgs e) +24
       System.Web.UI.Control.LoadRecursive() +70
       System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint) +3063
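
A failure like this is easy to miss because the original script never checks what file_get_contents returned. A minimal sketch of how it could be surfaced from the PHP side follows. The helper names status_code_from_headers and fetch_page are made up for illustration, and reading the status from the first header line is an assumption that ignores redirects.

```php
<?php
// Hypothetical helper (not part of the original post): pull the numeric
// status code out of the first response header line, for example
// "HTTP/1.1 500 Internal Server Error".
function status_code_from_headers(array $headers): ?int
{
    if (isset($headers[0]) && preg_match('#^HTTP/\S+\s+(\d{3})#', $headers[0], $m)) {
        return (int) $m[1];
    }
    return null;
}

// Hypothetical helper: fetch a URL and return the body, or null on failure.
function fetch_page(string $url): ?string
{
    // "@" hides the PHP warning; the return value is checked explicitly instead.
    $data = @file_get_contents($url);
    if ($data === false) {
        // The HTTP wrapper fills $http_response_header in the calling scope
        // when the server sent any response at all; it may be unset otherwise.
        $code = status_code_from_headers($http_response_header ?? []);
        error_log("Request to $url failed, status: " . ($code ?? 'unknown'));
        return null;
    }
    return $data;
}
```

Calling fetch_page('http://bctia.org') without a User-Agent configured would then log the failed request instead of silently handing false to preg_match.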
    

    To work around this issue, you can just set a user-agent string before making the request:

    ini_set('user_agent', 'Mozilla/5.0 (compatible; Examplebot/0.1; +http://www.example.com/bot.html)');
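
If changing the setting for the whole script is not desirable, the same header can be sent for a single request through a stream context. This is only a sketch: make_context, extract_title and fetch_title are hypothetical names, the 10-second timeout is an assumed value, and the slightly more tolerant regex (case-insensitive, multi-line, attributes allowed) is an addition that the original question did not use.

```php
<?php
// Hypothetical helper: build a stream context that sends a User-Agent
// for one request only, leaving the global ini setting untouched.
function make_context(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'user_agent' => $userAgent,
            'timeout'    => 10, // seconds; an assumed value
        ],
    ]);
}

// Hypothetical helper: return the page title, or null if none is found.
function extract_title(string $html): ?string
{
    // "s" lets "." match newlines, "i" accepts <TITLE>,
    // and [^>]* tolerates attributes on the tag.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/si', $html, $match)) {
        return trim($match[1]);
    }
    return null;
}

// Hypothetical helper: fetch a page with the given User-Agent and return
// its title, or null if the request failed or no title was found.
function fetch_title(string $url, string $userAgent): ?string
{
    $data = @file_get_contents($url, false, make_context($userAgent));
    if ($data === false) {
        return null; // request failed, e.g. the server answered with an error
    }
    return extract_title($data);
}
```

Usage would look like fetch_title('http://bctia.org', 'Mozilla/5.0 (compatible; Examplebot/0.1; +http://www.example.com/bot.html)'), with the agent string replaced by one that identifies your own client.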
    