douyunjiaok300404 2013-10-08 00:40
Views: 162

Fastest way to get metadata from a website

I'm trying to get the 'title' from websites. At the moment I'm using preg_match to extract it, but the page is very slow to load.

What I have at the moment:

This passes links through to a function:

<?php 
foreach($savedLinks as $s)
{
    echo "<div class='savedLink'>";
        echo "<h5>" . getMetaData($s) . "</h5>";
        echo "<a href='" . $s . "'>" . $s . "</a><br />";
    echo "</div>";
}
?>

This function grabs the title from each website passed in:

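function getMetaData($url)
{
    // Download the page once and reuse the result (the page was previously fetched twice).
    $html = @file_get_contents($url);
    if(!$html)
    {
        return "";
    }

    // "i" ignores tag case, "s" lets "." match newlines, ".*?" stops at the first </title>.
    if(preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $matches))
        return trim($matches[1]);
    else
        return "Not Found";
}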

Is there a fast way to get 'title' from each page?

2 answers

  • dongqi4085 2013-10-08 00:49
    I'm going to go out on a limb and guess that file_get_contents is taking a lot longer than preg_match, which I would expect to be pretty fast.
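
One way to check that guess is to time the two steps separately. The following is a minimal sketch, not a benchmark: `timeCall()` is a hypothetical helper name and the URL in the commented usage is a placeholder.

```php
<?php
// timeCall() is a hypothetical helper, not part of the original code.
// It runs a callable and returns [result, elapsed seconds].
function timeCall(callable $fn): array
{
    $start  = hrtime(true);              // monotonic clock, in nanoseconds
    $result = $fn();
    return [$result, (hrtime(true) - $start) / 1e9];
}

// Usage (placeholder URL, substitute one of the saved links):
// [$html, $fetchSecs] = timeCall(fn() => @file_get_contents('http://example.com/'));
// [$hit, $matchSecs]  = timeCall(fn() => preg_match('/<title>(.+)<\/title>/', (string) $html));
// printf("fetch: %.3fs, match: %.6fs\n", $fetchSecs, $matchSecs);
```

If the fetch time dominates, as suspected, the regex is not worth optimizing and the network request is the thing to shrink.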

    If you're doing this across a lot of different sites, this may not help everywhere, but you might want to look into byte-range requests. If you can predict that the title tag falls within the first X bytes of the HTML response, you can make a partial request with a Range header and avoid moving the whole document over the wire just to get the title. If the pages are dynamically generated, the code on the server has to support this. If they're static documents, chances are good that byte-range requests are supported.
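
A partial request along those lines might look like the sketch below. `fetchHead()` and `extractTitle()` are hypothetical helper names, and the 8192-byte default is only an assumed value for "X"; neither comes from the question. The length argument to file_get_contents caps how much is read even when a server ignores the Range header and sends the full document.

```php
<?php
// Hypothetical helpers illustrating a byte-range request; not the asker's code.

// Pull the <title> text out of an HTML fragment, or return null if there is none.
function extractTitle(string $html): ?string
{
    // "i": ignore tag case, "s": "." matches newlines, ".*?": stop at the first </title>.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim(html_entity_decode($m[1], ENT_QUOTES | ENT_HTML5, 'UTF-8'));
    }
    return null;
}

// Request only the first $maxBytes of $url. Returns null on failure.
function fetchHead(string $url, int $maxBytes = 8192): ?string
{
    $context = stream_context_create([
        'http' => [
            'method'  => 'GET',
            'header'  => 'Range: bytes=0-' . ($maxBytes - 1) . "\r\n",
            'timeout' => 5,   // seconds; an assumed value
        ],
    ]);
    // Offset 0, length $maxBytes: stop reading once enough bytes have arrived.
    $html = @file_get_contents($url, false, $context, 0, $maxBytes);
    return $html === false ? null : $html;
}
```

In the loop from the question, `getMetaData($s)` could then be replaced with something like `extractTitle(fetchHead($s) ?? '') ?? 'Not Found'`.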

    https://serverfault.com/questions/398219/how-can-i-enable-byte-range-request

    As the second answer there suggests, also try enabling keep-alive by changing "Connection: close" to "Connection: keep-alive". Again, this only helps if you're hitting the same server multiple times and the server has it enabled. Together, those two things could save a lot of time per request.
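
If the cURL extension is available, one way to get connection reuse is to keep a single cURL handle alive across requests, since libcurl reuses open connections to the same host by default. This is a sketch under that assumption: `fetchTitles()` and `titleFromHtml()` are hypothetical names, and the timeouts and byte limit are assumed values.

```php
<?php
// Hypothetical helpers; assumes the cURL extension is installed.

// Return the <title> text of an HTML fragment, or "Not Found".
function titleFromHtml(string $html): string
{
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim($m[1]);
    }
    return "Not Found";
}

// Fetch the titles of several URLs through one shared cURL handle.
// Returns an array mapping each URL to its title ("" when the fetch fails).
function fetchTitles(array $urls, int $maxBytes = 8192): array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,                    // return the body as a string
        CURLOPT_FOLLOWLOCATION => true,                    // follow redirects
        CURLOPT_CONNECTTIMEOUT => 3,                       // seconds; assumed value
        CURLOPT_TIMEOUT        => 5,                       // seconds; assumed value
        CURLOPT_RANGE          => '0-' . ($maxBytes - 1),  // byte-range request
    ]);

    $titles = [];
    foreach ($urls as $url) {
        // Same handle every time, so connections to a repeated host can be reused.
        curl_setopt($ch, CURLOPT_URL, $url);
        $html = curl_exec($ch);
        $titles[$url] = is_string($html) ? titleFromHtml($html) : "";
    }
    return $titles;   // the handle is released when $ch goes out of scope
}
```

With this, the loop from the question could call `fetchTitles($savedLinks)` once up front and look up each title by URL while printing.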
