double2022 2015-08-15 21:33

How do I extract the abstract from a web page?

I am writing code to extract the abstract from an arXiv page. For example, on the page http://arxiv.org/abs/1207.0102, I want to extract the text from "We study a model of..." to "...compass-Heisenberg model." My code currently looks like this:

$url="http://arxiv.org/abs/1207.0102";
$options = array(
  'http'=>array(
    'method'=>"GET",
    'header'=>"User-Agent: Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko\r\n"
  )
);
$context = stream_context_create($options);
$str = file_get_contents($url, false, $context);

if (preg_match('~<body[^>]*>(.*?)</body>~si', $str, $body))
{
    echo $body[1];
}

The problem with this is that it extracts everything in the body tag. Is there a way to extract the abstract only?


1 answer

  • dragon456101 2015-08-15 21:38

    The best option is to use a DOM parser. PHP has one built in, DOMDocument (http://php.net/manual/en/class.domdocument.php), and there are also many third-party classes that do something similar.

    Using DOMDocument, you would do something like this:

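    <?php
      $doc = new DOMDocument();
      // Stand-in markup; in practice, pass the HTML string you downloaded.
      $doc->loadHTML('<html><body><p id="abstract">Test</p></body></html>');
      // getElementById() returns a DOMElement, or null if no element has that id.
      $node = $doc->getElementById("abstract");
      // textContent is the element's text with all tags removed.
      $text = $node ? $node->textContent : null;
    ?>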
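
For a page where the abstract has no `id`, an XPath query on the class attribute is one way to reach it. The sketch below is not taken from the answer above: `extract_abstract` is a hypothetical helper, and it assumes the abstract sits in a `<blockquote>` whose class list contains `abstract`, preceded by an "Abstract:" label. Check that against the live arXiv markup before relying on it.

```php
<?php
// Hypothetical helper: returns the abstract text, or null if none is found.
// Assumption: the abstract is in <blockquote class="abstract ...">.
function extract_abstract(string $html): ?string
{
    $doc = new DOMDocument();
    // Collect parser warnings instead of printing them; real pages are rarely valid HTML.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match "abstract" as a whole word inside the class attribute.
    $query = '//blockquote[contains(concat(" ", normalize-space(@class), " "), " abstract ")]';
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // textContent drops the tags; collapse runs of whitespace into single spaces.
    $text = trim(preg_replace('/\s+/', ' ', $nodes->item(0)->textContent));
    // Drop the leading "Abstract:" label if present.
    return preg_replace('/^Abstract:\s*/', '', $text);
}

// Small inline sample so the sketch runs without a network request.
$sample = '<html><body><h1>Title</h1>'
    . '<blockquote class="abstract mathjax"><span class="descriptor">Abstract:</span> '
    . 'We study a model.</blockquote></body></html>';
echo extract_abstract($sample), "\n";
```

With the code from the question, calling `extract_abstract($str)` would take the place of the `preg_match` on `<body>`.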
    

    The other option is to use a regex, which is what you're already doing. As you can tell, it is messier and takes some learning; see http://www.regular-expressions.info/tutorial.html
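
If you stay with regex, narrowing the pattern from `<body>` to the element that holds the abstract is the main change. This is a rough sketch under the same unverified assumption about the markup (a `<blockquote>` whose class list contains `abstract`, with no nested `<blockquote>` inside it); `extract_abstract_regex` is a hypothetical name.

```php
<?php
// Hypothetical helper: regex-only variant, returns null when nothing matches.
// Assumption: <blockquote class="abstract ..."> with no nested blockquotes.
function extract_abstract_regex(string $html): ?string
{
    $pattern = '~<blockquote[^>]*class="[^"]*\babstract\b[^"]*"[^>]*>(.*?)</blockquote>~si';
    if (!preg_match($pattern, $html, $m)) {
        return null;
    }
    // Strip inner tags, decode entities, and normalise whitespace.
    $text = html_entity_decode(strip_tags($m[1]), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $text = trim(preg_replace('/\s+/', ' ', $text));
    // Drop the leading "Abstract:" label if present.
    return preg_replace('/^Abstract:\s*/', '', $text);
}

// Inline sample so the sketch runs without a network request.
$sample = '<blockquote class="abstract mathjax"><span class="descriptor">Abstract:</span> '
    . 'We study a model.</blockquote>';
echo extract_abstract_regex($sample), "\n";
```

The pattern breaks as soon as the attribute quoting or nesting changes, which is the messiness the DOM approach avoids.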

    Thanks.

