How do I extract the abstract from a web page?

I am writing code to extract the abstract from an arXiv page. For example, on the page http://arxiv.org/abs/1207.0102 I am interested in extracting the text from "We study a model of..." to "...compass-Heisenberg model." My code currently looks like this:

$url="http://arxiv.org/abs/1207.0102";
$options = array(
  'http'=>array(
    'method'=>"GET",
    'header'=>"User-Agent: Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko\r\n"
  )
);
$context = stream_context_create($options);
$str = file_get_contents($url, false, $context);

if (preg_match('~<body[^>]*>(.*?)</body>~si', $str, $body))
{
    echo $body[1];
}

The problem with this is that it extracts everything inside the body tag. Is there a way to extract only the abstract?

1 answer

关注

码龄粉丝数原力等级 --

被采纳

被点赞

采纳率
dragon456101 2015-08-15 21:38
The best option would be to use a DOM parser. PHP has one built in, DOMDocument (http://php.net/manual/en/class.domdocument.php), and there are also plenty of third-party classes that do something similar.

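Using DOMDocument you would do something like this:

<?php
$doc = new DOMDocument();
// Placeholder markup; in practice pass the HTML fetched with file_get_contents().
$doc->loadHTML("<html><body>Test<br></body></html>");
// Returns the element whose id is "abstract", or null if no such element exists.
$text = $doc->getElementById("abstract");
?>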
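To make the DOM approach concrete for the arXiv case, here is a minimal sketch. It assumes the abstract is wrapped in a blockquote whose class attribute includes "abstract", preceded by an "Abstract:" label; that markup is an assumption not stated in the question, so check it against the page source. The function name extractAbstract and the sample HTML are hypothetical.

```php
<?php
// Hypothetical helper: returns the abstract text from arXiv-style HTML, or null.
function extractAbstract(string $html): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Assumption: the abstract sits in <blockquote class="abstract ...">.
    $nodes = $xpath->query(
        '//blockquote[contains(concat(" ", normalize-space(@class), " "), " abstract ")]'
    );
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    $text = $nodes->item(0)->textContent;
    // Drop the leading "Abstract:" label and collapse runs of whitespace.
    $text = preg_replace('/^\s*Abstract:\s*/', '', $text);
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Small inline sample so the sketch runs without network access.
$sample = '<html><body><blockquote class="abstract mathjax">'
        . '<span class="descriptor">Abstract:</span> We study a model.'
        . '</blockquote></body></html>';
echo extractAbstract($sample), "\n";
```

With the code from the question, you would pass $str (the result of file_get_contents) to extractAbstract() instead of the inline sample.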

The other option is to use a regular expression, which seems to be what you are already doing. As you can tell, it is messier and takes some learning; see http://www.regular-expressions.info/tutorial.html for a tutorial.
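For comparison, a regex version might look like the sketch below. It makes the same assumption as any arXiv-specific approach: that the abstract is inside a blockquote whose class attribute contains the word "abstract" and starts with an "Abstract:" label, which should be verified against the actual page. The function name extractAbstractWithRegex is hypothetical, and the pattern will break if the markup changes or if blockquotes are nested.

```php
<?php
// Hypothetical helper: regex-based extraction, returns the abstract text or null.
function extractAbstractWithRegex(string $html): ?string
{
    // Assumption: <blockquote class="... abstract ..."> wraps the abstract.
    $pattern = '~<blockquote[^>]*class="[^"]*\babstract\b[^"]*"[^>]*>(.*?)</blockquote>~si';
    if (!preg_match($pattern, $html, $m)) {
        return null;
    }

    // Remove inner tags, decode entities, drop the label, collapse whitespace.
    $text = strip_tags($m[1]);
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $text = preg_replace('/^\s*Abstract:\s*/', '', $text);
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Inline sample so the sketch runs without network access.
$regexSample = '<body><blockquote class="abstract mathjax">'
             . '<span class="descriptor">Abstract:</span> We study a model.'
             . '</blockquote></body>';
echo extractAbstractWithRegex($regexSample), "\n";
```

This replaces the body-matching preg_match in the question with a narrower pattern; the DOM parser remains the more robust choice.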

Thanks.