Text parser with PHP, like Instapaper

I'm trying to write a text parser in PHP, like the one Instapaper uses. What I want to do is fetch a webpage and parse it in text-only mode.

It's simple to fetch the webpage with cURL and strip the HTML tags. But every webpage has some common areas, like the header, navigation, sidebar, footer, banners, etc. I only want to get the article in text mode and exclude all the other parts. It's also simple to exclude those parts if I know their "id" or "class" info. But I'm trying to automate this process and apply it to any page, like Instapaper does.
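For reference, the baseline described above might look roughly like the sketch below. It is only an illustration: `fetchPage()` and `textWithoutIds()` are hypothetical helper names, the ids passed in are assumed to be known in advance, and the text is read through `DOMDocument` rather than `strip_tags()` so that specific containers can be dropped first.

```php
<?php
// Baseline sketch: fetch a page with cURL, drop containers whose ids are
// already known, and return the remaining text.
// fetchPage() and textWithoutIds() are hypothetical names, not library functions.

function fetchPage(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 10,   // give up after 10 seconds
    ]);
    $html = curl_exec($ch);
    return $html === false ? '' : $html;
}

function textWithoutIds(string $html, array $ids): string
{
    if (trim($html) === '') {
        return '';
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Scripts and styles never belong in the text output.
    $queries = ['//script', '//style'];
    foreach ($ids as $id) {
        // Assumes the ids contain no quote characters.
        $queries[] = "//*[@id='$id']";
    }
    // Copy the node list to an array first, because removing nodes
    // while iterating a live list can skip elements.
    foreach (iterator_to_array($xpath->query(implode(' | ', $queries))) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    $text = $body ? $body->textContent : '';
    return trim(preg_replace('/\s+/', ' ', $text)); // collapse whitespace
}
```

A call such as `textWithoutIds(fetchPage($url), ['nav', 'footer'])` only works when the ids are known for that particular site, which is exactly the limitation described above.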

I can get all the content of the page body, but I don't know how to exclude the header, sidebar, or footer and keep only the main article body. I have to develop some logic that picks out only the main article part.

I don't need the exact code. It would also be useful to understand how to exclude the unnecessary parts, since I can then try to write my own code in PHP. Examples in other languages would be useful too.

Thanks for helping.

5 answers

dsuoedtom207012191 2010-01-24 01:11
You might try looking at the algorithms behind this bookmarklet, Readability. It has a decent success rate at extracting the content from among all the rubbish on a web page.

A friend of mine made it, which is why I'm recommending it: I know it works, and I'm aware of the many techniques he uses to parse the data. You could apply these techniques to what you're asking.
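The answer does not spell out those techniques. One heuristic commonly used for this kind of extraction is to score each container by how much paragraph text it holds and to penalize containers whose text is mostly links. The sketch below illustrates that idea only; it is not Readability's actual code, and the function names, the 25-character threshold, and the scoring weights are all assumptions chosen for illustration.

```php
<?php
// Simplified content-scoring sketch (NOT Readability's actual algorithm).
// linkDensity() and extractMainText() are hypothetical names; the weights
// and thresholds are illustrative assumptions.

// Fraction of an element's text that sits inside <a> tags.
function linkDensity(DOMElement $el): float
{
    $textLen = strlen(trim($el->textContent));
    if ($textLen === 0) {
        return 1.0;
    }
    $linkLen = 0;
    foreach ($el->getElementsByTagName('a') as $a) {
        $linkLen += strlen(trim($a->textContent));
    }
    return min(1.0, $linkLen / $textLen);
}

function extractMainText(string $html): string
{
    if (trim($html) === '') {
        return '';
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate invalid markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $scores = [];   // node path => accumulated score
    $elements = []; // node path => parent element
    foreach ($doc->getElementsByTagName('p') as $p) {
        $parent = $p->parentNode;
        $len = strlen(trim($p->textContent));
        if (!$parent instanceof DOMElement || $len < 25) {
            continue; // skip very short paragraphs
        }
        // One point per paragraph, one per comma, up to three for length.
        $points = 1 + substr_count($p->textContent, ',') + min(3, intdiv($len, 100));
        $path = $parent->getNodePath();
        $scores[$path] = ($scores[$path] ?? 0) + $points;
        $elements[$path] = $parent;
    }

    // Pick the container with the best score after the link penalty.
    $best = null;
    $bestScore = 0.0;
    foreach ($scores as $path => $score) {
        $adjusted = $score * (1 - linkDensity($elements[$path]));
        if ($adjusted > $bestScore) {
            $bestScore = $adjusted;
            $best = $elements[$path];
        }
    }
    if ($best === null) {
        return '';
    }

    // Return the winning container's direct paragraphs as plain text.
    $paragraphs = [];
    foreach ($best->childNodes as $child) {
        if ($child instanceof DOMElement && strtolower($child->tagName) === 'p') {
            $paragraphs[] = trim(preg_replace('/\s+/', ' ', $child->textContent));
        }
    }
    return implode("\n\n", $paragraphs);
}
```

Calling `extractMainText($html)` on a fetched page returns the paragraphs of the highest-scoring container, or an empty string when nothing qualifies. A real implementation would need more signals (class and id name hints, nested containers, sibling merging), so treat this as a starting point rather than a finished extractor.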

This answer was selected by the asker as the best answer.
