When strip_tags() burns the haystack

I have a list of websites, one for each member of the US Congress, that I'm crawling programmatically to scrape addresses. The sites vary in their underlying markup, which wasn't a problem at first, until I noticed that hundreds of them were not producing the expected results from the script I had written.

After taking more time to evaluate potential causes, I found that calling strip_tags() on the result of file_get_contents() was, in many cases, erasing most of the page source. It wasn't only removing the HTML; it was also removing the non-HTML text that I wanted to scrape!

So I removed the call to strip_tags(), substituted a call that removes all non-alphanumeric characters, and ran the process again. It turned up more results but still missed many, this time because my regular expressions weren't matching the desired patterns. Looking at the returned text, I realized that remnants of HTML attributes were interspersed throughout it, breaking my patterns.

Is there a way around this? Is it the result of malformed HTML? Can I do anything about it?

Answer by duanou1904, 2009-12-29 01:21
There's a warning in the PHP manual that reads:

Because strip_tags() does not actually validate the HTML, partial or broken tags can result in the removal of more text/data than expected.
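
To make that warning concrete, here is a minimal sketch of the failure mode. The markup strings are made up for illustration; the only difference between them is a missing ">" on the br tag.

```php
<?php
// Hypothetical markup: in $broken the <br tag is missing its closing ">".
$intact = 'Office: 123 Main St <br> Phone: 555-0100';
$broken = 'Office: 123 Main St <br Phone: 555-0100';

// Well-formed input: only the tag itself is removed.
$fromIntact = strip_tags($intact);

// Broken input: strip_tags() keeps scanning for a ">" that never comes,
// so everything after "<br" is discarded as if it were tag content.
$fromBroken = strip_tags($broken);

var_dump($fromIntact, $fromBroken);
```

The phone number survives in the first result and is gone from the second, which matches the "most of the page disappears" symptom described in the question.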

Since you are scraping many different sites, and you can't account for the validity of their HTML, this will always be a problem. Unfortunately, regexps aren't going to do it for you either, as regexps simply aren't cut out to be document parsers.

I would use something like PHP Simple HTML DOM Parser, or even the built-in DOMDocument->loadHTML() method.
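
As a rough sketch of the DOMDocument route, assuming the DOM extension is enabled: the HTML fragment, the "contact" class name, and the XPath expression below are all hypothetical and would differ for each site.

```php
<?php
// Hypothetical, slightly malformed fragment: unquoted attribute, unclosed <p>.
$html = '<html><body><div class="contact" id=addr>'
      . '<p>123 Main St<br>Springfield, IL 62701</p>'
      . '<p>Phone: 555-0100</div></body></html>';

// Collect parser warnings internally instead of emitting them;
// real-world pages tend to trigger many of them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$lines = [];

// Text nodes hold character data only, so attribute remnants never leak in.
foreach ($xpath->query('//div[@class="contact"]//text()') as $node) {
    $text = trim($node->nodeValue);
    if ($text !== '') {
        $lines[] = $text;
    }
}

print_r($lines);
```

Because the parser repairs the markup into a tree first, the regular expressions only ever see text content, not leftover tag or attribute fragments.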

You could keep a small database that records each page you want to scrape and where the information is found in the structure of that page. Each time you scrape a page, you could do a quick check to see whether its structure has changed; if it has, you could update your database with the new path location for your DOM parser and pick the data up on the next scrape.
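
One possible shape for that check is sketched below. The array stands in for the database, and the URL, the XPath, the sample pages, and the extractAtPath() helper are all hypothetical names chosen for illustration.

```php
<?php
// Hypothetical stand-in for the "small database": page URL => XPath of the address.
$knownPaths = [
    'https://example.gov/member-a' => '//div[@id="contact"]/p',
];

/**
 * Returns the text found at the stored path, or null when the path no
 * longer matches anything, which suggests the page structure changed.
 */
function extractAtPath(string $html, string $path): ?string
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $nodes = (new DOMXPath($doc))->query($path);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Inline sample pages stand in for file_get_contents($url).
$oldLayout = '<html><body><div id="contact"><p>123 Main St</p></div></body></html>';
$newLayout = '<html><body><div class="footer"><address>123 Main St</address></div></body></html>';

$path   = $knownPaths['https://example.gov/member-a'];
$before = extractAtPath($oldLayout, $path); // stored path still matches
$after  = extractAtPath($newLayout, $path); // null: flag this page for a path update

var_dump($before, $after);
```

A null result is the signal to revisit that page by hand and store a new path, rather than silently scraping nothing.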

This answer was selected by the asker as the best answer.

When strip_tags() burns the haystack (tags: html, php), 2009-12-29 01:15
