dqf42223 2009-12-29 01:15

When strip_tags() burns the haystack

I've got a list of websites for each US Congress member that I'm programmatically crawling to scrape addresses. Many of the sites vary in their underlying markup, but this wasn't initially a problem until I started seeing that hundreds of sites were not giving the expected results for the script I had written.

After taking some more time to evaluate potential causes, I found that calling strip_tags() on the results of file_get_contents() was, in many cases, erasing most of the page source! It was not only removing the HTML, it was also removing the non-HTML text that I wanted to scrape!

So I removed the call to strip_tags(), substituted a call to remove all non-alphanumeric characters and gave the process another run. It turned up other results, but still lacked many. This time it was because my regular expressions weren't matching the desired patterns. After looking at the returned code, I realized that I had the remnants of HTML attributes interspersed throughout the text, breaking my patterns.

Is there a way around this? Is it the result of malformed HTML? Can I do anything about it?

2 answers

  • duanou1904 2009-12-29 01:21

    There's a warning in the PHP manual that reads:

    Because strip_tags() does not actually validate the HTML, partial, or broken tags can result in the removal of more text/data than expected.
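
    To make that warning concrete, here is a minimal sketch of the failure. The markup is a made-up fragment (not taken from any of the sites in question) in which a tag is never closed with ">":

    ```php
    <?php
    // Hypothetical fragment: the <span> tag is missing its closing ">".
    $html = 'Office: <span class="addr" 123 Main St, Washington, DC 20515';

    $stripped = strip_tags($html);

    // strip_tags() treats everything after the unterminated "<span" as part
    // of the tag, so the address text is discarded along with it.
    var_dump($stripped);
    var_dump(strpos($stripped, '123 Main St') === false);
    ```

    One broken tag like this anywhere in a page is enough to lose the text that follows it, which would explain why large parts of some pages disappear.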

    Since you are scraping many different sites, and you can't account for the validity of their HTML, this will always be a problem. Unfortunately, regexps aren't going to do it for you either, as regexps simply aren't cut out to be document parsers.

    I would use something like PHP Simple HTML DOM Parser, or even the built-in DOMDocument::loadHTML() method.
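
    As a rough sketch of the DOMDocument route: the markup, the "contact" id and the XPath expression below are assumptions for illustration only, since each real site will need its own path.

    ```php
    <?php
    // Hypothetical sloppy markup (the <p> is never closed), standing in for
    // what file_get_contents() would return for one page.
    $html = '<html><body><div id="contact"><p>123 Main St<br> Washington, DC 20515</div></body></html>';

    $doc = new DOMDocument();
    // Real-world pages are often invalid; collect parser warnings quietly.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//div[@id="contact"]');

    $text = '';
    if ($nodes !== false && $nodes->length > 0) {
        // textContent is the text of the node and its descendants, without tags.
        $text = trim(preg_replace('/\s+/', ' ', $nodes->item(0)->textContent));
    }
    echo $text, PHP_EOL;
    ```

    Because the parser repairs the document tree before you query it, leftover attribute fragments do not end up mixed into the text the way they do when tags are removed with string functions.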

    You could keep a small database that recorded each page you wanted to scrape, and where the information was found in the structure of that page. Each time you scraped it, you could do a quick check to see if the structure had changed, in which case you could update your database with the new path location for your DOM parser, and get it on the next scrape.
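
    A sketch of that idea, assuming PHP 7.1 or later: the array stands in for the database, and the URL, the XPath and the helper name extractAt() are all hypothetical.

    ```php
    <?php
    // Hypothetical stand-in for the small database: page URL => XPath of the address block.
    $paths = [
        'https://example.gov/contact' => '//div[@id="contact"]',
    ];

    /**
     * Return the text found at $path, or null when the path no longer matches,
     * which suggests the page structure changed and the stored path needs updating.
     */
    function extractAt(string $html, string $path): ?string
    {
        $doc = new DOMDocument();
        $previous = libxml_use_internal_errors(true);
        $loaded = $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
        if (!$loaded) {
            return null;
        }

        $nodes = (new DOMXPath($doc))->query($path);
        if ($nodes === false || $nodes->length === 0) {
            return null;
        }
        return trim($nodes->item(0)->textContent);
    }

    // Inline HTML is used here instead of a live fetch so the sketch runs offline.
    $before = '<html><body><div id="contact">123 Main St</div></body></html>';
    $after  = '<html><body><section class="office">123 Main St</section></body></html>';

    $path = $paths['https://example.gov/contact'];
    var_dump(extractAt($before, $path)); // the stored path still matches
    var_dump(extractAt($after, $path));  // null: flag this page for review
    ```

    A null result is the cue to inspect the page and store a new path before the next scrape.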
