Extract text from a DIV that recurs across multiple pages of a website, then output it to .txt?

Just to note from the start, the content is uncopyrighted and I would like to automate the process of acquiring the text for the purpose of a project.

I'd like to extract the text from a particular, recurring DIV (it has its own 'class', in case that makes it easier) that sits in each page of a simply designed website.

There is a single archive page on the site with a list of all of the pages containing the content I would like.

The site is www.zenhabits.net

I imagine this could be achieved with some sort of script, but have no idea where to start.

I appreciate any help.

-Nathan.


dongpao1905 2012-04-24 13:19


This is pretty straightforward.

Firstly, get all the links from this site, and throw them all into an array:

set_time_limit(0);        // this could take a while...
ignore_user_abort(true);  // keep running in case the browser times out

$links = [];              // collected article URLs
$html_output = file_get_contents("http://zenhabits.net/archives/");

# -- Run preg_match_all on the HTML and grab all links
# -- (note: (.*) is greedy, so it runs to the last "> on the line)
if ($html_output !== false && preg_match_all('/<a href="http:\/\/zenhabits\.net\/(.*)">/', $html_output, $matches)) {
    # -- Append data to array
    foreach ($matches[1] as $secLink) {
        $links[] = "http://zenhabits.net/" . $secLink;
    }
}
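
As a possible alternative to the regex above, the links could be collected with PHP's DOM extension, which copes better with extra attributes on the anchor tags. This is only a sketch: `extractLinks()` and `$sampleHtml` are made-up names for illustration, and the sample markup is an assumption, not the real archive page.

```php
<?php
// Sketch: collect site links with DOMDocument instead of a regex.
// extractLinks() and $sampleHtml are illustrative, not from the answer.

function extractLinks(string $html, string $base): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so silence parser warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        // Keep only links into the site, and skip the bare home page.
        if (strpos($href, $base) === 0 && $href !== $base) {
            $links[$href] = true; // array keys de-duplicate
        }
    }
    return array_keys($links);
}

// Assumed sample markup standing in for the archive page.
$sampleHtml = '<html><body>'
    . '<a href="http://zenhabits.net/">Home</a>'
    . '<a class="x" href="http://zenhabits.net/first-post/">First</a>'
    . '<a href="http://example.com/elsewhere/">Other</a>'
    . '<a href="http://zenhabits.net/first-post/">First again</a>'
    . '</body></html>';

$links = extractLinks($sampleHtml, 'http://zenhabits.net/');
print_r($links);
```

With the real site you would pass the result of `file_get_contents("http://zenhabits.net/archives/")` as the first argument instead of the sample string.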

I tested this for you, and:

// the first 3 entries return something weird, but you don't need them, so I'll remove them
unset($links[0]);
unset($links[1]);
unset($links[2]);

Now that's all done, time to go through all of THOSE links (in the array $links) and take each page's content:

$contentFromPage = [];

foreach ($links as $contLink) {
    $html_output_c = file_get_contents($contLink);

    # -- Note: (.*) is greedy and, with the s modifier, runs to the last </div> on the page
    if ($html_output_c !== false && preg_match('|<div class="post">(.*)</div>|s', $html_output_c, $c_matches)) {
        # -- Append data to array
        echo "data found <br>";
        $contentFromPage[] = $c_matches[1];
    } else {
        echo "no content found in: $contLink -- <br><br><br>";
    }
} // end of foreach
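
Because a regex cannot reliably find the closing tag that belongs to a given DIV when other DIVs are nested inside it, one option is to let a DOM parser find the element and return its text. The following is a minimal sketch under assumptions: `extractPostText()` and `$samplePage` are hypothetical, and the sample markup only imitates a page with a `post` DIV.

```php
<?php
// Sketch: pull the text of <div class="post"> with DOMXPath.
// extractPostText() and $samplePage are illustrative, not from the answer.

function extractPostText(string $html): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match "post" as a whole word inside the class attribute.
    $query = '//div[contains(concat(" ", normalize-space(@class), " "), " post ")]';
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null; // no matching DIV on this page
    }
    // textContent drops the tags and keeps the text of nested elements.
    return trim($nodes->item(0)->textContent);
}

// Assumed sample page with a nested DIV inside the post.
$samplePage = "<html><body><div class=\"post\"><h2>Title</h2>\n"
    . "<div class=\"inner\"><p>Body text.</p></div></div>\n"
    . "<div class=\"footer\">Footer</div></body></html>";

$text = extractPostText($samplePage);
echo $text, PHP_EOL;
```

This returns plain text rather than HTML, which is closer to what the question asks for; the nested `inner` DIV is included and the `footer` DIV is not.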

I've basically just written a whole crawler script for you..

And now, loop over the content array and do whatever you want with it (here we'll put each item into a text file):

// $contentFromPage now contains the content of every div class="post" (in an array), so do what you want with it

foreach ($contentFromPage as $content) {
    # -- We need a name for each text file --
    $textName = rand() . "_content_" . rand() . ".txt"; // we'll just use some numbers and text

    // define the file path (where you want the txt file to be saved)
    $path = "../"; // we'll just put it in the folder above the script
    $full_path = $path . $textName;

    // now save the file
    file_put_contents($full_path, $content);
} // end of foreach
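
Random file names make it hard to tell which file came from which page, and the saved content still contains HTML tags. One way around both, sketched here under assumptions, is to name each file after the page's URL slug and strip the tags before writing. `slugFileName()` and the `$pages` array are hypothetical; the sketch assumes the content is keyed by URL, which the `$contentFromPage` array above is not.

```php
<?php
// Sketch: name each text file after the page's URL slug and save plain text.
// slugFileName() and $pages are illustrative, not part of the answer.

function slugFileName(string $url): string
{
    $path = (string) parse_url($url, PHP_URL_PATH);
    // "/first-post/" becomes "first-post"; other characters become "-".
    $slug = trim($path, '/');
    $slug = (string) preg_replace('/[^A-Za-z0-9_-]+/', '-', $slug);
    return ($slug === '' ? 'index' : $slug) . '.txt';
}

// Assumed input: page URL => extracted HTML of the post DIV.
$pages = [
    'http://zenhabits.net/first-post/' => '<p>Hello <b>world</b></p>',
];

$dir = sys_get_temp_dir(); // assumption: any writable folder works
foreach ($pages as $url => $html) {
    $file = $dir . DIRECTORY_SEPARATOR . slugFileName($url);
    // strip_tags() leaves plain text; then decode entities such as &amp;.
    $plain = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    file_put_contents($file, $plain);
    echo "wrote $file", PHP_EOL;
}
```

To use this with the loop above, you could store each match as `$contentFromPage[$contLink] = $c_matches[1];` so that the URL is available when the files are written.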

This answer was selected by the asker as the best answer.
