dongshao2967 2012-04-23 20:55

Using cURL to get information from an HTML table

I need to get some information about some plants and put it into a MySQL table. My knowledge of cURL and the DOM is close to nil, but I've come up with this:

    set_time_limit(0); // no PHP execution time limit
    include('simple_html_dom.php');

    $ch = curl_init("http://davesgarden.com/guides/pf/go/1501/");

    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.1) Gecko/2008070208 Firefox/3.0.1");
    curl_setopt($ch, CURLOPT_HTTPHEADER, array("Accept-Language: es-es,en"));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow redirects
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 0);        // 0 = no cURL timeout
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // disables certificate checks; not needed for this http:// URL
    $data = curl_exec($ch);
    curl_close($ch);

    $html = str_get_html($data);

    // find() indexes from 0, so this is the ninth <table> on the page
    $e = $html->find("table", 8);

    echo $e->innertext;

Now I'm really lost about how to move on from this point. Can you please guide me?

Thanks!

4 answers

  • douyimiao1993 2012-04-23 22:08
    This is a mess.

    But at least it's a (somewhat) consistent mess.

    If this is a one-time extraction and not a rolling project, personally I'd use a quick-and-dirty regex on this instead of simple_html_dom. You'll be there all day twiddling with the tags otherwise.

    For example, this regex pulls out the majority of title/data pairs:

    // "~" is the delimiter, so the "/" in the closing tags does not end the pattern early
    $pattern = '~<b>(.*?)</b>\s*<br>(.*?)</?(td|p)>~si';
    

    You'll need to do some pre- and post-cleaning before it gets them all, though.
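    As a rough sketch of how such a pattern might be applied, the snippet below runs it with preg_match_all over a small, made-up HTML fragment and collects the title/data pairs into an array. The sample markup and the $sample and $pairs names are assumptions for illustration only; the real page's markup will differ and will likely need extra cleaning. The pattern uses "~" as its delimiter so that the "/" in the closing tags does not have to be escaped.

```php
<?php
// Hypothetical sample markup imitating "<b>Title</b><br>value" cells;
// the real page's HTML will differ.
$sample = <<<HTML
<table>
  <tr>
    <td><b>Family:</b> <br>Rosaceae</td>
    <td><b>Genus:</b><br>Rosa<p>
  </tr>
</table>
HTML;

// "~" is the delimiter, so "/" inside the closing tags needs no escaping.
// s = "." also matches newlines, i = case-insensitive tag names.
$pattern = '~<b>(.*?)</b>\s*<br>(.*?)</?(td|p)>~si';

$pairs = [];
if (preg_match_all($pattern, $sample, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // Post-cleaning: strip leftover tags, decode entities, trim whitespace,
        // and drop the trailing colon from the title.
        $title = rtrim(trim(html_entity_decode(strip_tags($m[1]))), ':');
        $value = trim(html_entity_decode(strip_tags($m[2])));
        $pairs[$title] = $value;
    }
}

print_r($pairs);
```

    With the pairs in an array like this, each one could then be written to a MySQL table with a prepared statement, which is left out here because the table layout is not given.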

    I don't envy you having this task...
