dtx9931 2013-06-30 13:18
Views: 22

Writing a web bot [closed]

Today it occurred to me to write a web bot (crawler, spider, etc.) in PHP that crawls only news websites. I first read some articles about crawlers and then ran into this issue:

How can a bot recognize that a URL, post, article, or piece of text is news-related?

The only solution I came up with is to check the content for particular keywords, but I don't think that is a good, workable practice. At the very least, it is not perfect.

So any ideas about better solutions are appreciated.


2 answers

  • dougan1465 2013-06-30 13:21

    You could use preg_match to match the keywords; the technique is simple and it works:

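    // Sample text to classify.
    $text = "News: Flooding is expected today";
    // preg_match() returns 1 on a match, 0 on no match, and false on error.
    $news_found = preg_match("/(news|sensation|discovery)/i", $text) === 1;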

    There is no reason to think that this is not a good solution.
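
The question worries that a single keyword check is too crude. One possible refinement, still based on preg_match, is to count how many distinct keywords appear as whole words and require a minimum number of hits. This is only a sketch: the helper names countKeywordHits and looksLikeNews, the keyword list, and the threshold are assumptions made for illustration and do not come from the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): counts how many
// distinct keywords occur in the text as whole words, case-insensitively.
function countKeywordHits(string $text, array $keywords): int
{
    $hits = 0;
    foreach ($keywords as $keyword) {
        // preg_quote() escapes any regex metacharacters in the keyword.
        $pattern = '/\b' . preg_quote($keyword, '/') . '\b/i';
        if (preg_match($pattern, $text) === 1) {
            $hits++;
        }
    }
    return $hits;
}

// Hypothetical rule: treat the text as news when enough keywords match.
function looksLikeNews(string $text, array $keywords, int $minHits = 2): bool
{
    return countKeywordHits($text, $keywords) >= $minHits;
}

// Illustrative keyword list and threshold; both are assumptions to tune.
$keywords = ['news', 'breaking', 'reported', 'sensation', 'discovery'];
$text = 'News: Flooding is expected today';

var_dump(looksLikeNews($text, $keywords));    // only "News" matches, below the default threshold
var_dump(looksLikeNews($text, $keywords, 1)); // passes with a threshold of 1
```

The word boundaries (\b) keep "news" from matching inside words such as "newspaper", and raising the threshold trades recall for fewer false positives. It remains a heuristic, so the asker's doubt that keyword checks are "not perfect" still applies.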
