weixin_39652136
2020-12-01 23:30
crawler replacement
Hey, amazing project! I wish I had done more research before I began my quest.
As it stands, I have a functional crawler network that handles the queues, fetches content through a series of proxies, extracts the readability version of each article, and dumps it into Elasticsearch.
But I'd like to use Media Cloud's amazing search UI and CLIPPER, so I'm wondering whether my approach is correct and whether you can provide some insight on how to proceed.
Do I:
- write a custom Crawler::Handler, pipe the raw HTML plus the extracted metadata and content to it, and let it handle the rest of the logic?
- write an exporter to match MediaWords::ImportStories::ScrapeHTML and let it handle imports from there?
- or is there a better approach?
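For what it's worth, the first option boils down to introducing a "handler" seam: the existing crawler keeps its queue and proxy logic, and each fetched page is passed to a pluggable sink that could later be swapped from Elasticsearch to a Media Cloud import path. The sketch below is purely illustrative — every name in it (FetchedPage, InMemoryIndexHandler, crawl_one) is hypothetical and none of it is Media Cloud's actual API.

```python
# Hypothetical sketch of the handler seam described above. The crawler's
# fetch/queue/proxy machinery is omitted; only the hand-off is shown.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FetchedPage:
    url: str
    raw_html: str
    extracted_text: str          # e.g. the "readability" version of the article
    metadata: dict = field(default_factory=dict)


# A handler is just a callable that consumes one fetched page.
Handler = Callable[[FetchedPage], None]


class InMemoryIndexHandler:
    """Stand-in for an Elasticsearch (or, later, a Media Cloud import) sink."""

    def __init__(self) -> None:
        self.docs: list[dict] = []

    def __call__(self, page: FetchedPage) -> None:
        # Flatten the page into the document shape the sink expects.
        self.docs.append({
            "url": page.url,
            "content": page.extracted_text,
            **page.metadata,
        })


def crawl_one(url: str, raw_html: str,
              extract: Callable[[str], str], handler: Handler) -> None:
    """Pipe already-fetched raw HTML through extraction into the handler."""
    page = FetchedPage(
        url=url,
        raw_html=raw_html,
        extracted_text=extract(raw_html),
        metadata={"source": "my-crawler"},
    )
    handler(page)


handler = InMemoryIndexHandler()
crawl_one("https://example.com/a", "<html><p>Body</p></html>",
          extract=lambda html: "Body", handler=handler)
print(handler.docs[0]["url"])  # https://example.com/a
```

With this shape, switching sinks means writing one new handler class rather than touching the crawler itself, which is presumably the appeal of the Crawler::Handler option.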
Ideally I'd like the path of least resistance; my Perl is entry-level at best.
Any help would be appreciated. Thanks again for the amazing work!
This question comes from the open-source project: mediacloud/backend