weixin_39652136
2020-12-01 23:30

crawler replacement

Hey, amazing project. I wish I had done more research before I began my quest.

As it stands, I have a functional crawler network that handles the queues through a series of proxies, grabs the content, extracts the readability version of each article, and dumps it into Elasticsearch.

But I'd like to use Media Cloud's amazing search UI and CLIPPER. I'm wondering whether my approach is correct and whether you can provide some insight on how to proceed.

Do I:

- write a custom Crawler::Handler, pipe the raw HTML plus the extracted metadata and content to it, and let it handle the rest of the logic?
- make an exporter that matches MediaWords::ImportStories::ScrapeHTML and let it handle imports from there?
- take a better approach?

Ideally I'd like the path of least resistance; my Perl is entry level at best.

Any help would be appreciated. Thanks again for the amazing work!

This question comes from the open-source project: mediacloud/backend
