donglian2106 2010-08-27 18:08
Viewed 54 times
Accepted

How do I build a daemon that parses multiple source feeds and consolidates the data?

I have been given the task of writing a script (or better yet, a daemon) that has to do several things:

  1. Crawl the most recent data from several input XML feeds. There are 15-20 feeds for the time being, but I believe the number might go up to 50 in the future. Feed size varies between 500 KB and 5 MB (it most likely won't go over 10 MB). Since the feeds are not in a standardized format, there has to be a parser for each feed from a given source, so that the data is unified into a single, common format.
  2. Store the data in a database, in such a way that every single unit of data extracted from the feeds is still available.
  3. Since the data changes over time (information is updated at least once per hour), keep an archive of the changed data.
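The three steps above can be sketched roughly as follows. This is only an illustration under my own assumptions: the parser, the record fields, and the schema are all hypothetical, and sqlite3 stands in for whatever database is actually used. The key idea is one parser per source emitting a common record shape, and an append-only table keyed by fetch time so every historical version survives (step 3).

```python
import sqlite3
import time

# Hypothetical per-source parser: each one must emit dicts in a common
# shape (field names here are illustrative assumptions, not a real feed).
def parse_feed_a(xml_text):
    # real parsing with xml.etree.ElementTree would go here
    return [{"source": "feed_a", "item_id": "1", "payload": xml_text[:20]}]

PARSERS = {"feed_a": parse_feed_a}  # one parser per feed source

def store(conn, records):
    # Append-only: every fetched version is kept with its timestamp,
    # so nothing is overwritten and the change history is preserved.
    conn.executemany(
        "INSERT INTO items (source, item_id, payload, fetched_at)"
        " VALUES (?, ?, ?, ?)",
        [(r["source"], r["item_id"], r["payload"], time.time())
         for r in records],
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (source TEXT, item_id TEXT,"
    " payload TEXT, fetched_at REAL)")
store(conn, parse_feed_a("<feed><item>hello</item></feed>"))
print(conn.execute("SELECT COUNT(*) FROM items").fetchone()[0])  # 1
```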

One other thing that has proven difficult to manage (I have already hacked together a solution) is that during step 2 the database slows to a crawl because of the volume of SQL queries inserting data into several tables, which affects the rest of the system that relies on the database (it's a dedicated server with several sites hosted on it). And I couldn't even get to step 3...

Any hints on how I should approach this problem? Caveats to pay attention to? Anything that helps me solve this is more than welcome.

Thanks!


1 answer

  • dongpu8935 2010-08-27 18:18

    Some of my ideas:

    1. If your database supports transactions, you can devise a clever way to use them. I've only experimented with transactions, but they're said to increase insert speed by up to 40% (mysql.com), and they don't lock tables.

    2. You can append the data to a temp file, even in an SQL-friendly format, and load it into the database all at once. Using LOAD DATA INFILE is usually around 20 times faster (per the MySQL docs); I've used it to insert over a million entries, and it was pretty quick.

    3. Set up some kind of queuing system.

    4. Put a sleep or wait between queries (in Python, time.sleep(1) makes the process wait one second).
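A minimal sketch of idea 1, batching many inserts into a single transaction instead of committing per row. This is only a demonstration of the pattern: sqlite3 is used so the snippet is self-contained, and the table name and row count are my own choices; with MySQL the same batching applies, and LOAD DATA INFILE (idea 2) would need a running server, so it is not shown here.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feed_items (id INTEGER, data TEXT)")
rows = [(i, "payload-%d" % i) for i in range(10000)]

# Slow pattern: one implicit commit per INSERT statement.
# Fast pattern: one explicit transaction around the whole batch,
# so the commit overhead is paid once instead of 10000 times.
t0 = time.time()
with conn:  # opens a transaction, commits once on exit
    conn.executemany("INSERT INTO feed_items VALUES (?, ?)", rows)
print("inserted %d rows in %.3fs" % (10000, time.time() - t0))
```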

    I'm not exactly sure which DB you're using, but here are some pointers on optimizing inserts:
    http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
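Ideas 3 and 4 can be sketched together: a queue decouples the feed fetchers from a single writer, and that writer paces itself so the database isn't saturated while the hosted sites are using it. Everything here (queue name, sleep interval, sentinel shutdown) is an illustrative assumption, with a list standing in for the actual batched INSERT.

```python
import queue
import threading
import time

work = queue.Queue()
written = []

def writer():
    # Single consumer: drains the queue and pauses between writes so
    # concurrent site traffic still gets database time (idea 4).
    while True:
        item = work.get()
        if item is None:          # sentinel: shut down cleanly
            work.task_done()
            break
        written.append(item)      # stand-in for the real INSERT batch
        work.task_done()
        time.sleep(0.001)         # throttle; tune for your actual load

t = threading.Thread(target=writer)
t.start()
for i in range(100):              # producers: feed parsers enqueue records
    work.put({"item_id": i})
work.put(None)
t.join()
print(len(written))  # 100
```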

    This answer was selected as the accepted answer by the asker.