donglian2106 2010-08-27 18:08
Viewed 54 times
Accepted

How do I build a daemon that parses multiple source feeds and consolidates the data?

I have been given the task of writing a script (or better yet, a daemon) that has to do several things:

  1. Crawl the most recent data from several input XML feeds. There are 15-20 feeds for the time being, but I believe the number might go up to 50 in the future. Feed size varies between 500 KB and 5 MB (it most likely won't go over 10 MB). Since the feeds are not in a standardized format, there has to be a parser for each feed from a given source, so that the data is unified into a single, common format.
  2. Store the data in a database, in such a way that every single unit of data extracted from the feeds is still available.
  3. Since the data changes over time (information is updated at least once per hour), keep an archive of the changed data.
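The three steps above can be sketched roughly as follows. This is only an illustration under my own assumptions: the parser, the record fields, and the schema are all hypothetical, and sqlite3 stands in for whatever database is actually used. The key idea is one parser per source emitting a common record shape, and an append-only table keyed by fetch time so every historical version survives (step 3).

```python
import sqlite3
import time

# Hypothetical per-source parser: each one must emit dicts in a common
# shape (field names here are illustrative assumptions, not a real feed).
def parse_feed_a(xml_text):
    # real parsing with xml.etree.ElementTree would go here
    return [{"source": "feed_a", "item_id": "1", "payload": xml_text[:20]}]

PARSERS = {"feed_a": parse_feed_a}  # one parser per feed source

def store(conn, records):
    # Append-only: every fetched version is kept with its timestamp,
    # so nothing is overwritten and the change history is preserved.
    conn.executemany(
        "INSERT INTO items (source, item_id, payload, fetched_at)"
        " VALUES (?, ?, ?, ?)",
        [(r["source"], r["item_id"], r["payload"], time.time())
         for r in records],
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (source TEXT, item_id TEXT,"
    " payload TEXT, fetched_at REAL)")
store(conn, parse_feed_a("<feed><item>hello</item></feed>"))
print(conn.execute("SELECT COUNT(*) FROM items").fetchone()[0])  # 1
```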

One other thing that has proven difficult to manage (I have already hacked together a solution) is that during step 2 the database slows to a crawl because of the volume of SQL queries inserting data into several tables, which affects the rest of the system that relies on the database (it's a dedicated server with several sites hosted on it). And I couldn't even get to step 3...

Any hints on how I should approach this problem? Caveats to pay attention to? Anything that helps me solve this is more than welcome.

Thanks!


1 answer

  • dongpu8935 2010-08-27 18:18

    Some of my ideas:

    1. If your database supports transactions, you can devise a clever way to use them. I've only experimented with transactions, but they're said to increase insert speed by up to 40% (mysql.com), and they don't lock tables.

    2. You can append the data to a temp file, even in an SQL-friendly format, and load it into the database all at once. Using LOAD DATA INFILE is usually around 20 times faster (per the MySQL docs); I've used it to insert over a million entries, and it was pretty quick.

    3. Set up some kind of queuing system.

    4. Put a sleep or wait between queries (in Python, time.sleep(1) makes the process wait one second).
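A minimal sketch of idea 1, batching many inserts into a single transaction instead of committing per row. This is only a demonstration of the pattern: sqlite3 is used so the snippet is self-contained, and the table name and row count are my own choices; with MySQL the same batching applies, and LOAD DATA INFILE (idea 2) would need a running server, so it is not shown here.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feed_items (id INTEGER, data TEXT)")
rows = [(i, "payload-%d" % i) for i in range(10000)]

# Slow pattern: one implicit commit per INSERT statement.
# Fast pattern: one explicit transaction around the whole batch,
# so the commit overhead is paid once instead of 10000 times.
t0 = time.time()
with conn:  # opens a transaction, commits once on exit
    conn.executemany("INSERT INTO feed_items VALUES (?, ?)", rows)
print("inserted %d rows in %.3fs" % (10000, time.time() - t0))
```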

    I'm not exactly sure which DB you're using, but here are some pointers on optimizing inserts:
    http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
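Ideas 3 and 4 can be sketched together: a queue decouples the feed fetchers from a single writer, and that writer paces itself so the database isn't saturated while the hosted sites are using it. Everything here (queue name, sleep interval, sentinel shutdown) is an illustrative assumption, with a list standing in for the actual batched INSERT.

```python
import queue
import threading
import time

work = queue.Queue()
written = []

def writer():
    # Single consumer: drains the queue and pauses between writes so
    # concurrent site traffic still gets database time (idea 4).
    while True:
        item = work.get()
        if item is None:          # sentinel: shut down cleanly
            work.task_done()
            break
        written.append(item)      # stand-in for the real INSERT batch
        work.task_done()
        time.sleep(0.001)         # throttle; tune for your actual load

t = threading.Thread(target=writer)
t.start()
for i in range(100):              # producers: feed parsers enqueue records
    work.put({"item_id": i})
work.put(None)
t.join()
print(len(written))  # 100
```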

    This answer was selected as the accepted answer by the asker.