donglie7268 2009-06-16 12:25 · acceptance rate: 100%
Viewed 73 times
Accepted

Wikipedia integration problem - need to get this finally solved

Sorry guys, I've been running amok asking questions on how to integrate Wikipedia data into my application, and frankly I don't think I've had any success so far; I've been trying all the ideas and kind of giving up whenever I hit a dead end or obstacle. I'll try to explain exactly what I am trying to do here.

I have a simple directory of locations like cities and countries. My application is a simple PHP application with an AJAX front end and a search and browse facility. People sign up and associate themselves with a city, and when a user browses cities, he/she can see the people and companies in that city, i.e. whoever is part of our system.

That part is set up easily enough on its own and is working fine. The issue is the format of my search results: say someone searches for, let's say, Beijing. It should return a three-tabbed interface box:

  1. The first tab would have an infobox containing city information for Beijing
  2. The second would be a country tab holding an infobox with country information for China
  3. The third tab would list all contacts in Beijing.

The content for the first two tabs should come from Wikipedia. Now I'm totally lost as to the best way to get this done, and furthermore, once I decide on a methodology, how do I implement it so that it's reasonably robust?

A couple of ideas, good and bad, that I have been able to digest so far:

  1. Run a cURL request directly to Wikipedia and parse the returned data every time a search is made. There is no need to maintain a local copy of the Wikipedia data in this case. The issue is that it is wholly reliant on data from a remote third party, and I doubt it is feasible to make a request to Wikipedia every time just to retrieve basic information. Plus, considering that the Wikipedia data has to be parsed on every request, that's going to add up to heavy server load... or am I just speculating here?

  2. Download the Wikipedia dump and query that. Well, I've downloaded the entire database, but it's going to take forever to import all the tables from the XML dump. Plus, considering that I just want to extract a list of countries and cities and their infoboxes, a lot of the information in the dump is of no use to me.

  3. Make my own local tables and create a cron script (I'll explain why a cron job below) that would somehow parse all the country and city pages on Wikipedia and convert them to a format I can use in my tables. Honestly speaking, I don't need all of the information in the infoboxes; in fact, even just the basic markup of each infobox as-is would be more than enough for me. Like:

Title of Country | Infobox Raw text

I can extract things like coordinates and other details myself if I want (a rough sketch of this follows below).
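For what it's worth, here is a rough sketch of what options 1 and 3 could boil down to: pulling a page's raw wikitext through the MediaWiki API (api.php with action=query&prop=revisions&rvprop=content) and cutting out the first {{Infobox ...}} block as-is. The helper names and the naive brace-matching are my own illustration rather than a finished implementation, and the JSON layout shown (query.pages.<id>.revisions[0]['*']) is the classic response format, so verify it against the current API before relying on it:

```php
<?php
// Hypothetical sketch: fetch the raw wikitext of a page from the MediaWiki API
// and cut out the first {{Infobox ...}} template as-is.
function fetch_wikitext($title) {
    $url = 'https://en.wikipedia.org/w/api.php?' . http_build_query(array(
        'action' => 'query',
        'prop'   => 'revisions',
        'rvprop' => 'content',
        'titles' => $title,
        'format' => 'json',
    ));

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Wikipedia asks for an identifying User-Agent; adjust to your own details.
    curl_setopt($ch, CURLOPT_USERAGENT, 'MyDirectoryApp/0.1 (contact@example.com)');
    $json = curl_exec($ch);
    curl_close($ch);

    $data = json_decode($json, true);
    $page = array_shift($data['query']['pages']);
    return isset($page['revisions'][0]['*']) ? $page['revisions'][0]['*'] : null;
}

// Very naive brace-counting extraction of the first infobox template.
function extract_infobox($wikitext) {
    $start = stripos($wikitext, '{{Infobox');
    if ($start === false) return null;

    $depth = 0;
    for ($i = $start; $i < strlen($wikitext) - 1; $i++) {
        if (substr($wikitext, $i, 2) === '{{') { $depth++; $i++; }
        elseif (substr($wikitext, $i, 2) === '}}') {
            $depth--; $i++;
            if ($depth === 0) return substr($wikitext, $start, $i - $start + 1);
        }
    }
    return null;
}

// Usage: store 'Beijing' alongside extract_infobox(fetch_wikitext('Beijing')).
```

The same extraction function also covers the cron approach in option 3: loop over a list of titles and store "title | raw infobox text" in a local table instead of fetching on every search.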

I even tried downloading third-party datasets from Infochimps and DBpedia, but the dataset from Infochimps is incomplete and didn't contain all the information I wanted to display. As for DBpedia, I have absolutely no idea what to do with the CSV file of infoboxes I downloaded, and I'm afraid it might not be complete either.
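If the DBpedia infobox extract does turn out to be usable, getting a CSV into a local table is fairly mechanical. A minimal sketch, assuming a hypothetical two-column file (title, infobox text) and a hypothetical wiki_infoboxes table with a unique key on title (which REPLACE INTO relies on):

```php
<?php
// Hypothetical sketch: bulk-load a two-column CSV (title, infobox text) into MySQL.
// Assumes wiki_infoboxes(title, raw_infobox) exists with a unique key on title.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8', 'user', 'pass');
$stmt = $pdo->prepare('REPLACE INTO wiki_infoboxes (title, raw_infobox) VALUES (?, ?)');

$fh = fopen('infoboxes.csv', 'r');
while (($row = fgetcsv($fh)) !== false) {
    if (count($row) >= 2) {
        $stmt->execute(array($row[0], $row[1]));
    }
}
fclose($fh);
```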

But that is just part of the issue. I want a way to show the Wikipedia information - I'll have all the links point back to Wikipedia and display a nice summary from Wikipedia properly throughout - BUT I also need a way to periodically update the information I pull from Wikipedia, so that at least I don't end up with totally outdated data. For example, a system that can check whether there is a new country or new location and, if so, parse and retrieve its information somehow. I'm relying on Wikipedia's categories of countries and cities for this, but frankly all of these ideas are on paper or partially coded, and it's a huge mess.
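On the "spot new countries/cities" part: the MediaWiki API has a list=categorymembers module that a cron script can page through, comparing the returned titles against the local tables and queuing anything new for import. A rough sketch, with the category name purely as a hypothetical example and the continuation handling written for the API's current "continue" response format:

```php
<?php
// Hypothetical cron sketch: page through all members of a Wikipedia category
// so new articles can be queued for import/refresh.
function category_members($category) {
    $titles = array();
    $cmcontinue = null;

    do {
        $params = array(
            'action'  => 'query',
            'list'    => 'categorymembers',
            'cmtitle' => $category,
            'cmlimit' => 500,
            'format'  => 'json',
        );
        if ($cmcontinue !== null) $params['cmcontinue'] = $cmcontinue;

        // Requires allow_url_fopen + the openssl wrapper; swap in cURL if needed.
        $json = file_get_contents('https://en.wikipedia.org/w/api.php?' . http_build_query($params));
        $data = json_decode($json, true);

        foreach ($data['query']['categorymembers'] as $member) {
            $titles[] = $member['title'];
        }
        $cmcontinue = isset($data['continue']['cmcontinue']) ? $data['continue']['cmcontinue'] : null;
    } while ($cmcontinue !== null);

    return $titles;
}

// e.g. diff category_members('Category:Capitals in Asia') against your local table
// and queue anything new for a fetch + infobox-extraction pass.
```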

I'm programming in PHP and MySQL, and my deadline is fast approaching. Given the above situation and requirements, what is the best and most practical method to follow and implement? I'm totally open to ideas, and if anyone has done something similar, I would love to hear practical examples :D


4 answers

  • duanjiao5723 2009-06-16 13:44

    I'd suggest the following:

    • Query the city from Wikipedia when it (the city) is created in your DB
    • Parse the data and store a local copy along with the timestamp of the last update
    • On access, update the data if necessary. You can display the old version with a watermark saying it is ... days old and currently updating, then switch to the freshly acquired one when the update is done. You've said you are using AJAX, so that won't be a problem

    This would minimize the queries to Wikipedia, and your service won't show empty pages even when Wikipedia is unreachable.
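    A minimal sketch of that cache-then-refresh idea, assuming a hypothetical wiki_cache(title, raw_infobox, updated_at) table and a 7-day staleness window; the AJAX layer would render whatever comes back immediately and only trigger a refresh when the row is missing or stale:

```php
<?php
// Hypothetical sketch: serve the locally cached infobox, flagging it as stale
// when it is older than $maxAgeDays so the front end can trigger a refresh.
function get_cached_infobox(PDO $pdo, $title, $maxAgeDays = 7) {
    $stmt = $pdo->prepare('SELECT raw_infobox, updated_at FROM wiki_cache WHERE title = ?');
    $stmt->execute(array($title));
    $row = $stmt->fetch(PDO::FETCH_ASSOC);

    if ($row === false) {
        return array('status' => 'missing');   // nothing cached yet: fetch from Wikipedia now
    }

    $ageDays = (time() - strtotime($row['updated_at'])) / 86400;
    return array(
        'status'      => $ageDays > $maxAgeDays ? 'stale' : 'fresh',
        'age_days'    => floor($ageDays),
        'raw_infobox' => $row['raw_infobox'],
    );
}
```

    The front end can show 'stale' data with an "N days old, updating..." note, kick off the refresh in the background, and swap the new content in when it arrives.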

    This answer was accepted by the asker as the best answer.
