Wikipedia integration problem - need to finally solve this 101

Posted 2024-07-24 21:52:13 | 1614 characters | 2 views | 0 comments

Sorry guys, I've been running amok asking questions on how to integrate Wikipedia data into my application, and frankly I don't think I've had any success on my end: I've been trying all the ideas and kind of giving up whenever I hit a dead end or obstacle. I'll try to explain exactly what I am trying to do here.

I have a simple directory of locations like cities and countries. My application is a simple PHP-based AJAX application with a search and browse facility. People sign up and associate themselves with a city, and when a user browses cities, he/she can see the people and companies in that city, i.e. whoever is part of our system.

That part is fairly easy to set up on its own and is working fine. The thing is that my search results need a particular format. Say someone searches for Beijing; it would return a three-tabbed interface box:

  1. The first tab would have an infobox containing city information for Beijing.
  2. The second would be a country tab holding an infobox of country information for China.
  3. The third tab would have listings of all contacts in Beijing.

The content for the first two tabs should come from Wikipedia. Now I'm totally lost as to the best way to get this done and, once I decide on a methodology, how to do it in a way that is quite robust.

A couple of ideas, good and bad, that I have been able to digest so far:

  1. Run a curl request directly to Wikipedia and parse the returned data every time a search is made. In this case there is no need to maintain a local copy of the Wikipedia data. The issue is that it's wholly reliant on data from a remote third-party location, and I doubt it is feasible to make a request to Wikipedia every time just to retrieve basic information. Plus, considering that the data from Wikipedia has to be parsed on every request, that's going to amount to heavy server load... or am I speculating here?

  2. Download the Wikipedia dump and query that. Well, I've downloaded the entire database, but it's going to take forever to import all the tables from the XML dump. Plus, consider the fact that I just want to extract a list of countries and cities and their infoboxes; a lot of the information in the dump is of no use to me.

  3. Make my own local tables and create a cron script [I'll explain why a cron job below] that would somehow parse all the country and city pages on Wikipedia and convert them to a format I can use in my tables. Honestly speaking, though, I do not need all of the information in the infoboxes; in fact, if I could just get the basic markup of the infoboxes as is, that would be more than enough for me. Like:

Title of Country | Infobox Raw text

I can personally extract stuff like coordinates and other details if I want.
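To make the "Title | Infobox Raw text" idea concrete, here is a minimal sketch of pulling the raw infobox markup out of a page's wikitext by matching braces. It is written in Python for brevity, but the same loop ports to PHP directly. The function name `extract_infobox` is hypothetical, and the sketch assumes the infobox is a template whose name starts with "Infobox", which is common on English Wikipedia but not guaranteed for every page.

```python
import re
from typing import Optional


def extract_infobox(wikitext: str) -> Optional[str]:
    """Return the raw {{Infobox ...}} template text, or None if absent.

    Assumption: the template name starts with "Infobox" (any case).
    Nested templates such as {{coord|...}} are kept inside the result.
    """
    match = re.search(r"\{\{\s*infobox", wikitext, re.IGNORECASE)
    if match is None:
        return None
    start = match.start()
    depth = 0
    i = start
    # Walk the text two characters at a time, tracking template nesting.
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return wikitext[start:i]
        else:
            i += 1
    return None  # braces never balanced: treat as "not found"
```

The returned string could be stored as-is next to the page title, and fields such as coordinates could be parsed out of it later only if they turn out to be needed.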

I even tried downloading third-party datasets from Infochimps and DBpedia, but the dataset from Infochimps is incomplete and didn't contain all the information I wanted to display. As for DBpedia, I have absolutely no idea what to do with the CSV file of infoboxes I downloaded, and I'm afraid that it might not be complete either.

But that is just part of the issue here. I want a way to show the Wikipedia information: I'll have all the links point to Wikipedia, as well as nice info from Wikipedia displayed properly all around. BUT the issue is that I need a way to periodically update the information I have from Wikipedia, so at least I don't have totally outdated data. Say, a system that can check whether there is a new country or new location and, if so, parse the information and somehow retrieve it. I'm relying on the categories of countries and cities in Wikipedia for this, but frankly all these ideas are on paper or partially coded, and it's a huge mess.
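One way the periodic check could work is to ask the MediaWiki API for each page's latest revision id and compare it with the one stored locally, refetching only the pages that changed. The sketch below is in Python for illustration and does no network I/O: it only builds the query URL (`action=query` with `prop=info`) and interprets a response body. The helper names `build_info_url` and `stale_titles` are hypothetical, and the English Wikipedia endpoint and the dictionary of stored revision ids are assumptions about the setup.

```python
import json
from urllib.parse import urlencode

# Assumption: the English Wikipedia endpoint is the one being queried.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_info_url(titles):
    """Build a MediaWiki API URL that returns page info for the given titles."""
    params = {
        "action": "query",
        "prop": "info",
        "titles": "|".join(titles),  # several titles per request
        "format": "json",
        "formatversion": "2",        # "pages" comes back as a list
    }
    return API_ENDPOINT + "?" + urlencode(params)


def stale_titles(stored_revids, api_response_text):
    """Return titles whose latest revision id differs from the stored one.

    stored_revids: dict mapping page title -> revision id saved locally.
    api_response_text: JSON body returned for a build_info_url() request.
    """
    pages = json.loads(api_response_text)["query"]["pages"]
    stale = []
    for page in pages:
        if page.get("missing"):
            continue  # no such page on the wiki; nothing to refresh
        if stored_revids.get(page["title"]) != page["lastrevid"]:
            stale.append(page["title"])
    return stale
```

A cron job could call this in batches, since the API limits how many titles one request may carry, and then refetch only the titles it reports. Discovering new places through a category would be a separate `list=categorymembers` query, which is not shown here.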

I'm programming in PHP and MySQL and my deadline is fast approaching. Given the above situation and requirements, what is the best and most practical method to follow and implement? I am totally open to ideas, and if anyone has done something similar I would love to hear practical examples :D

Comments (4)

第几種人 2024-07-31 21:52:13

A couple things I can think of:

  1. Just display the Wikipedia data in an iframe on your site.

  2. Use curl to get the HTML from Wikipedia, then use a custom stylesheet to style it and/or hide the parts you don't want displayed.

Trying to actually parse the HTML and pull out the pieces you want is going to be a giant pain, and will most likely have to be customized for each city. You're better off getting something simple working for now, then going back and improving it later if you decide you really need to.

遥远的她 2024-07-31 21:52:13

How about using one of the Wikipedia geocoding web services?

There are several available where you can pass in e.g. a postal code and country and get back a short article summary and a link to the Wikipedia article.

If that would be enough.

北城孤痞 2024-07-31 21:52:13

I'd suggest the following:

  • Query the city from Wikipedia when it (the city) is created in your DB.
  • Parse the data and store a local copy with the timestamp of the last update.
  • On access, update the data if necessary. You can display the old copy with a watermark saying it is ... days old and is now updating, then switch to the freshly acquired one when the update is done. You've said you are using AJAX, so it won't be a problem.

This would minimize the queries to Wikipedia, and your service won't show empty pages even when Wikipedia is unreachable.
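The local-copy-with-timestamp idea above can be sketched roughly as follows. This is only an illustration in Python, using an in-memory SQLite database as a stand-in for the MySQL table. The table name `infobox_cache`, the helper names, and the seven-day freshness window are all assumptions, not part of the original suggestion.

```python
import sqlite3
import time

# Hypothetical freshness window: treat copies older than 7 days as stale.
MAX_AGE_SECONDS = 7 * 24 * 3600


def open_cache(path=":memory:"):
    """Open (or create) the cache table; SQLite stands in for MySQL here."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS infobox_cache ("
        "title TEXT PRIMARY KEY, raw_text TEXT NOT NULL, fetched_at REAL NOT NULL)"
    )
    return db


def store(db, title, raw_text, now=None):
    """Insert or overwrite the local copy and record when it was fetched."""
    now = time.time() if now is None else now
    db.execute(
        "INSERT OR REPLACE INTO infobox_cache (title, raw_text, fetched_at) "
        "VALUES (?, ?, ?)",
        (title, raw_text, now),
    )
    db.commit()


def lookup(db, title, now=None):
    """Return (raw_text, age_in_days, is_stale), or None if never fetched."""
    now = time.time() if now is None else now
    row = db.execute(
        "SELECT raw_text, fetched_at FROM infobox_cache WHERE title = ?",
        (title,),
    ).fetchone()
    if row is None:
        return None
    raw_text, fetched_at = row
    age = now - fetched_at
    return raw_text, int(age // 86400), age > MAX_AGE_SECONDS
```

On each page view the application would call `lookup`, render the stored text straight away, show the "N days old, updating" notice when `is_stale` is true, and trigger a background refetch that ends with `store`. If Wikipedia is unreachable, the stale copy simply stays on screen.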

阳光下的泡沫是彩色的 2024-07-31 21:52:13

Have a look at DBpedia; it contains a nice extraction of Wikipedia data in CSV format.
