Rails + MediaWiki API for Wikipedia data extraction
I am trying to use Rails to extract data from Wikipedia, based on a search term.
For example,
1) If I have the string "American Idol", I want to pass it to Wikipedia and get back a list of related articles. My goal is to take the first 3 hyperlinks and display them on the website.
2) One step further, I would like to extract small pieces of data from Wikipedia, say the infobox or the first few words of the article.
Any tips?
Thanks!
Comments (3)
You don't need to resort to screen-scraping; MediaWiki has a very comprehensive API for precisely this kind of thing. See https://github.com/jpatokal/mediawiki-gateway for a handy Ruby wrapper around it.
Alternatively, if you're only interested in data like infoboxes, see DBpedia for the database version of Wikipedia.
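For what it's worth, here is a minimal sketch of both parts of the question using only Ruby's standard library instead of a wrapper gem. It assumes the English Wikipedia endpoint, the `opensearch` action for the top 3 links, and `prop=extracts` (provided by the TextExtracts extension that Wikipedia runs) for the opening text. The helper names and the User-Agent string are made up for illustration, and it does not cover infoboxes, which is where DBpedia comes in.

```ruby
require "json"
require "net/http"
require "uri"

# Assumption: English Wikipedia; swap the host for another language edition.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# Build an API URL from a hash of query parameters, always asking for JSON.
def api_uri(params)
  uri = URI(API_ENDPOINT)
  uri.query = URI.encode_www_form(params.merge(format: "json"))
  uri
end

# action=opensearch answers with a four-element array:
# [query, [titles], [descriptions], [urls]]
def search_uri(term, limit = 3)
  api_uri(action: "opensearch", search: term, limit: limit)
end

# Turn an opensearch response body into [{ title:, url: }, ...].
def parse_search(body)
  _query, titles, _descriptions, urls = JSON.parse(body)
  titles.zip(urls).map { |title, url| { title: title, url: url } }
end

# prop=extracts with exintro and explaintext asks for the lead section as plain text.
def intro_uri(title)
  api_uri(action: "query", prop: "extracts", exintro: 1,
          explaintext: 1, redirects: 1, titles: title)
end

# Pull the extract of the first page out of a query response body.
# Returns nil if the response contains no pages or no extract.
def parse_intro(body)
  pages = JSON.parse(body).dig("query", "pages") || {}
  page = pages.values.first
  page && page["extract"]
end

# Plain HTTPS GET that returns the response body as a string.
def fetch(uri)
  request = Net::HTTP::Get.new(uri)
  # Placeholder value: Wikimedia asks API clients to identify themselves.
  request["User-Agent"] = "ExampleRailsApp/0.1 (you@example.com)"
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
    http.request(request)
  end
  response.value # raises unless the status is 2xx
  response.body
end

# Hypothetical convenience wrappers you might call from a controller.
def top_links(term)
  parse_search(fetch(search_uri(term)))
end

def intro_text(title)
  parse_intro(fetch(intro_uri(title)))
end
```

In a Rails app you would probably call `top_links("American Idol")` from a controller or service object and render each entry in the view with `link_to link[:title], link[:url]`. Caching the results is worth considering, since every call is a round trip to Wikipedia.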
There is another gem that you can use: https://github.com/kenpratt/wikipedia-client
This gem seems to return only the first result of your search, but you can consult the documentation to be sure.
As for the content, once you have the page, the gem lets you access the article's text, links, images and so on.
Use mechanize and nokogiri to do that. This is a great cheat sheet for that:
http://www.e-tobi.net/blog/files/ruby-mechanize-cheat-sheet.pdf
Mechanize is a toolbox for simulating website requests, and nokogiri is an HTML/XML parser. It should be simple to implement with those two.