How can I extract statistics from Wikipedia?
I want to extract a list of all dead people in Wikipedia and compare their ages when they died. All dead people in Wikipedia have the following fields filled in:
| birth_name = Thomas Alva Edison
| birth_date = {{birth date|mf=yes|1847|02|11}}
| death_date ={{death date and age|mf=yes|1931|10|18|1847|02|11}}
Will I have to write a crawler? Is there anything in the Wikipedia API that can help me?
Is there any place where I can start crawling? Is there a list of dead people?
Comments (2)
You can find a dump of all the contents of Wikipedia available for download here:
http://dumps.wikimedia.org/enwiki/latest/
The file is an .xml file of several gigabytes in size, and contains the text of all the pages on Wikipedia (amongst other things). How you process this depends on what programming language you're going to use.
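For the dump approach, here is a rough Python sketch of how the processing might look. It streams pages out of the XML one at a time and computes the age at death from the `death date and age` template shown in the question. The function names `iter_pages` and `age_at_death` are made up for illustration, and the sketch assumes the template carries six full numeric arguments (death year, month, day, then birth year, month, day); pages with partial dates or other templates are simply skipped.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date

# Matches {{death date and age|...}} and captures the pipe-separated arguments.
# Assumption: the arguments contain no nested templates.
DEATH_TEMPLATE = re.compile(
    r"\{\{\s*death date and age\s*\|([^{}]*)\}\}", re.IGNORECASE
)


def age_at_death(wikitext):
    """Return the age in whole years from the template, or None if not found."""
    match = DEATH_TEMPLATE.search(wikitext)
    if match is None:
        return None
    # Keep positional numeric arguments only; named ones like mf=yes are skipped.
    args = (arg.strip() for arg in match.group(1).split("|"))
    numbers = [int(arg) for arg in args if arg.isdigit()]
    if len(numbers) < 6:
        return None  # partial dates are not handled in this sketch
    try:
        died = date(*numbers[0:3])
        born = date(*numbers[3:6])
    except ValueError:
        return None  # e.g. a month of 13 in a malformed template
    # Subtract one year if the birthday had not yet occurred in the year of death.
    had_no_birthday = (died.month, died.day) < (born.month, born.day)
    return died.year - born.year - had_no_birthday


def iter_pages(source):
    """Yield (title, wikitext) pairs from a MediaWiki XML dump, page by page.

    `source` is a filename or a binary file object.
    """
    title = None
    text = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text or ""
            title = text = None
            elem.clear()  # discard this page's children to limit memory use
```

Something like `for title, text in iter_pages(bz2.open(path, "rb"))` would then let you collect `age_at_death(text)` for every page where it is not `None`, assuming you downloaded a bz2-compressed articles dump.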
This is what DBpedia is for - all the structured data from Wikipedia in a database. Try the following query at http://dbpedia.org/sparql :