How do I get all article pages under a Wikipedia category and its subcategories?
I want to get the names of all articles under a category and its subcategories.
Options I'm aware of:
- Using the Wikipedia API. Does it have such an option?
- Downloading the dump. Which format would be better for my usage?
- Searching Wikipedia with something like incategory:"music", but I didn't see an option to view the results as XML.
Please share your thoughts.
Comments (3)
The following resource will help you to download all pages from the category and all its subcategories:
http://en.wikipedia.org/wiki/Wikipedia:CatScan
There is also an API available here:
https://www.mediawiki.org/wiki/API:Categorymembers
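As a rough illustration of the Categorymembers API mentioned above, here is a minimal sketch that lists the members of a single category and follows the API's continuation tokens. The endpoint URL, the User-Agent string, and the function names `http_get_json` and `category_members` are assumptions made for this example, not something given in the answer.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"  # assumed endpoint (English Wikipedia)


def http_get_json(params):
    """Send one GET request to the API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # The User-Agent value is a placeholder; replace it with your own contact info.
    request = urllib.request.Request(url, headers={"User-Agent": "category-sketch/0.1"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def category_members(category, fetch=http_get_json):
    """Yield every member of one category, following continuation tokens."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,  # e.g. "Category:Music"
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        data = fetch(params)
        yield from data["query"]["categorymembers"]
        if "continue" not in data:
            break
        # Merge the continuation values into the next request.
        params = {**params, **data["continue"]}
```

Calling `category_members("Category:Music")` and reading the `"title"` key of each yielded item should give the page names. The `fetch` parameter is there only so the paging logic can be exercised without network access.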
You can do this through the following two API methods:
For the article pages in this category:
For getting the subcategories:
You can find more information in the MediaWiki API documentation.
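The two request links did not survive in this copy of the answer, so the following is only a guess at the kind of queries meant: one `list=categorymembers` request with `cmtype=page` for pages and one with `cmtype=subcat` for subcategories. The endpoint, the example category, and the helper name `members_url` are assumptions.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"  # assumed endpoint (English Wikipedia)


def members_url(category, member_type):
    """Build a categorymembers query URL; member_type is "page" or "subcat"."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": member_type,
        "cmlimit": "500",
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


# One URL for the pages in the category, one for its subcategories.
articles_url = members_url("Category:Music", "page")
subcats_url = members_url("Category:Music", "subcat")
```

Note that `cmtype=page` returns every non-category, non-file member; adding `cmnamespace=0` should narrow the result to the main article namespace if that matters for your use.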
Note that Wikipedia's categorization system is not a tree, or even an acyclic graph. It is quite possible that by continually following subcategory links you will eventually wind up back where you started.
If you are going to be making many such queries, you would be best served by downloading a database dump. If this will be an infrequent thing and will only be dealing with small categories, you could probably get away with making repeated queries to list=categorymembers.
incategory:"music" does not appear to do subcategory searching.
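Because the category graph can contain cycles, any recursive walk needs to remember which categories it has already visited. Below is a minimal sketch of such a walk. The function name `articles_under`, the `get_members` callback (assumed to return a pair of page titles and subcategory titles for one category), the `max_depth` cutoff, and the toy data are all hypothetical; in practice `get_members` would wrap repeated `list=categorymembers` queries or a lookup in a database dump.

```python
from collections import deque


def articles_under(root, get_members, max_depth=3):
    """Collect article titles under root, visiting each category at most once."""
    seen = {root}              # categories already queued, guards against cycles
    queue = deque([(root, 0)])
    articles = set()
    while queue:
        category, depth = queue.popleft()
        pages, subcats = get_members(category)
        articles.update(pages)
        if depth >= max_depth:
            continue           # do not descend any further
        for sub in subcats:
            if sub not in seen:
                seen.add(sub)
                queue.append((sub, depth + 1))
    return articles


# Toy data with a cycle: A contains B, and B contains A again.
toy = {
    "Category:A": (["a1"], ["Category:B"]),
    "Category:B": (["b1", "a1"], ["Category:A"]),
}
print(sorted(articles_under("Category:A", toy.__getitem__)))  # ['a1', 'b1']
```

The depth limit is a design choice rather than a requirement: broad categories can fan out into very large and loosely related parts of the graph, so capping the depth keeps the number of queries bounded.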