There isn't necessarily a category corresponding to a portal, although you could try looking for a category with the same name as the portal, the categories the portal page is in (using the API, you can query this with prop=categories), or the category pages linked from the portal page (prop=links&plnamespace=14).
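A minimal sketch of both lookups against the MediaWiki API, assuming the English Wikipedia endpoint and an illustrative portal title (not taken from the question):

    # Sketch: find candidate categories for a portal via the MediaWiki API.
    # The endpoint and the portal title below are illustrative assumptions.
    import json
    import urllib.parse
    import urllib.request

    API = "https://en.wikipedia.org/w/api.php"

    def api_get(**params):
        params.update(action="query", format="json")
        url = API + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    title = "Portal:Computer science"  # example portal

    # Categories the portal page itself is in (prop=categories).
    cats = api_get(titles=title, prop="categories", cllimit="max")

    # Category pages linked from the portal page (prop=links, namespace 14).
    links = api_get(titles=title, prop="links", plnamespace=14, pllimit="max")

    print(list(cats["query"]["pages"].values())[0].get("categories", []))
    print(list(links["query"]["pages"].values())[0].get("links", []))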
Any of those languages would work. You could also pick Perl, Java, C#, Objective-C, or just about any other language. A list of frameworks of varying quality can be found here or here.
The API can certainly give you content, using prop=revisions. You can even query just the "lead" section with rvsection=0. The API can also give you the list of pages in a category with list=categorymembers, and the list of categories for a page with prop=categories. 500 pages shouldn't be an issue. If you wanted a significant proportion of the articles, you'd want to look into using a database dump instead.
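For example, a minimal sketch of those two queries, again assuming the English Wikipedia endpoint and example titles:

    # Sketch: fetch the lead section of a page and list the pages in a
    # category. Endpoint and titles are illustrative assumptions.
    import json
    import urllib.parse
    import urllib.request

    API = "https://en.wikipedia.org/w/api.php"

    def api_get(**params):
        params.update(action="query", format="json")
        url = API + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    # Lead section only, as wikitext (prop=revisions, rvsection=0).
    lead = api_get(titles="Computer science", prop="revisions",
                   rvprop="content", rvslots="main", rvsection=0)

    # Pages in a category (list=categorymembers).
    members = api_get(list="categorymembers",
                      cmtitle="Category:Computer science", cmlimit=50)

    print([m["title"] for m in members["query"]["categorymembers"]])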
See the API documentation for details.
Python. Have fun scraping the page; for this I would suggest XPath via lxml.html.
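A minimal sketch of that approach, assuming the requests and lxml packages are installed and using an example article URL and XPath expression:

    # Sketch: scrape one page with lxml.html and XPath. Assumes the
    # requests and lxml packages are installed; URL and XPath are examples.
    import requests
    from lxml import html

    url = "https://en.wikipedia.org/wiki/Computer_science"  # example page
    tree = html.fromstring(requests.get(url).content)

    # First paragraphs of the article body; the XPath is an assumption
    # about the page layout, so adjust it for whatever you are scraping.
    for p in tree.xpath('//div[@id="mw-content-text"]//p')[:3]:
        print(p.text_content().strip()[:120])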
Although you are looking for a web-crawler-based solution, let me suggest that you take a look at DBpedia. Essentially it is Wikipedia in RDF format. You can download entire database dumps, run SPARQL queries against it, or point directly to a resource and start exploring from there by walking the references.
For example, the Computer science category can be accessed at this URL:
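As an illustration, here is a minimal sketch that queries the public DBpedia SPARQL endpoint from Python; the endpoint URL and the example query are assumptions for the example, and dct:subject is the property DBpedia uses to link articles to their categories:

    # Sketch: query the public DBpedia SPARQL endpoint from Python.
    # The endpoint URL and the example query are illustrative assumptions.
    import json
    import urllib.parse
    import urllib.request

    ENDPOINT = "https://dbpedia.org/sparql"
    QUERY = """
    PREFIX dct: <http://purl.org/dc/terms/>
    SELECT ?article WHERE {
      ?article dct:subject <http://dbpedia.org/resource/Category:Computer_science> .
    } LIMIT 10
    """

    url = ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": QUERY, "format": "application/sparql-results+json"})
    with urllib.request.urlopen(url) as resp:
        results = json.load(resp)

    for row in results["results"]["bindings"]:
        print(row["article"]["value"])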
I would suggest Python for fast development. You need two modules: one that crawls all the possible categories inside a category (basically a category tree), and another that extracts information from the detail pages (i.e. normal wiki pages). Wikipedia supports Special:Export in the URL, which will give you an XML response; using an XPath module in Python (such as lxml, mentioned above) will help you parse it. A sketch of the Special:Export part follows.
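A minimal sketch of the Special:Export step, assuming lxml is installed and using an example page title:

    # Sketch: fetch a page via Special:Export and read it with XPath.
    # Assumes lxml is installed; the page title below is an example.
    import urllib.parse
    import urllib.request
    from lxml import etree

    title = "Computer science"  # example page
    url = ("https://en.wikipedia.org/wiki/Special:Export/"
           + urllib.parse.quote(title))

    with urllib.request.urlopen(url) as resp:
        tree = etree.parse(resp)

    # The export XML is namespaced; take the namespace from the root element.
    ns = {"mw": tree.getroot().tag.split("}")[0].lstrip("{")}

    print(tree.xpath("//mw:page/mw:title/text()", namespaces=ns))
    text = tree.xpath("//mw:revision/mw:text/text()", namespaces=ns)
    print(text[0][:200] if text else "no text found")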