I'm doing a research project for the summer and I have to get some data from Wikipedia, store it, and then do some analysis on it. I'm using the Wikipedia API to gather the data, and I've got that down pretty well.
My question is about the links-alllinks
option in the API doc here.
After reading the description, both there and in the API itself (it's a bit down the page and I can't link directly to the section), I think I understand what it's supposed to return. However, when I ran a query, it gave me back something I didn't expect.
Here's the query I ran:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=google&rvprop=ids|timestamp|user|comment|content&rvlimit=1&list=alllinks&alunique&allimit=40&format=xml
Which in essence says: Get the last revision of the Google page, include the id, timestamp, user, comment and content of each revision, and return it in XML format.
The alllinks list (I thought) should give me back a list of Wikipedia pages which point to the Google page (in this case, the first 40 unique ones).
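For reference, the same query URL can be assembled programmatically rather than hand-built; a minimal sketch in Python (the parameter names come straight from the URL above, but the `build_query` helper itself is just an illustration):

```python
from urllib.parse import urlencode

API = "http://en.wikipedia.org/w/api.php"

def build_query(params):
    """Assemble a MediaWiki API URL from a dict of parameters."""
    return API + "?" + urlencode(params)

url = build_query({
    "action": "query",
    "prop": "revisions",
    "titles": "google",
    "rvprop": "ids|timestamp|user|comment|content",
    "rvlimit": "1",
    "list": "alllinks",
    "alunique": "",   # flag parameter: present but valueless
    "allimit": "40",
    "format": "xml",
})
print(url)
```

Note that `urlencode` percent-encodes the `|` separators as `%7C`, which the API accepts just as well as the literal character.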
I'm not sure what the policy is on swearing, but this is exactly the result I got back:
<?xml version="1.0"?>
<api>
<query><normalized>
<n from="google" to="Google" />
</normalized>
<pages>
<page pageid="1092923" ns="0" title="Google">
<revisions>
<rev revid="366826294" parentid="366673948" user="Citation bot" timestamp="2010-06-08T17:18:31Z" comment="Citations: [161]Tweaked: url. [[User:Mono|Mono]]" xml:space="preserve">
<!-- The page content; I've replaced this because it's not of interest -->
</rev>
</revisions>
</page>
</pages>
<alllinks>
<!-- offensive content removed -->
</alllinks>
</query>
<query-continue>
<revisions rvstartid="366673948" />
<alllinks alfrom="!2009" />
</query-continue>
</api>
The <alllinks>
part is just a load of random gobbledygook and offensive comments. Not nearly what I thought I'd get. I've done a fair bit of searching, but I can't seem to find a direct answer to my question.
- What should the
list=alllinks
option return?
- Why am I getting this crap in there?
You don't want a list; a list is something that iterates over all pages. list=alllinks simply "enumerates all links that point to a given namespace", regardless of which page they come from, which is why you got back random-looking entries from all over the wiki.
You want a property associated with the Google page, so you need prop=links instead of the alllinks crap.
So your query becomes:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions|links&titles=google&rvprop=ids|timestamp|user|comment|content&rvlimit=1&format=xml
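With prop=links, the response gains a <links> element under <page>, with one <pl> child per link on the page. A quick sketch of pulling those out with Python's standard XML parser; the sample document here is made up to mirror that shape, not an actual API capture:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the shape of a prop=links response:
# each <pl> element under <links> is one link on the Google page.
sample = """<?xml version="1.0"?>
<api>
  <query>
    <pages>
      <page pageid="1092923" ns="0" title="Google">
        <links>
          <pl ns="0" title="Search engine" />
          <pl ns="0" title="PageRank" />
        </links>
      </page>
    </pages>
  </query>
</api>"""

root = ET.fromstring(sample)
titles = [pl.get("title") for pl in root.iter("pl")]
print(titles)  # ['Search engine', 'PageRank']
```

One caveat: prop=links returns the links *on* the Google page, not the pages that link *to* it; for the latter you'd look at list=backlinks instead.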