This question is a bit hard to follow, but I will do my best to explain it. First, let me present an example page:
http://en.wikipedia.org/wiki/African_bush_elephant
That's a Wikipedia page, a species page in particular, since it has the 'taxobox' on the right. I'm trying to parse the attributes in that taxobox using PHP. There are two ways to create such a taxobox in Wikipedia: manually, or by using the special "auto taxobox" template.
I can parse the manual one. I use Wikipedia's API to return the page's content in JSON format, then I use some regular expressions to get those properties.
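For reference, the manual-taxobox approach described above might look roughly like the sketch below. The function names are made up for illustration, and the response layout assumed here is the legacy JSON format, where the wikitext sits under the "*" key of the first revision. The regular expression is deliberately naive and is only meant for flat `| key = value` pairs.

```php
<?php
// Build a MediaWiki API URL that returns a page's raw wikitext as JSON.
// (buildWikitextUrl is a hypothetical helper name.)
function buildWikitextUrl(string $title): string
{
    $params = [
        'action' => 'query',
        'prop'   => 'revisions',
        'rvprop' => 'content',
        'titles' => $title,
        'format' => 'json',
    ];
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query($params);
}

// Pull the wikitext out of the JSON response.
// ASSUMPTION: legacy layout, query.pages.<id>.revisions[0]["*"].
function extractWikitext(string $json): ?string
{
    $data = json_decode($json, true);
    if (!is_array($data) || !isset($data['query']['pages'])) {
        return null;
    }
    foreach ($data['query']['pages'] as $page) {
        if (isset($page['revisions'][0]['*'])) {
            return $page['revisions'][0]['*'];
        }
    }
    return null;
}

// Read simple "| key = value" pairs with a regular expression.
// Values stop at the next pipe, brace, angle bracket, or newline, so
// nested templates and <ref> tags are cut off or misread.
function simpleParams(string $wikitext): array
{
    preg_match_all('/\|\s*(\w+)\s*=\s*([^|{}<>\n]+)/', $wikitext, $matches, PREG_SET_ORDER);
    $out = [];
    foreach ($matches as $match) {
        $out[$match[1]] = trim($match[2]);
    }
    return $out;
}
```

Fetching would then be something like `extractWikitext(file_get_contents(buildWikitextUrl('African bush elephant')))`, assuming URL wrappers are enabled for `file_get_contents`.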
In the case of an auto taxobox, however, the content returned is like this:
> {{automatic taxobox | name = African Bush Elephant<ref
> name=MSW3>{{MSW3 Proboscidea | id = 11500009 | page =
> 91}}</ref> | status = VU | status_system = iucn3.1 | status_ref
> = <ref name=IUCN>{{IUCN2010|assessors=Blanc, J.|year=2008|version=2010.1|id=12392|title=Loxodonta
> africana|downloaded=04 April 2010}}</ref> | trend = unknown |
> image = African Bush Elephant.jpg | taxon = Loxodonta africana |
> synonyms = ''Loxodonta africana africana'' | binomial = ''Loxodonta
> africana'' | binomial_authority = ([[Johann Friedrich
> Blumenbach|Blumenbach]], 1797) }}
If you compare this with the actual page as you see it on Wikipedia, you'll notice that several attributes are missing. For example, the property "Kingdom" is visible on the real page but not returned here. There are more properties missing like that.
This is likely because the template relies on server-side processing by Wikipedia to turn it into the actual output. I learned that the API has an "expandtemplates" action: you send it a snippet like the one above and get back the result as the user would see it. I'm using this for several templates and it works, but unfortunately not for the auto taxobox template. Click this link to see what expandtemplates returns:
complete link
As you can see, the template doesn't actually expand. Instead, it shows more templates, nested and repeated several times.
So now I'm stuck trying to read these properties from pages that use the auto taxobox template. The only other direction I can think of is to skip the API and just parse the HTML of the actual page. That would be doable for some properties, but others are extremely fragile to parse.
Comments (3)
Use action=parse instead of action=expandtemplates. As you've noticed, expandtemplates only expands a single level; additionally, it won't fully preprocess input (e.g., it won't successfully handle certain variable references inside templates).
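A rough sketch of the action=parse route follows. It requests the rendered HTML and reads label/value rows from the infobox table. The helper names are invented for illustration. Two things are assumptions to verify against a real response: that the HTML sits under parse.text["*"] (the legacy JSON layout), and that the taxobox renders as a table whose class contains "infobox" with two-cell rows such as "Kingdom:" / "Animalia".

```php
<?php
// Build an action=parse request for the rendered HTML of a page.
// (buildParseUrl is a hypothetical helper name.)
function buildParseUrl(string $title): string
{
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query([
        'action' => 'parse',
        'page'   => $title,
        'prop'   => 'text',
        'format' => 'json',
    ]);
}

// ASSUMPTION: legacy JSON layout, HTML under parse.text["*"].
function parsedHtml(string $json): ?string
{
    $data = json_decode($json, true);
    return $data['parse']['text']['*'] ?? null;
}

// Collect label => value pairs from the first table whose class
// contains "infobox", using only rows that have exactly two <td> cells.
// ASSUMPTION: the taxobox is rendered as such a table.
function taxoboxRows(string $html): array
{
    libxml_use_internal_errors(true); // real-world HTML triggers parser warnings
    $doc = new DOMDocument();
    // The XML declaration prefix hints UTF-8 to the HTML parser.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows = $xpath->query('(//table[contains(@class, "infobox")])[1]//tr[count(td) = 2]');
    $out = [];
    foreach ($rows as $row) {
        $cells = $xpath->query('td', $row);
        $label = rtrim(trim($cells->item(0)->textContent), ':');
        $out[$label] = trim($cells->item(1)->textContent);
    }
    return $out;
}
```

Reading rows through the DOM rather than with regular expressions keeps the extraction tolerant of attribute order and nested links, although it still breaks if the table layout changes.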
Instead of reinventing the wheel, check out DBpedia, which has already extracted everything possible from Wikipedia templates and made it public in a variety of easily parsable formats.
This is a snippet of working PHP template parsing code.
The goal is to have an array ($data) that looks like:
$data[page name] = array(key1=>val1, key2=>val2...);
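A minimal sketch of one way to fill such an array (not the answerer's original code, and the function name is invented): it walks the wikitext character by character and splits on "|" only outside nested {{...}} and [[...]] pairs. It returns only the parameters literally present in the wikitext, so for an automatic taxobox it will not produce derived values such as Kingdom.

```php
<?php
// Parse the parameters of the first {{<name> ...}} template in $wikitext.
// Pipes inside nested {{...}} templates and [[...]] links stay part of
// the surrounding value. Triple braces and <nowiki> are not handled.
function parseTemplateParams(string $wikitext, string $name): array
{
    $start = stripos($wikitext, '{{' . $name);
    if ($start === false) {
        return [];
    }

    $parts = [];
    $current = '';
    $depth = 0;
    $len = strlen($wikitext);
    for ($i = $start + 2; $i < $len; $i++) {
        $pair = substr($wikitext, $i, 2);
        if ($pair === '{{' || $pair === '[[') {
            $depth++;
            $current .= $pair;
            $i++; // skip the second character of the pair
        } elseif ($pair === '}}' || $pair === ']]') {
            if ($depth === 0) {
                break; // closing braces of the outer template
            }
            $depth--;
            $current .= $pair;
            $i++;
        } elseif ($wikitext[$i] === '|' && $depth === 0) {
            $parts[] = $current;
            $current = '';
        } else {
            $current .= $wikitext[$i];
        }
    }
    $parts[] = $current;
    array_shift($parts); // drop the template name itself

    $params = [];
    foreach ($parts as $part) {
        if (strpos($part, '=') === false) {
            continue; // positional parameter, ignored in this sketch
        }
        [$key, $value] = explode('=', $part, 2);
        $params[trim($key)] = trim($value);
    }
    return $params;
}

// Usage: key the result by page name, as described above.
$wikitext = '{{automatic taxobox | status = VU | taxon = Loxodonta africana '
    . '| binomial_authority = ([[Johann Friedrich Blumenbach|Blumenbach]], 1797) }}';
$data = [];
$data['African bush elephant'] = parseTemplateParams($wikitext, 'automatic taxobox');
```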