What is the best way to parse Wikipedia markup in PHP?
I'm trying to parse specific Wikipedia content in a structured way. Here's an example page:
http://en.wikipedia.org/wiki/Polar_bear
I'm having some success. I can detect that this page is a "species" page, and I can also parse the Taxobox information (on the right) into a structure. So far so good.
However, I'm also trying to parse the text paragraphs. These are returned by the API in either Wiki format or HTML format; I'm currently working with the Wiki format.
I can read these paragraphs, but I'd like to "clean" them in a specific way, because ultimately I have to display them in my app, which has no notion of Wiki markup. For example, I'd like to remove all images. That's fairly easy by filtering out [[Image:]] blocks. Yet there are also blocks that I cannot simply remove, such as:
{{convert|350|-|680|kg|abbr=on}}
Removing this entire block would break the sentence, and there are dozens of notations like this that carry special meaning. I'd like to avoid writing 100 regular expressions to handle all of them, and instead parse this in a smarter way.
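To illustrate the scale of the problem, here's the kind of one-off code I keep having to write. This is a sketch of my own (the function names are made up), and both regexes assume non-nested markup, which real wikitext frequently violates:

```php
<?php
// Naive cleanup of a wikitext fragment. Both patterns assume the
// markup is not nested, which real articles frequently violate.

function stripImages(string $wikitext): string
{
    // Fails on nested [[...]] links inside the image caption.
    return preg_replace('/\[\[Image:[^\]]*\]\]/', '', $wikitext);
}

function expandConvert(string $wikitext): string
{
    // {{convert|350|-|680|kg|abbr=on}} -> "350-680 kg"
    return preg_replace_callback(
        '/\{\{convert\|([^}]*)\}\}/i',
        function (array $m): string {
            // Drop named arguments like abbr=on, keep positional ones.
            $parts = array_filter(explode('|', $m[1]), function ($p) {
                return strpos($p, '=') === false;
            });
            $unit = array_pop($parts); // last positional argument is the unit
            return implode('', $parts) . ' ' . $unit;
        },
        $wikitext
    );
}

echo expandConvert('Adult males weigh {{convert|350|-|680|kg|abbr=on}}.');
// => "Adult males weigh 350-680 kg."
```

And that only handles two constructs; every other template that contributes visible text would need its own handler, which is exactly the work I want to avoid.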
My dilemma is as follows:
- I could continue my current path of semi-structured parsing, where I'd have a lot of work deleting unwanted elements as well as "mimicking" templates that do need to be rendered.
- Or, I could start from the rendered HTML output and parse that, but my worry is that it's just as fragile and complex to parse in a structured way.
Ideally, there'd be a library to solve this problem, but I haven't found one yet that is up to the job. I also had a look at structured Wikipedia databases like DBPedia, but those only have the same structure I already have; they don't provide any structure within the Wiki text itself.
Comments (1)
There are too many templates in use to reimplement all of them by hand, and they change all the time. So you will need an actual parser for the wiki syntax, one that can process all the templates.
And the wiki syntax is quite complex, has lots of quirks, and has no formal specification. This means creating your own parser would be too much work; you should use the one in MediaWiki.
Because of this, I think getting the parsed HTML through the MediaWiki API is your best bet.
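For instance, a minimal sketch in PHP of that approach (the action=parse endpoint and its parameters are the standard public API; the User-Agent string is a placeholder you should replace with your own):

```php
<?php
// Fetch the fully rendered HTML of an article via action=parse.
$url = 'https://en.wikipedia.org/w/api.php?' . http_build_query([
    'action' => 'parse',
    'page'   => 'Polar_bear',
    'prop'   => 'text',   // rendered body HTML, all templates expanded
    'format' => 'json',
]);

// Wikipedia asks clients to send a descriptive User-Agent.
$context = stream_context_create([
    'http' => ['header' => "User-Agent: MyApp/1.0 (you@example.com)\r\n"],
]);

$response = json_decode(file_get_contents($url, false, $context), true);
$html = $response['parse']['text']['*']; // the '*' key holds the HTML string

// From here, DOMDocument can walk the paragraphs, infobox table, etc.
$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings from Wikipedia's HTML5 markup
```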
One thing that's probably easier to parse from the wiki markup is the infoboxes, so maybe they should be a special case.
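If you do special-case them, you don't need a full parser just to cut a template out of the wikitext; tracking brace depth handles the nested {{...}} that a single regex chokes on. A rough sketch, with a hypothetical helper name:

```php
<?php
// Fetch raw wikitext and extract a top-level template by brace counting.
// As with the API call above, send a descriptive User-Agent in real use.
$wikitext = file_get_contents(
    'https://en.wikipedia.org/w/index.php?title=Polar_bear&action=raw'
);

function extractTemplate(string $wikitext, string $name): ?string
{
    $start = stripos($wikitext, '{{' . $name);
    if ($start === false) {
        return null;
    }
    $depth = 0;
    for ($i = $start, $len = strlen($wikitext); $i < $len; $i++) {
        if (substr($wikitext, $i, 2) === '{{') {
            $depth++;
            $i++; // skip the second brace
        } elseif (substr($wikitext, $i, 2) === '}}') {
            $depth--;
            $i++;
            if ($depth === 0) {
                return substr($wikitext, $start, $i - $start + 1);
            }
        }
    }
    return null; // unbalanced braces
}

$taxobox = extractTemplate($wikitext, 'Taxobox');
// $taxobox now holds "{{Taxobox ...}}" ready for key=value splitting.
```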