Parsing Wikimedia markup - is an EBNF-based parser a poor fit?
I am attempting to parse (in Java) Wikimedia markup as found on Wikipedia. There are a number of existing packages out there for this task, but I have not found any to fit my needs particularly well. The best package I have worked with is the Mathclipse Bliki parser, which does a decent job on most pages.
This parser is incomplete, however, and fails to parse certain pages or parses others incorrectly. Sadly, the code is rather messy, so fixing problems in this parsing engine is very time-consuming and error-prone.
In attempting to find a better parsing engine I have investigated using an EBNF-based parser for this task (specifically ANTLR). After some attempts, however, it seems that this approach is not particularly well suited to the task, as Wikimedia markup is relatively relaxed and thus cannot easily be fitted into a structured grammar.
My experience with ANTLR and similar parsers is very limited, however, so it may be my inexperience that is causing the problems rather than such parsers being inherently ill-suited to the task. Can anyone with more experience on these topics weigh in here?
@Stobor: I've mentioned that I've looked at various parsing engines, including the ones returned by the Google query. The best I've found so far is the Bliki engine. The problem is that fixing problems with such parsers becomes incredibly tedious, because they are all essentially long chains of conditionals and regular expressions, resulting in spaghetti code. I am looking for something more akin to the EBNF method of parsing, as that method is much clearer and more concise, and thus easier to understand and evolve. I've seen the MediaWiki link you posted, and it seems to confirm my suspicion that EBNF out of the box is poorly suited for this task. Thus I am looking for a parsing engine that is as clear and understandable as EBNF, but also capable of handling the messy syntax of wiki markup.
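For reference, getting HTML out of the Bliki engine only takes a few lines; the pain is entirely in the pages where the output comes out wrong. A minimal sketch of the kind of invocation involved, assuming the info.bliki WikiModel API (constructor arguments and method names may differ between versions):

```java
import info.bliki.wiki.model.WikiModel;

public class BlikiSketch {
    public static void main(String[] args) {
        String wikiText = "'''Hello''' [[World]], see [http://example.org example].";

        // The model takes URL templates for images and internal links;
        // ${image} and ${title} are substituted during rendering.
        WikiModel model = new WikiModel("/wiki/${image}", "/wiki/${title}");

        // Convert the wiki markup to HTML.
        String html = model.render(wikiText);
        System.out.println(html);
    }
}
```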
Answers (4)
Parsing MediaWiki content in any generic sense is pretty much impossible short of using MediaWiki itself. In order to parse it you need to be able to fully parse HTML and CSS (since they can be embedded), and to handle full template instantiation and expansion, as well as any parser additions the relevant content may have been using. That template instantiation is equivalent to a preprocessor.
It is in some senses similar to parsing C++, except that the parser must also handle malformed input and arbitrary syntax additions made by parser extensions. The actual MediaWiki implementation is a lot like Perl 5: the original implementation was not so bad, because all the edge cases just fall out of however things happen to be linked together, but getting any subsequent implementation to do the same thing is really complicated, especially since the behaviors are often emergent and undocumented rather than designed.
If you do not need 100% of pages to work, or to be able to extract all content, you might be able to cobble something together that works for you, and as you have noted there are some packages that do that. Short of knowing your actual, precise needs, I doubt anyone can give you a substantially better answer on how to parse it. If you need to be able to work on every page and correctly parse everything, you had better have a fairly large team and several years to work, and even then you will still have lots of small edge cases.
So in short: no, an EBNF grammar is not well suited to parsing MediaWiki markup, but nothing really is...
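To make the point about template instantiation concrete, here is a deliberately over-simplified expansion pass (a hypothetical illustration only; real MediaWiki also handles named and default parameters, parser functions, nested and unbalanced braces, and recursion limits):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Grossly simplified template expansion: replaces {{name|a|b}} with the
 * template body, substituting {{{1}}}, {{{2}}}, ... with the arguments.
 * It runs repeatedly until no more templates expand, just like a
 * preprocessor pass that must finish before any grammar can see the
 * "real" markup. There is no recursion guard, so a self-growing
 * template would loop forever.
 */
public class NaiveTemplateExpander {
    private static final Pattern TEMPLATE =
            Pattern.compile("\\{\\{([^{}|]+)((?:\\|[^{}]*)?)\\}\\}");

    private final Map<String, String> templates = new HashMap<>();

    public void define(String name, String body) {
        templates.put(name.trim(), body);
    }

    public String expand(String text) {
        String previous;
        do {
            previous = text;
            Matcher m = TEMPLATE.matcher(text);
            StringBuffer out = new StringBuffer();
            while (m.find()) {
                String name = m.group(1).trim();
                String[] args = m.group(2).isEmpty()
                        ? new String[0]
                        : m.group(2).substring(1).split("\\|");
                // Unknown templates are left in place unchanged.
                String body = templates.getOrDefault(name, m.group(0));
                for (int i = 0; i < args.length; i++) {
                    body = body.replace("{{{" + (i + 1) + "}}}", args[i]);
                }
                m.appendReplacement(out, Matcher.quoteReplacement(body));
            }
            m.appendTail(out);
            text = out.toString();
        } while (!text.equals(previous)); // repeat until nothing changes
        return text;
    }

    public static void main(String[] args) {
        NaiveTemplateExpander expander = new NaiveTemplateExpander();
        expander.define("greet", "Hello, {{{1}}}!");
        System.out.println(expander.expand("{{greet|World}}")); // Hello, World!
    }
}
```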
You are correct: Wikimedia markup does not lend itself to well-defined EBNF grammars.
You will have to look at tools that can backtrack in order to parse wiki markup.
One such tool is btyacc, which is a backtracking yacc:
http://www.siber.com/btyacc/
You could also look at Accent, which is better than yacc in this respect:
http://accent.compilertools.net/
Or you may have to break down, learn some flavour of Prolog, and roll your own. Whatever you do, you have an interesting learning period ahead of you.
Good luck
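As a very small illustration of what backtracking buys you here, a hand-rolled sketch in plain Java (not btyacc or Accent, and covering only one construct): it tries to read a '''bold''' span and falls back to literal text when the closing marker is missing, which wiki markup requires constantly because almost every token is also valid as plain text.

```java
public class BacktrackingSketch {
    private final String input;
    private int pos;

    public BacktrackingSketch(String input) {
        this.input = input;
    }

    public String parse() {
        StringBuilder out = new StringBuilder();
        while (pos < input.length()) {
            int mark = pos;                      // remember where we were
            String bold = tryBold();
            if (bold != null) {
                out.append("<b>").append(bold).append("</b>");
            } else {
                pos = mark;                      // backtrack
                out.append(input.charAt(pos++)); // consume one literal char
            }
        }
        return out.toString();
    }

    /** Returns the bold body, or null if ''' ... ''' does not parse here. */
    private String tryBold() {
        if (!input.startsWith("'''", pos)) return null;
        pos += 3;
        int close = input.indexOf("'''", pos);
        if (close < 0) return null;              // unterminated: caller backtracks
        String body = input.substring(pos, close);
        pos = close + 3;
        return body;
    }

    public static void main(String[] args) {
        System.out.println(new BacktrackingSketch("a '''b''' c ''' d").parse());
        // -> a <b>b</b> c ''' d   (the lone ''' falls back to literal text)
    }
}
```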
I once tried to write a parser for Boost.Quickbook, which is essentially the same as the wiki-text used by Wikipedia.
It was a very tedious process just to get some basics working, but I think it would eventually be possible to write an EBNF grammar for it. If you're interested, my partial parser is available online (the grammar is embedded in doc-strings).
This answer is a little out there, but what about rendering the text and then parsing the HTML DOM in order to figure out the different wiki components?
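A rough sketch of that idea, assuming you can obtain rendered HTML somehow (for example from MediaWiki's action=parse API or a local renderer) and use a DOM library such as jsoup to pick out the components:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

/**
 * Render-then-parse: take HTML already produced by a wiki renderer and
 * extract the parts of interest from the DOM with CSS selectors instead
 * of parsing the wikitext yourself.
 */
public class RenderedDomSketch {
    public static void main(String[] args) {
        // Stand-in for HTML fetched from a renderer.
        String html = "<h2>History</h2><p>See <a href=\"/wiki/Java\">Java</a>.</p>"
                + "<table class=\"infobox\"><tr><td>Paradigm</td><td>OO</td></tr></table>";

        Document doc = Jsoup.parse(html);

        // Section headings
        for (Element h : doc.select("h2, h3")) {
            System.out.println("Heading: " + h.text());
        }
        // Internal links
        for (Element a : doc.select("a[href^=/wiki/]")) {
            System.out.println("Link: " + a.attr("href") + " (" + a.text() + ")");
        }
        // Infobox rows
        for (Element row : doc.select("table.infobox tr")) {
            System.out.println("Infobox row: " + row.text());
        }
    }
}
```

The trade-off is that you recover the rendered structure rather than the source markup, so anything that only exists pre-expansion (template names, comments, original parameter values) is lost.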