Where can I find a good MediaWiki markup parser in PHP?
I would try hacking MediaWiki's code a little, but I figured that would be unnecessary if I could get a standalone parser.
Can anyone help me with this?
Thanks.
Comments (2)
Ben Hughes is right. It's very difficult to get right, especially if you want to parse real articles from big wikis like Wikipedia itself with 100% accuracy. It is discussed frequently on the wikitech mailing list, and no alternative parser has come up with the goods despite many attempts.
Firstly, it's not really a parser, in that it has no concept of an AST (abstract syntax tree). It's a converter that specifically converts wikitext to HTML.
Secondly, don't fall into the trap of thinking of wikitext as a markup language that can occasionally be extended with HTML. You must think of it as an extension to HTML. It is much easier to add wikitext support to an HTML parser than to add HTML support to a wikitext parser.
What this boils down to is that if you want any other format you will need to convert from HTML to that format.
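To illustrate what converting from HTML to another format can look like, here is a minimal sketch that uses PHP's built-in DOM extension to turn a small HTML fragment into Markdown-style text. The sample HTML, the `renderNode` helper and the handful of tags it handles are all assumptions made for this example; real MediaWiki output contains far more markup than this covers.

```php
<?php
// Hypothetical sample of already-rendered HTML; real parser output is far messier.
$html = '<p>Hello <b>wiki</b> world. See <a href="https://example.org/">the link</a>.</p>';

$doc = new DOMDocument();
// Suppress libxml warnings about imperfect markup.
$doc->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);

// Hypothetical helper: walk the DOM tree and emit Markdown-style text.
function renderNode(DOMNode $node): string
{
    // Text nodes are copied through unchanged.
    if ($node instanceof DOMText) {
        return $node->nodeValue;
    }

    // Render all children first, then wrap the result according to the tag.
    $inner = '';
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            $inner .= renderNode($child);
        }
    }

    if (!$node instanceof DOMElement) {
        return $inner;
    }

    switch ($node->nodeName) {
        case 'b':
        case 'strong':
            return '**' . $inner . '**';
        case 'i':
        case 'em':
            return '*' . $inner . '*';
        case 'a':
            return '[' . $inner . '](' . $node->getAttribute('href') . ')';
        case 'p':
            return $inner . "\n\n";
        default:
            // Unknown tags are dropped, keeping only their contents.
            return $inner;
    }
}

// loadHTML() wraps the fragment in <html><body>, so start from the body element.
$body = $doc->getElementsByTagName('body')->item(0);
$markdown = trim(renderNode($body));

echo $markdown . "\n";
```

The same recursive walk could target any other output format by changing what each `case` returns.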
The usual line is that only MediaWiki can parse wikitext. And yes, the parser is tightly integrated with the rest of the code. Experienced MediaWiki hackers do not react well to questions about isolating the parser - I've tried (-:
But I've gone ahead and isolated it anyway. It's not complete or ready to share with anybody yet. Basically, you want to start with the MediaWiki source not installed and not connected to a database or web server. Make a PHP stub program that includes the parser and calls an entry point. Check the error when it fails to run, and make a phony stub for the class, function, or global that was accessed. Repeat until you have stubbed most of the places where the parser interacts with the rest of MediaWiki.
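The stub-and-repeat loop described above can be partly automated for classes. The following is only a sketch of that idea, not the code from my isolated variant: it registers an autoloader that records every class PHP cannot find and defines an empty stand-in for it. `HypotheticalDatabase` and `HypotheticalHooks` are made-up names standing in for whatever the parser really touches, and no actual MediaWiki files are included here.

```php
<?php
// Names of classes that had to be faked, in the order they were first requested.
$missingClasses = [];

spl_autoload_register(function (string $class) use (&$missingClasses): void {
    $missingClasses[] = $class;

    // Split a namespaced name into its namespace and short class name.
    $pos = strrpos($class, '\\');
    $namespace = $pos === false ? '' : substr($class, 0, $pos);
    $short = $pos === false ? $class : substr($class, $pos + 1);

    // Define a do-nothing stand-in: any method call, static or not, returns null.
    $body = "class $short {
        public function __construct(...\$args) {}
        public function __call(\$name, \$args) { return null; }
        public static function __callStatic(\$name, \$args) { return null; }
    }";
    eval($namespace === '' ? $body : "namespace $namespace { $body }");
});

// Hypothetical stand-in for parser code reaching into the rest of the application.
$db = new HypotheticalDatabase();
$row = $db->selectRow('page', '*');
HypotheticalHooks::run('ParserBeforeStrip');

// The recorded names tell you which real stubs still need writing by hand.
print_r($missingClasses);
```

Autoloading only catches missing classes. Missing functions and globals still fail with an error and have to be stubbed by hand, and a stub that returns null everywhere is just a starting point until you know what the parser expects back.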
The problem then is keeping your hacked, stubbed variant in sync. The source tree changes quickly, the live wikis adopt parser changes very quickly, and your variant will have to keep up if it is to keep working in the future.
Check out my feature request: Bug 25984 - Isolate parser from database dependencies
It's actually an incredibly difficult format to parse. You can try to separate the parser component out of MediaWiki (as it is also PHP), but it is a tangled mess. I've seen a few partial standalone ones that do a nearly reasonable job for a very limited subset of the markup.
If you happen to implement one, or refactor the current Wikipedia one, let me know, as it could be quite useful.