BBCode to HTML transformation rules
Background
I have written a very simple BBCode parser in C# that transforms BBCode to HTML. Currently it supports only the [b], [i] and [u] tags. I know that BBCode is always considered valid, regardless of what the user has typed. I cannot find a strict specification of how to transform BBCode to HTML.
Question
- Does a standard "BBCode to HTML" specification exist?
- How should I handle "[b][b][/b][/b]"? For now the parser yields "<b>[b][/b]</b>".
- How should I handle "[b][i][u]zzz[/b][/i][/u]" input? Currently my parser is smart enough to produce "<b><i><u>zzz</u></i></b>" for such a case, but I wonder whether this approach is "too smart".
More details
I have found some ready-to-use BBCode parser implementations, but they are too heavy/complex for me and, what is worse, they use tons of regular expressions and do not produce the markup I expect. Ideally, I want to receive XHTML at the output. For inferring "BBCode to HTML" transformation rules I am using this online parser: http://www.bbcode.org/playground.php. It produces HTML that is intuitively correct in my opinion. The only thing I dislike is that it does not produce XHTML. For example, "[b][i]zzz[/b][/i]" is transformed to "<b><i>zzz</b></i>" (note the order of the closing tags). Firebug, of course, shows this as "<b><i>zzz</i></b><i></i>". As I understand it, browsers fix such cases of wrongly ordered closing tags, but I am in doubt:
- Should I rely on this browser feature and not try to produce XHTML?
- Maybe "[b][i]zzz[/b]ccc[/i]" must be understood as "<b>[i]zzz</b>ccc[/i]". That looks logical for such improper formatting, but it conflicts with the output of popular forums' BBCode (which render bold-italic "zzz" followed by italic "ccc", not bold "[i]zzz" followed by a literal "ccc[/i]").
Thanks.
Comments (3)
On your first question, I don't think that relying on browsers to correct any kind of mistake is a good idea, regardless of the scope of your project (well, maybe except when you're actually doing bug tests on the browser itself). Some browsers might do an awesome job of that while others might fail miserably. The best way to make sure the output syntax is correct (or at least as correct as possible) is to send it to the browser with correct syntax in the first place.
Regarding your second question, since you're trying to have correct BBCode converted to correct HTML, if your input is [b][i]zzz[/b]ccc[/i], its correct HTML equivalent would be <i><b>zzz</b>ccc</i> and not <b>[i]zzz</b>ccc[/i]. And this is where things get complicated, as you would no longer be writing just a converter but also a syntax checker/corrector. I have written a similar script in PHP for a rather weird game engine scripting language, but the logic could easily be applied to your case. Basically, I had a flag set for each opening tag and checked whether the closing tag was in the right position. Of course, this gives limited functionality, but for what I needed it did the trick. If you need more advanced search patterns, I think you're stuck with regex.
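To make the "check whether the closing tag is in the right position" idea concrete, here is a minimal sketch. It is not the PHP script mentioned above. It is written in Python for brevity, though the logic should port to C# directly. The function name `bbcode_to_html`, the use of a stack instead of per-tag flags, and the close-and-reopen repair strategy are all assumptions made for illustration.

```python
import re
from html import escape

# Only the three tags from the question are recognised.
TOKEN = re.compile(r"\[(/?)([biu])\]", re.IGNORECASE)

def bbcode_to_html(text):
    out = []
    open_tags = []  # stack of tag names that are currently open
    pos = 0
    for m in TOKEN.finditer(text):
        out.append(escape(text[pos:m.start()]))  # plain text before the tag
        pos = m.end()
        closing, name = m.group(1) == "/", m.group(2).lower()
        if not closing:
            open_tags.append(name)
            out.append(f"<{name}>")
        elif name in open_tags:
            # Closer out of position: close the tags opened after it,
            # close the tag itself, then reopen the others in order.
            reopened = []
            while True:
                top = open_tags.pop()
                out.append(f"</{top}>")
                if top == name:
                    break
                reopened.append(top)
            for tag in reversed(reopened):
                open_tags.append(tag)
                out.append(f"<{tag}>")
        else:
            # Assumption: a closer with no matching opener is shown literally.
            out.append(escape(m.group(0)))
    out.append(escape(text[pos:]))
    while open_tags:  # close whatever is still open at the end
        out.append(f"</{open_tags.pop()}>")
    return "".join(out)
```

With this sketch, "[b][i]zzz[/b]ccc[/i]" becomes "<b><i>zzz</i></b><i>ccc</i>". That is not the same string as <i><b>zzz</b>ccc</i>, but it is well nested and should render the same way. The trade-off is that badly ordered input can leave empty elements such as "<i></i>" in the output.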
If you're only going to implement B, I and U, which aren't terribly important tags, why not simply have a counter for each of those tags: +1 each time it is opened, and -1 each time it's closed?
At the end of a forum post (or whatever), if there are still-open tags, simply close them. If the user puts in invalid BBCode, it may look strange for the duration of their post, but it won't be disastrous.
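The counter idea could look roughly like the sketch below, again in Python for brevity. The name `convert_with_counters` and the choice to show an unmatched closing tag literally are assumptions, not something the answer specifies.

```python
import re
from html import escape

TOKEN = re.compile(r"\[(/?)([biu])\]")

def convert_with_counters(text):
    depth = {"b": 0, "i": 0, "u": 0}  # one counter per supported tag
    out = []
    pos = 0
    for m in TOKEN.finditer(text):
        out.append(escape(text[pos:m.start()]))
        pos = m.end()
        closing, name = m.group(1) == "/", m.group(2)
        if not closing:
            depth[name] += 1
            out.append(f"<{name}>")
        elif depth[name] > 0:
            depth[name] -= 1
            out.append(f"</{name}>")
        else:
            # Assumption: a closer with nothing to close is shown literally.
            out.append(escape(m.group(0)))
    out.append(escape(text[pos:]))
    # Close whatever is still open. Counters do not record nesting order,
    # so the closers are simply emitted in the fixed order b, i, u.
    for name, count in depth.items():
        out.append(f"</{name}>" * count)
    return "".join(out)
```

Note that counters only guarantee that every opened tag is eventually closed. They do not repair nesting order, so "[b][i]zzz[/b][/i]" still comes out as "<b><i>zzz</b></i>", and the browser is left to sort that out.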
Regarding invalid user-submitted markup, you have at least three options:
I don't recommend 3. It gets really tricky really fast. 1 and 2 are both reasonable options.
As for how to parse BBCode, I strongly recommend against using regex. BBCode is actually a fairly complex language. Most significantly, it supports nesting of tags. Regex can't handle arbitrary nesting. That's one of the fundamental limitations of regex. That makes it a bad choice for parsing languages like HTML and BBCode.
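A small illustration of the nesting problem, using Python's `re` module. The two patterns are just the obvious single-pass substitutions one might try first, and are assumptions for the sake of the example. The lazy version breaks on nested tags and the greedy version breaks on sibling tags.

```python
import re

lazy = re.compile(r"\[b\](.*?)\[/b\]")   # stops at the first [/b]
greedy = re.compile(r"\[b\](.*)\[/b\]")  # runs to the last [/b]

def replace_bold(pattern, text):
    return pattern.sub(r"<b>\1</b>", text)

nested = "[b][b]x[/b][/b]"
siblings = "[b]a[/b] and [b]c[/b]"

# Lazy matching pairs the outer opener with the inner closer.
lazy_nested = replace_bold(lazy, nested)          # "<b>[b]x</b>[/b]"

# Greedy matching swallows everything between the first and last tag.
greedy_siblings = replace_bold(greedy, siblings)  # "<b>a[/b] and [b]c</b>"
```

Neither pattern can tell which closer belongs to which opener, because that requires tracking depth.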
For my own project, rbbcode, I use a parsing expression grammar (PEG). I recommend using something similar. In general, these types of tools are called "compiler compilers," "compiler generators," or "parser generators." Using one of these is probably the sanest approach, as it allows you to specify the grammar of BBCode in a clean, readable format. You'll have fewer bugs this way than if you use regex or attempt to build your own state machine.
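To give a feel for what a grammar-driven parser does, here is a hand-written recursive-descent sketch in Python that follows a tiny PEG-style grammar. It is not rbbcode's grammar and it does not use a parser generator. The grammar in the comments, the name `parse_bbcode`, and the rule that an unmatched tag falls back to literal text are all assumptions for illustration.

```python
from functools import lru_cache
from html import escape

TAGS = ("b", "i", "u")

def parse_bbcode(src):
    @lru_cache(maxsize=None)  # packrat-style memoisation, keyed by position
    def parse_tag(pos):
        # tag <- "[" name "]" elements "[/" name "]"
        for name in TAGS:
            opener, closer = f"[{name}]", f"[/{name}]"
            if src.startswith(opener, pos):
                inner, end = parse_elements(pos + len(opener), closer)
                if src.startswith(closer, end):
                    return f"<{name}>{inner}</{name}>", end + len(closer)
                return None  # no matching closer: the rule fails
        return None

    def parse_elements(pos, stop=None):
        # elements <- (!stop (tag / char))*
        out = []
        while pos < len(src) and not (stop and src.startswith(stop, pos)):
            result = parse_tag(pos)
            if result is None:
                out.append(escape(src[pos]))  # ordered choice: fall back to char
                pos += 1
            else:
                html, pos = result
                out.append(html)
        return "".join(out), pos

    return parse_elements(0)[0]
```

Because a tag only matches when its own closer is found, the output is always well nested, and anything that does not fit the grammar stays as literal text. For example, "[b][i]zzz[/b]ccc[/i]" becomes "[b]<i>zzz[/b]ccc</i>". Very deeply nested input could hit Python's recursion limit, which a generated parser may or may not handle better.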