Regular expression transformation of wiki-like markup
Consider the following mark-up input:
* Line 1 * Line 2 :* Line 2.1 :* Line 2.2 * Line 3
This is typically coded as:
<ul> <li>Line 1</li> <li>Line 2</li> <ul> <li>Line 2.1</li> <li>Line 2.2</li> </ul> <li>Line 3</li> </ul>
My questions:
- What would be a good representation for the same input using a single line?
- What is the regular expression to generate the corresponding XHTML?
For example, the single line input format could be:
> Line 1 > Line 2 >> Line 2.1 >> Line 2.2 > Line 3
With > being the unordered list item delimiter. I chose > because the text might include typical punctuation marks. Using » (or other such non-104-key keys) would be fun, but not as easy to type.
The line input format could also be:
[Line 1][Line 2 [Line 2.1][Line 2.2]][Line 3]
Update #1 - The problem is a little simpler. The number of nests can be limited to three. A general solution for n-levels deep would still be cool.
Update #2 - XHTML, not HTML.
Update #3 - Another possible input format.
Update #4 - Java solutions (or pure regex) are most welcome.
Update #5
Revised code:
String in = " * Line 1 * Line 2 > * Line 2.1 * Line 2.2 < * Line 3";
String sub = "<ul>" + in.replace( " > ", "<ul>" ) + "</ul>";
sub = sub.replace( " < ", "</ul>" );
sub = sub.replaceAll( "( | >)\\* ([^*<>]*)", "<li>$2</li>" );
System.out.println( "Result: " + sub );
Prints the following:
Result: <ul><li>Line 1 </li>* Line 2<ul>* Line 2.1<li>Line 2.2</li></ul>* Line 3
Comments (3)
Your example seems fine to me.
Unfortunately, pure RegEx can't keep track of which nesting level you are on, so it won't know where to put the /UL close tags.
Something like this might work:
Here, the greater-than and less-than move up and down the hierarchy, and the asterisks are the delimiters for the bullets. The spaces before and after each are used as a sort of escape sequence, so you can still use those characters literally or for other purposes like italics and bold when they aren't surrounded by spaces.
A stab at the RegEx:
Edit: Adjusted to produce XHTML, closing the LI tags, based on comment below. Also fixed my C# syntax.
Final edit: I think the \* and \2 in the last Replace need to be escaped for C#; fixed. Also, note that the first two Replace() calls can use String.Replace() rather than RegEx, which will likely be faster.
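A minimal Java sketch of this replace-chain idea, assuming the " * ", " > " and " < " delimiters described above. The class and method names are hypothetical, and this is not the answerer's original C# code. It emits the same structure as the target in the question, with the nested list as a sibling of the preceding item.

```java
final class ReplaceChainSketch {

    // Hypothetical helper: converts " * a * b > * b.1 < * c" into an XHTML list.
    static String toXhtml(String in) {
        // Step 1: turn the hierarchy markers into tags. The trailing space is kept
        // so that a following " * " delimiter still has its leading space.
        String s = in.replace(" > ", "<ul> ").replace(" < ", "</ul> ");
        // Step 2: wrap every " * item" in <li>...</li>. The lookahead ends the lazy
        // match in front of the next delimiter or tag without consuming it.
        s = s.replaceAll(" \\* (.*?)(?= \\* |</?ul>|$)", "<li>$1</li>");
        // Step 3: wrap the whole fragment in the outer list.
        return "<ul>" + s + "</ul>";
    }

    public static void main(String[] args) {
        String in = " * Line 1 * Line 2 > * Line 2.1 * Line 2.2 < * Line 3";
        System.out.println(toXhtml(in));
        // prints <ul><li>Line 1</li><li>Line 2</li><ul><li>Line 2.1</li><li>Line 2.2</li></ul><li>Line 3</li></ul>
    }
}
```

The lookahead is the part that matters: in the code from Update #5, the pattern ([^*<>]*) also swallows the space after each item, so the next " * " no longer has a leading space to match and every other item is skipped. This sketch does not handle item text that itself contains "<ul>" or "</ul>", or two hierarchy markers in a row.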
I would not recommend using regular expressions as a parsing and transformation tool. Regular expressions tend to have high overhead, and are not the most efficient means of parsing a language...which is what you are really asking it to do. You have created a language, simple as it is, and you should treat it as such. I recommend writing an actual, dedicated parser for your WIKI-style formatting code. Since you can target the parser specifically to your language, it should be more efficient. In addition, you won't have to create some frightening monstrosity that is a regex to parse your language and handle all of its nuances. In the long run, you gain the benefits of clearer code, better maintainability, etc.
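As a rough illustration of what such a dedicated parser could look like, here is a small Java sketch for the "> Line 1 > Line 2 >> Line 2.1" format from the question. It assumes that a word made only of > characters is a marker whose length is the nesting level; the class and method names are hypothetical. Unlike the target in the question, it places each nested list inside its parent li element, which is what XHTML requires of ul children. Item text is not escaped.

```java
final class DepthListParser {

    // Hypothetical helper: converts "> A > B >> B.1 > C" into nested XHTML lists.
    static String toXhtml(String in) {
        StringBuilder out = new StringBuilder();
        StringBuilder text = new StringBuilder(); // words of the item being read
        int depth = 0;                            // number of <ul> elements currently open

        for (String word : in.trim().split(" ")) {
            if (word.isEmpty()) {
                continue; // skip runs of spaces
            }
            if (!isMarker(word)) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(word);
                continue;
            }
            int level = word.length();
            out.append(text); // flush the text of the previous item
            text.setLength(0);
            if (level > depth) {
                // Going deeper: open lists inside the item that is still open.
                while (depth < level) {
                    out.append("<ul>");
                    depth++;
                    if (depth < level) {
                        out.append("<li>"); // placeholder item for a skipped level
                    }
                }
            } else {
                // Same level or shallower: close the item and any deeper lists.
                out.append("</li>");
                while (depth > level) {
                    out.append("</ul></li>");
                    depth--;
                }
            }
            out.append("<li>");
        }
        out.append(text);
        while (depth > 0) {
            out.append("</li></ul>");
            depth--;
        }
        return out.toString();
    }

    // True when the (non-empty) word consists only of '>' characters.
    private static boolean isMarker(String word) {
        for (int i = 0; i < word.length(); i++) {
            if (word.charAt(i) != '>') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(toXhtml("> Line 1 > Line 2 >> Line 2.1 >> Line 2.2 > Line 3"));
        // prints <ul><li>Line 1</li><li>Line 2<ul><li>Line 2.1</li><li>Line 2.2</li></ul></li><li>Line 3</li></ul>
    }
}
```

Because the depth counter is explicit, the same loop handles any number of levels, which covers the n-levels case from Update #1 without a more complicated pattern.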
I suggest the following resources:
Solution
A working solution follows:
This creates the desired XHTML fragment: