Is there a clever way to parse plain-text lists into HTML?
Question: Is there a clever way to parse plain-text lists into HTML?
Or, must we resort to esoteric recursive methods, or sheer brute force?
I've been wondering about this for a while now. In my own ruminations I have come back again and again to brute-force and odd recursive methods, but they always seem so clunky. There must be a better way, right?
So what's the clever way?
Assumptions
It is necessary to set up a scenario, so these are my assumptions.
Lists may be nested 3 levels deep (at a minimum), and may be either unordered or ordered. The list type and depth are controlled by the prefix:
- There is a mandatory space following the prefix.
- List depth is controlled by how many non-space characters there are in the prefix; ***** would be nested five lists deep.
- List type is determined by the character used: * or - for an unordered list, # for an ordered list.
Items are separated by only 1 \n character. (Let's pretend two consecutive newlines qualify as a "group": a paragraph, div, or some other HTML tag, as in Markdown or Textile.)
List types may be freely mixed.
Output should be valid HTML 4, preferably with ending </li>s.
Parsing can be done with, or without, regex as desired.
Sample Markup
* List
*# List
** List
**# List
** List
# List
#* List
## List
##* List
## List
Desired Output
Broken up a bit for readability, but it should be a valid variation of this (remember that I'm just spacing it nicely!):
<ul>
  <li>List</li>
  <li>
    <ol><li>List</li></ol>
    <ul><li>List</li></ul>
  </li>
  <li>List</li>
  <li>
    <ol><li>List</li></ol>
  </li>
  <li>List</li>
</ul>
<ol>
  <li>List</li>
  <li>
    <ul><li>List</li></ul>
    <ol><li>List</li></ol>
  </li>
  <li>List</li>
  <li>
    <ul><li>List</li></ul>
  </li>
  <li>List</li>
</ol>
In Summary
Just how do you do this? I'd really like to understand the good ways to handle unpredictably recursing lists, because it strikes me as an ugly mess for anyone to tangle with.
Comments (8)
Basic iterative technique: walk through the list items one by one, writing out <li>s, inserting the appropriate begin/end tags (<ol></ol>, <ul></ul>), and incrementing or decrementing an indentation counter whenever the current indentation level is greater or less than the previous one.
Edit: A simple regular expression will probably work for you with a bit of tweaking. Each match is a top-level list with two sets of named captures: the markers (char count is the indentation level, last char indicates the desired list type) and the list item text.
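The expression itself is not reproduced here, so what follows is only a minimal Python sketch of the named-capture idea, not the original regex. It assumes the prefix rules from the question, and it matches one item per match rather than a whole top-level list. The names ITEM_RE and tokenize are made up for this illustration.

```python
import re

# Assumed item shape: a run of marker characters, one space, then the text.
ITEM_RE = re.compile(r"^(?P<marker>[*#-]+) (?P<text>.*)$", re.MULTILINE)

def tokenize(source):
    """Yield (depth, list_tag, text) for every list item found in source."""
    for match in ITEM_RE.finditer(source):
        marker = match.group("marker")
        # The character count is the depth; the last character picks the type.
        list_tag = "ol" if marker[-1] == "#" else "ul"
        yield len(marker), list_tag, match.group("text")

for depth, list_tag, text in tokenize("* List\n*# List\n** List"):
    print(depth, list_tag, text)
```

The depth values are what the indentation counter would be compared against while writing out the tags.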
A line-by-line solution with some Pythonic concepts.
This is how it works: to process a line, I compare the marker of the previous line with the marker of the current line.
I use a fictional function split_line_into_marker_and_remainder, which returns two results: the marker cur and the text itself. It's trivial to implement as a C++ function with 3 arguments: an input string and 2 output strings.
At the core is a fictional function kill_common_beginning, which strips the part that prev and cur have in common. After that, I need to close everything that remains in the previous marker and open everything that remains in the current marker. I can do that with a replace, by mapping characters to strings, or with a loop.
Those three steps are pretty straightforward in C++.
Note, however, that there is a special case: when the indentation didn't change, those steps don't output anything. That's fine if we're outside of a list, but not inside one: in that case we should output the </li><li> manually.
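The prefix comparison described in this answer can be sketched in Python as follows. This is not the answer's own code: lists_to_html, LINE_RE and TAGS are hypothetical names, and the sketch keeps a stack of open lists instead of the raw previous marker. It also assumes that sublists nest inside the preceding item, and it treats every line that opens no new list (same marker, or a dedent) as the sibling case.

```python
import html
import re

LINE_RE = re.compile(r"^([*#-]+) (.*)$")  # assumed line shape: marker, space, text
TAGS = {"*": "ul", "#": "ol"}

def lists_to_html(source):
    out = []
    stack = []  # marker characters of the lists currently open, outermost first
    for line in source.splitlines():
        match = LINE_RE.match(line)
        marker = match.group(1).replace("-", "*") if match else ""
        # Length of the common beginning of the open lists and the new marker.
        common = 0
        while (common < min(len(stack), len(marker))
               and stack[common] == marker[common]):
            common += 1
        # Close everything that remains of the previous marker.
        while len(stack) > common:
            out.append("</li></%s>" % TAGS[stack.pop()])
        if not match:
            if line.strip():
                out.append(html.escape(line))  # non-list text passes through
            continue
        if common == len(marker):
            # Nothing new to open: this item is a sibling of an open item.
            out.append("</li><li>")
        else:
            # Open everything that remains of the current marker.
            for kind in marker[common:]:
                stack.append(kind)
                out.append("<%s><li>" % TAGS[kind])
        out.append(html.escape(match.group(2)))
    while stack:
        out.append("</li></%s>" % TAGS[stack.pop()])
    return "".join(out)

print(lists_to_html("* List\n*# List\n** List\n# List"))
```

Because each line only looks at the stack of open lists, the loop makes a single pass and needs no recursion.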
The best explanation I've seen is in Higher-Order Perl by Mark Jason Dominus. The full text is available online at http://hop.perl.plover.com/book/.
Though the examples are all in Perl, the breakdown of the logic behind each area is fantastic.
Chapter 8 (warning: PDF link) is specifically about parsing, though the lessons throughout the book are somewhat related.
Look at Textile.
It is available in a number of languages.
This is how you can do it with regexps and a loop (where ^ stands for the start of a line and $ for the end of a line). This is much simpler than trying to do it all in a single regexp. The way it works: you first expand each line as if it were isolated, and then eat the extra list markers.
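A rough Python sketch of this expand-then-eat idea, under the prefix rules from the question. The two collapsing rules are my own guess at what "eating extra list markers" means in practice, and expand_line, collapse_lists, LINE_RE and TAGS are hypothetical names.

```python
import html
import re

LINE_RE = re.compile(r"^([*#-]+) (.*)$")  # assumed line shape: marker, space, text
TAGS = {"*": "ul", "-": "ul", "#": "ol"}

def expand_line(line):
    """Expand one line as if it were the only item of its own nested list."""
    match = LINE_RE.match(line)
    if not match:
        # The newline keeps the lists on either side from being merged.
        return html.escape(line) + "\n"
    marker, text = match.groups()
    opening = "".join("<%s><li>" % TAGS[c] for c in marker)
    closing = "".join("</li></%s>" % TAGS[c] for c in reversed(marker))
    return opening + html.escape(text) + closing

def collapse_lists(source):
    result = "".join(expand_line(line) for line in source.splitlines())
    while True:
        previous = result
        # Eat the seam between two adjacent lists of the same type.
        result = re.sub(r"</(ul|ol)><\1>", "", result)
        # An item that starts with a sublist belongs to the previous item.
        result = re.sub(r"</li><li>(?=<[uo]l>)", "", result)
        if result == previous:
            return result

print(collapse_lists("* List\n*# List\n** List\n# List"))
```

Escaping the item text first means the collapsing patterns can only ever match tags that expand_line generated, and every substitution shortens the string, so the loop terminates.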
Here is my own solution, which seems to be a hybrid of Shog9's suggestions (a variation on his regex, since Ruby doesn't support named matches) and Ilya's iterative method. My working language was Ruby.
Some things of note: I used a stack-based system, and "String#scan(pattern)" is really just a "match-all" method that returns an array of matches.
Thankfully this code does work and generates valid HTML. It turned out better than I had anticipated, and it doesn't even feel clunky.
This Perl program is a first attempt at that.
Hope it helps
Try Gelatin. The syntax definition would probably be 5 lines or less.