
Is there a clever way to parse plain-text lists into HTML?

Posted on 2024-07-24 14:08:02 | 1703 characters | 6 views | 0 comments


Question: Is there a clever way to parse plain-text lists into HTML?

Or, must we resort to esoteric recursive methods, or sheer brute force?

I've been wondering this for a while now. In my own ruminations I have come back again and again to the brute-force, and odd recursive, methods ... but it always seems so clunky. There must be a better way, right?

So what's the clever way?

Assumptions

It is necessary to set up a scenario, so these are my assumptions.

Lists may be nested 3 levels deep (at a minimum), of either unordered or ordered lists. The list type and depth are controlled by the prefix:
1. There is a mandatory space following the prefix.
2. List depth is controlled by how many non-space characters there are in the prefix; ***** would be nested five lists deep.
3. List type is enforced by character type, * or - being an unordered list, # being an ordered list.
Items are separated by only 1 \n character. (Let's pretend two consecutive newlines qualify as a "group", a paragraph, div, or some other HTML tag like in Markdown or Textile.)
List types may be freely mixed.
Output should be valid HTML 4, preferably with ending </li>s.
Parsing can be done with, or without, regex as desired.

Sample Markup

* List
*# List
** List
**# List
** List

# List
#* List
## List
##* List
## List

Desired Output

Broken up a bit for readability, but it should be a valid variation of this (remember, that I'm just spacing it nicely!):

<ul>
  <li>List</li>
  <li>
    <ol><li>list</li></ol>
    <ul><li>List</li></ul>
  </li>
  <li>List</li>
  <li>
    <ol><li>List</li></ol>
  </li>
  <li>List</li>
</ul>


<ol>
  <li>List</li>
  <li>
    <ul><li>list</li></ul>
    <ol><li>List</li></ol>
  </li>
  <li>List</li>
  <li>
    <ul><li>List</li></ul>
  </li>
  <li>List</li>
</ol>

In Summary

Just how do you do this? I'd really like to understand the good ways to handle unpredictably recursing lists, because it strikes me as an ugly mess for anyone to tangle with.


此生挚爱伱 2024-07-31 14:08:02

Basic iterative technique:

A regex or some other simple parser that'll recognize the format for a list, capturing each list item (including those with additional levels of indentation).
A counter to keep track of the current indentation level.
Logic to iterate through each capture, writing out <li>s and inserting appropriate begin / end tags (<ol></ol>, <ul></ul>) and incrementing / decrementing the indentation counter whenever the current indentation level is greater or less than the previous one.

Edit: Here's a simple expression that'll probably work for you with a bit of tweaking: each match is a top-level list, with two sets of named captures, the markers (char count is indentation level, last char indicates desired list type) and the list item text.

(?:(?:^|\n)[\t ]*(?<marker>[*#]+)[\t ]*(?<text>[^\n\r]+)\r*(?=\n|$))+
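For anyone trying this in Python, here is a rough sketch of the first two steps (the capture and the indentation counter). It is an adaptation, not the expression above verbatim: Python spells named groups as `(?P<name>...)`, and its `re` module keeps only the last capture of a repeated group, so the sketch matches one item at a time instead of one whole list per match. The names `ITEM`, `list_items` and `depth_changes` are made up for illustration.

```python
import re

# Per-item variant of the expression above, using Python's named-group syntax.
ITEM = re.compile(
    r"^[\t ]*(?P<marker>[*#]+)[\t ]+(?P<text>[^\r\n]+)",
    re.MULTILINE,
)


def list_items(source):
    """Yield (depth, list_tag, text) for every list item found in source."""
    for match in ITEM.finditer(source):
        marker = match.group("marker")
        # The character count is the depth; the last character picks the type.
        tag = "ol" if marker[-1] == "#" else "ul"
        yield len(marker), tag, match.group("text").rstrip()


def depth_changes(source):
    """Pair each item with how far the indentation counter moves for it."""
    level = 0
    changes = []
    for depth, tag, text in list_items(source):
        changes.append((depth - level, tag, text))
        level = depth
    return changes


if __name__ == "__main__":
    for change in depth_changes("* A\n*# B\n** C\n* D"):
        print(change)
```

A positive change means "open that many lists", a negative one means "close that many". Note that a change of 0 can still hide a switch of list type at the same depth (`*#` followed by `**`), so a counter alone is not quite enough for freely mixed lists.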


寒尘 2024-07-31 14:08:02

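A line-by-line solution with some Pythonic concepts:

cur = ''
for line in lines():
    prev = cur
    # fictional helper: splits "**# text" into the marker "**#" and "text"
    cur, text = split_line_into_marker_and_remainder(line)
    if cur and cur == prev:
        # same marker as the previous line: start a sibling item
        print('</li><li>')
    else:
        # fictional helper: strips the prefix shared by both markers
        nprev, ncur = kill_common_beginning(prev, cur)
        # close what is left of the previous marker, innermost list first
        for c in reversed(nprev): print('</li>' + ('</ol>' if c == '#' else '</ul>'))
        # open what is left of the current marker
        for c in ncur: print(('<ol>' if c == '#' else '<ul>') + '<li>')
    print(text)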

This is how it works: to process the line, I compare the marker for previous line with the marker for this line.

I use a fictional function split_line_into_marker_and_remainder, which returns two results, marker cur and the text itself. It's trivial to implement it as a C++ function with 3 arguments, an input and 2 output strings.

At the core is a fictional function kill_common_beginning which would take away the repeat part of prev and cur. After that, I need to close everything that remains in previous marker and open everything that remains in current marker. I can do it with a replace, by mapping characters to string, or by a loop.

The three lines will be pretty straightforward in C++:

char * saved = prev;
for (; *prev && (*prev == *cur);  prev++, cur++ ); // "kill_common_beginning"
while (*prev) *(prev++) == '#' ? ...
while (*cur)  *(cur++) == '#' ? ...
cur = saved;

Note, however, that there is a special case: when the indentation didn't change, those lines don't output anything. That's fine if we're outside of the list, but that's not fine in the list: so in that case we should output the </li><li> manually.
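The pseudocode above leaves both helpers fictional, so here is a minimal runnable Python sketch of the same common-prefix idea. Some assumptions that are not in the answer: `split_marker` and `render` are illustrative names, `os.path.commonprefix` stands in for `kill_common_beginning`, `-` is treated like `*`, a sibling `</li><li>` is also emitted when returning to a shallower level, and any lists still open at the end of the input are closed.

```python
import os


def split_marker(line):
    """Split '*# text' into the marker '*#' and the text 'text'."""
    marker, sep, text = line.partition(" ")
    if marker and sep and set(marker) <= set("*#-"):
        return marker.replace("-", "*"), text
    return "", line  # not a list item


def close_tag(c):
    return "</li>" + ("</ol>" if c == "#" else "</ul>")


def open_tag(c):
    return ("<ol>" if c == "#" else "<ul>") + "<li>"


def render(lines):
    out = []
    prev = ""
    for line in lines:
        cur, text = split_marker(line)
        # Length of the prefix shared by the previous and current markers.
        common = len(os.path.commonprefix([prev, cur]))
        closing, opening = prev[common:], cur[common:]
        # Close the leftover part of the previous marker, innermost first.
        out.extend(close_tag(c) for c in reversed(closing))
        if cur and not opening:
            # Same or shallower level: this line is a sibling item.
            out.append("</li><li>")
        # Open the leftover part of the current marker.
        out.extend(open_tag(c) for c in opening)
        out.append(text)
        prev = cur
    # Close whatever is still open at the end of the input.
    out.extend(close_tag(c) for c in reversed(prev))
    return "".join(out)


if __name__ == "__main__":
    print(render(["* A", "** B", "* C"]))
```

With this approach a nested list lands inside the `<li>` of the item above it, which is valid HTML 4 but a slightly different shape from the spaced-out sample in the question.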


香橙ぽ 2024-07-31 14:08:02

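The best explanation I have seen is in Mark Jason Dominus's Higher-Order Perl. The full text is available online: http://hop.perl.plover.com/book/.

Although the examples are all in Perl, the breakdown of the logic behind each area is fantastic.

Chapter 8 (warning: PDF link) is devoted specifically to parsing, although the lessons throughout the book are relevant.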


烟雨凡馨 2024-07-31 14:08:02

Check out Textile.

It is available in several languages.


江湖正好 2024-07-31 14:08:02

This is how you can do it with a regexp and a loop (^ stands for newline, $ for end of line):

do { 
    ^#anything$ -> <ol><li>$^anything</li></ol>$
    ^*anything$ -> <ul><li>$^anything</li></ul>$
} while any of those above applies

do {
    </ol><ol> -> 
    </ul><ul> -> 
    </li><li> -> 
} while any of those above applies

This makes it much simpler than a simple regexp. The way it works: you first expand each line as if it were isolated, then eat the redundant list tags.
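A possible Python sketch of this expand-then-collapse idea, with one deliberate deviation: deleting every `</li><li>` would also glue sibling items together, so the sketch marks the outer "scaffolding" items with a made-up `<LI>` placeholder and only merges those. The placeholder and the names `expand`, `collapse` and `to_html` are assumptions for illustration, and item text is assumed to contain no HTML tags.

```python
import re


def expand(line):
    """Wrap one line in a complete list structure, as if it stood alone."""
    marker, sep, text = line.partition(" ")
    if not (marker and sep and set(marker) <= set("*#")):
        return line  # not a list item: leave it untouched
    html = text
    # Wrap from the innermost marker character outwards.
    for i, c in enumerate(reversed(marker)):
        tag = "ol" if c == "#" else "ul"
        # The innermost <li> holds the text; outer ones are scaffolding and
        # get the hypothetical <LI> placeholder so they can be merged later.
        li = "li" if i == 0 else "LI"
        html = f"<{tag}><{li}>{html}</{li}></{tag}>"
    return html


# Close/open pairs that are redundant when two neighbouring lines share structure.
COLLAPSE = re.compile(r"</ol><ol>|</ul><ul>|</LI><LI>|</li><LI>")


def collapse(html):
    """Delete redundant close/open pairs until none are left."""
    while True:
        html, count = COLLAPSE.subn("", html)
        if count == 0:
            break
    # Turn the remaining placeholders back into ordinary list items.
    return html.replace("<LI>", "<li>").replace("</LI>", "</li>")


def to_html(lines):
    return collapse("".join(expand(line) for line in lines))


if __name__ == "__main__":
    print(to_html(["* A", "** B", "* C"]))
```

The loop is needed because removing one pair can bring the next pair together, for example `</ul><ul>` disappearing leaves `</LI><LI>` adjacent.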


‖放下 2024-07-31 14:08:02

Here is my own solution, which seems to be a hybrid of Shog9's suggestions (a variation on his regex, Ruby doesn't support named matches) and Ilya's iterative method. My working language was Ruby.

Some things of note: I used a stack-based system, and "String#scan(pattern)" is really just a "match-all" method that returns an array of matches.

def list(text)
  # returns [['*','text'],...]
  parts = text.scan(/(?:(?:^|\n)([#*]+)[\t ]*(.+)(?=\n|$))/)

  # returns ul/ol based on the byte passed in
  list_type = lambda { |c| (c == '*' ? 'ul' : 'ol') }

  prev = []
  tags = [list_type.call(parts[0][0][0].chr)]
  result = parts.inject("<#{tags.last}><li>") do |output,newline|
    unless prev.count == 0
      # the following comparison says whether added or removed,
      # this is the "how much"
      diff = (prev[0].length - newline[0].length).abs
      case prev[0].length <=> newline[0].length
        when -1 # new tags to add
          # the last `diff` marker characters are the newly opened levels
          part = newline[0][-diff..-1]
          part.each_char do |c|
            tags << list_type.call(c)
            output << "<#{tags.last}><li>"
          end
        when 0 # no new tags... but possibly changed
          if newline[0] == prev[0]
            output << '</li><li>'
          else
            STDERR.puts "Bad input string: #{newline.join(' ')}"
          end
        when 1 # tags removed
          diff.times{ output << "</li></#{tags.pop}>" }
          output << '</li><li>'
      end
    end

    prev = newline
    output + newline[1]
  end

  tags.reverse.each { |t| result << "</li></#{t}>" }
  result
end

Thankfully this code does work and generate valid HTML. And this did turn out better than I had anticipated. It doesn't even feel clunky.


爱，才寂寞 2024-07-31 14:08:02

This Perl program is a first attempt at that.

#! /usr/bin/env perl
use strict;
use warnings;
use 5.010;

my $data = [];
while( my $line = <> ){
  last if $line =~ /^[.]{3,3}$/;
  my($nest,$rest) = $line =~ /^([\#*]*)\s+(.*)$/x;
  my @nest = split '', $nest;

  if( @nest ){
    recourse($data,$rest,@nest);
  }else{
    push @$data, $line;
  }
}

de_recourse($data);

sub de_recourse{
  my($ref) = @_;
  my %de_map = (
    '*' => 'ul',
    '#' => 'ol'
  );

  if( ref $ref ){
    my($type,@elem) = @$ref;
    if( ref $type ){
      for my $elem (@$ref){
        de_recourse($elem);
      }
    }else{
      $type = $de_map{$type};

      say "<$type>";
      for my $elem (@elem){
        say "<li>";
        de_recourse($elem);
        say "</li>"
      }
      say "</$type>";
    }
  }else{
    print $ref;
  }
}

sub recourse{
  my($last_ref,$str,@nest) = @_;
  die unless @_ >= 2;
  die unless ref $last_ref;
  my $nest = shift @nest;

  if( @_ == 2 ){
    push @$last_ref, $str;
    return;
  }

  my $previous = $last_ref->[-1];
  if( ref $previous ){
    if( $previous->[0] eq $nest ){
      recourse( $previous,$str,@nest );
      return;
    }
  }

  my $new_ref = [ $nest ];
  push @$last_ref, $new_ref;
  recourse( $new_ref, $str, @nest );
}

Hope it helps
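For readers who do not speak Perl, the same build-a-tree-then-print approach might look roughly like this in Python. It is a loose sketch, not a line-for-line port: `insert` plays the role of `recourse`, `render` the role of `de_recourse`, the name `to_html` is made up, and the sketch returns a string instead of printing.

```python
TAGS = {"*": "ul", "#": "ol"}


def insert(node, marker, text):
    """Descend one marker character at a time, reusing the last child list
    when its type matches the next marker character."""
    if not marker:
        node.append(text)
        return
    kind, rest = marker[0], marker[1:]
    last = node[-1] if node else None
    if isinstance(last, list) and last[0] == kind:
        insert(last, rest, text)
    else:
        child = [kind]  # a list node is [type, item, item, ...]
        node.append(child)
        insert(child, rest, text)


def render(node):
    """Turn a nested [type, item, ...] structure into HTML."""
    if isinstance(node, str):
        return node
    kind, *children = node
    tag = TAGS[kind]
    items = "".join(f"<li>{render(child)}</li>" for child in children)
    return f"<{tag}>{items}</{tag}>"


def to_html(lines):
    root = []
    for line in lines:
        marker, sep, text = line.partition(" ")
        if marker and sep and set(marker) <= set(TAGS):
            insert(root, marker, text)
        else:
            root.append(line)  # not a list item: keep the line as-is
    return "".join(render(child) for child in root)


if __name__ == "__main__":
    print(to_html(["* A", "*# B", "* C"]))
```

Here a nested list becomes its own `<li>`, which is the shape shown in the question's desired output.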
