How do I implement my own escape sequences when using split() in Perl?

Posted on 2024-09-16 11:14:03

I'm trying to write a parser for the EDI data format, which is just delimited text but where the delimiters are defined at the top of the file.

Essentially it's a bunch of split() calls based on values I read at the top of my code.
The problem is that there's also a custom 'escape character' that indicates I need to ignore the following delimiter.

For example, assuming * is the delimiter and ? is the escape, I'm doing something like:

use Data::Dumper;
my $delim = "*";
my $escape = "?";
my $edi = "foo*bar*baz*aster?*isk";

my @split = split("\\" . $delim, $edi);
print Dumper(\@split);

I need it to return "aster*isk" as the last element.

My original idea was to replace every instance of the escape character plus the following character with some custom-mapped unprintable ASCII sequence before I call my split() functions, then use another regexp to switch them back to the right values.

That is doable but feels like a hack, and will get pretty ugly once I do it for all 5 different potential delimiters. Each delimiter is potentially a regexp special char as well, leading to a lot of escaping in my own regular expressions.

Is there any way to avoid this, possibly with a special regexp passed to my split() calls?


Comments (4)

恏ㄋ傷疤忘ㄋ疼 2024-09-23 11:14:03
my @split = split( /(?<!\Q$escape\E)\Q$delim\E/, $edi);

will do the split for you, but you have to remove the escape characters separately:

s/\Q$escape$delim\E/$delim/g for @split;
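
Put together, the two steps above might look like the following minimal sketch. The wrapper name `split_lookbehind` is hypothetical and only there to make the example self-contained; the regexes are the ones from this answer.

```perl
use strict;
use warnings;

# split_lookbehind is a hypothetical helper name, not part of the answer above.
sub split_lookbehind {
    my ($edi, $delim, $escape) = @_;

    # Split on each delimiter that is not directly preceded by the escape character.
    my @fields = split /(?<!\Q$escape\E)\Q$delim\E/, $edi;

    # Remove the escape character in front of each escaped delimiter.
    s/\Q$escape$delim\E/$delim/g for @fields;

    return @fields;
}

# Prints foo, bar, baz and aster*isk, one per line.
print "$_\n" for split_lookbehind("foo*bar*baz*aster?*isk", "*", "?");
```

One limitation of the lookbehind is worth keeping in mind: it only looks at the single character before the delimiter, so an escaped escape character followed by a real delimiter (for example `a??*b`) is not split.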

Update: allowing the escape character to escape any character, including itself, and not just the delimiter, requires a different approach. Here's one way:

my @split = $edi =~ /(?:\Q$delim\E|^)((?:\Q$escape\E.|(?!\Q$delim\E).)*+)/gs;
s/\Q$escape\E(.)/$1/gs for @split;  # strip the escape from any escaped character, not only the delimiter

The possessive quantifier *+ requires Perl 5.10 or later. On older versions, use an atomic group instead:

/(?:\Q$delim\E|^)((?>(?:\Q$escape\E.|(?!\Q$delim\E).)*))/gs
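
As a rough, runnable illustration of the "escape anything" variant, here is a sketch built on the atomic-group form of the pattern so that it does not depend on `*+`. The sub name `split_any_escape` is made up for this example, and the unescape step assumes that the escape character should be dropped in front of whatever character it protects.

```perl
use strict;
use warnings;

# split_any_escape is a hypothetical helper name used only for this sketch.
sub split_any_escape {
    my ($edi, $delim, $escape) = @_;

    # Each match consumes a delimiter (or the start of the string) and then
    # captures one field: a run of "escape + any character" pairs and of
    # characters that are not the delimiter.
    my @fields = $edi =~
        /(?:\Q$delim\E|^)((?>(?:\Q$escape\E.|(?!\Q$delim\E).)*))/gs;

    # Assumption: the escape protects any following character, so drop it
    # wherever it appears in front of one.
    s/\Q$escape\E(.)/$1/gs for @fields;

    return @fields;
}

# "??" is an escaped escape, so the "*" after it is a real delimiter.
print "$_\n" for split_any_escape("a??*b*c?*d", "*", "?");
```

With this input the fields come out as `a?`, `b` and `c*d`. Note that, as written, the pattern does not report an empty field in front of a leading delimiter.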
冷情 2024-09-23 11:14:03

This is a bit tricky if you want to correctly handle the case where the escape character is the last character of a field. Here's one way:

# Process escapes to hide the following character:
$edi =~ s/\Q$escape\E(.)/sprintf '%s%d%s', $escape, ord $1, $escape/esg;

my @split = split( /\Q$delim\E/, $edi);

# Convert escape sequences into the escaped character:
s/\Q$escape\E(\d+)\Q$escape\E/chr $1/eg for @split;

Note that this assumes that neither the escape char nor the delimiter will be a digit, but it does support the full range of Unicode characters.
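
To see the three steps working together, here is a small self-contained sketch. The wrapper `split_with_placeholders` is a hypothetical name; the substitutions and the split are the ones shown above, applied to a copy of the input so the caller's string is left alone.

```perl
use strict;
use warnings;

# split_with_placeholders is a hypothetical wrapper name for this sketch.
sub split_with_placeholders {
    my ($edi, $delim, $escape) = @_;

    # Hide each escaped character as <escape><code point><escape>,
    # e.g. "?*" becomes "?42?" when the escape is "?".
    $edi =~ s/\Q$escape\E(.)/sprintf '%s%d%s', $escape, ord $1, $escape/esg;

    # No escaped delimiter is left in the string, so a plain split is safe.
    my @fields = split /\Q$delim\E/, $edi;

    # Turn each placeholder back into the character it stands for.
    s/\Q$escape\E(\d+)\Q$escape\E/chr $1/eg for @fields;

    return @fields;
}

# The first field ends in an escaped escape character ("??").
print "$_\n" for split_with_placeholders("aster??*isk*x?*y", "*", "?");
```

Here the fields are `aster?`, `isk` and `x*y`: the escaped `?` at the end of the first field no longer hides the delimiter that follows it.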

魔法唧唧 2024-09-23 11:14:03

Here's a custom function -- it's longer than ysth's answer, but in my opinion it's easier to break down into useful pieces (not being all one regex), and it also has the ability to cope with multiple delimiters that you asked for.

sub split_edi {
  my ($in, %args) = @_;
  die q/Usage: split_edi($input, escape => "#", delims => [ ... ]) /
    unless defined $in and defined $args{escape} and defined $args{delims};

  my $escape = quotemeta $args{escape};
  my $delims = join '|', map quotemeta, @{ $args{delims} };

  my ($cur, @ret);

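  while ($in !~ /\G\z/cg) {
    if ($in =~ /\G$escape(.)/scg) {
      # Escaped character (/s lets this be a newline too): keep it literally.
      $cur .= $1;
    } elsif ($in =~ /\G(?:$delims)/cg) {
      # Unescaped delimiter: finish the current field and start a new one.
      push @ret, $cur;
      $cur = '';
    } elsif ($in =~ /\G((?:(?!$delims|$escape).)+)/scg) {
      # Run of ordinary characters up to the next delimiter or escape.
      $cur .= $1;
    } else {
      die "hobbs can't write parsers";
    }
  }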
  push @ret, $cur if defined $cur;
  @ret;
}

The first few lines handle argument parsing, backslash-escape the escape char as necessary, and build a regex fragment that matches any of the delimiters.

Then comes the matching loop:

  • If we find the escape, skip over it and capture the following character as a literal bit of the output instead of treating it specially.
  • If we find any of the delimiters, start a new record.
  • Otherwise, capture characters until the next escape or delimiter.
  • Stop when we reach end-of-string.

This is pretty straightforward and still has pretty solid performance. Like ysth's regex solutions, it's ratcheting: it won't try to backtrack unnecessarily. Correctness isn't guaranteed if the escape or any of the delimiters is multi-character, although I actually think it's pretty much right :)

say for split_edi("foo*bar;baz*aster?*isk", delims => [qw(* ;)], escape => "?");
foo
bar
baz
aster*isk
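
If the delimiters form levels rather than one flat set (the question mentions several of them), one possible arrangement is to split one level at a time and leave the escapes in place until the innermost level. The sketch below is only an illustration under that assumption: `split_raw`, `unescape` and `parse_nested` are hypothetical helper names, the level names `segment`, `element` and `component` and the sample string are invented, and every delimiter is assumed to be a single character.

```perl
use strict;
use warnings;

# split_raw, unescape and parse_nested are hypothetical helper names.

# Split on one single-character delimiter but leave escape sequences
# untouched, so inner levels can still tell escaped delimiters from real ones.
sub split_raw {
    my ($in, $delim, $escape) = @_;
    my ($d, $e) = (quotemeta $delim, quotemeta $escape);
    my @fields = ('');
    while ($in =~ /\G($e.|$d|.)/gcs) {
        my $tok = $1;
        if ($tok eq $delim) { push @fields, '' }      # real delimiter
        else                { $fields[-1] .= $tok }   # escape pair or plain char
    }
    return @fields;
}

# Drop the escape character in front of whatever it protects.
sub unescape {
    my ($s, $escape) = @_;
    $s =~ s/\Q$escape\E(.)/$1/gs;
    return $s;
}

# Returns an array ref of segments, each an array ref of elements,
# each an array ref of unescaped components.
sub parse_nested {
    my ($msg, %d) = @_;
    my @segments;
    for my $segment (split_raw($msg, $d{segment}, $d{escape})) {
        my @elements;
        for my $element (split_raw($segment, $d{element}, $d{escape})) {
            push @elements,
                [ map { unescape($_, $d{escape}) }
                  split_raw($element, $d{component}, $d{escape}) ];
        }
        push @segments, \@elements;
    }
    return \@segments;
}

# Invented sample data: "~" ends a segment, "*" separates elements,
# ":" separates components, "?" is the escape.
my $parsed = parse_nested("foo*bar:baz~aster?*isk*x?:y",
    segment => "~", element => "*", component => ":", escape => "?");
for my $segment (@$parsed) {
    print join(" | ", map { join(" + ", @$_) } @$segment), "\n";
}
```

Unescaping only at the innermost level is the point of the design: if a segment were unescaped before its elements were split, an escaped `*` would turn into a real delimiter one level down.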