How can I use Perl to intersperse characters between consecutive matches with a regex substitution?

Posted 2024-08-09 13:36:37 · 1229 characters · 12 views · 0 comments


The following lines of comma-separated values contain several consecutive empty fields:

$rawData =
    "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n" .
    "2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";

I want to replace these empty fields with 'N/A' values, which is why I decided to do it via a regex substitution.

I tried this first of all:

$rawData =~ s/,([,\n])/,N\/A$1/g; # RELABEL UNAVAILABLE DATA AS 'N/A'

which returned

2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,,N/A,\n

Not what I wanted. The problem occurs when more than two consecutive commas occur. The regex consumes two commas at a time, so when it resumes scanning the string it starts at the third comma rather than the second.

I thought this might have something to do with lookahead vs. lookbehind assertions, so I tried the following regex:

$rawData =~ s/(?<=,)([,\n])|,([,\n])$/,N\/A$1/g; # RELABEL UNAVAILABLE DATA AS 'N/A'

which resulted in:

2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,N/A,,N/A,,N/A,,N/A\n

That didn't work either. It just shifted the comma-pairings by one.

I know that washing this string through the same regex twice will do it, but that seems crude. Surely, there must be a way to get a single regex substitution to do the job. Any suggestions?

The final string should look like this:

2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A\n
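
The skipping described in the question can be made visible by listing where each match starts. The following is only a minimal sketch: `match_offsets` is a hypothetical helper introduced for illustration, and `$tail` is a made-up string standing in for the end of the second row. It compares a pattern that consumes the character after the comma with one that merely looks ahead at it.

```perl
use strict;
use warnings;

# Hypothetical helper: return the start offset of every match of $re in $text.
sub match_offsets {
    my ($re, $text) = @_;
    my @offsets;
    while ($text =~ /$re/g) {
        push @offsets, $-[0];    # $-[0] holds the start offset of the last match
    }
    return @offsets;
}

my $tail = ",,,,\n";    # four consecutive empty fields, as at the end of row two

# Consuming pattern: each match eats the comma AND the character after it.
my @consuming = match_offsets(qr/,[,\n]/, $tail);

# Lookahead pattern: the next character is inspected but not consumed.
my @lookahead = match_offsets(qr/,(?=[,\n])/, $tail);

print "consuming: @consuming\n";
print "lookahead: @lookahead\n";
```

The consuming pattern should report matches at offsets 0 and 2 only, while the lookahead version reports 0, 1, 2 and 3, one per empty field.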


缱绻入梦 2024-08-16 13:36:37

EDIT: Note that you could open a filehandle to the data string and let readline deal with line endings:

#!/usr/bin/perl

use strict; use warnings;
use autodie;

my $str = <<EO_DATA;
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,
EO_DATA

open my $str_h, '<', \$str;

while(my $row = <$str_h>) {
    chomp $row;
    print join(',',
        map { length $_ ? $_ : 'N/A'} split /,/, $row, -1
    ), "\n";
}

Output:

E:\Home> t.pl
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A

You can also use:

1 while $str =~ s{,(,|\n)}{,N/A$1}g;

Explanation: When s///g finds a ,, and replaces it with ,N/A, the match has consumed both commas, so the search resumes at the character after the second comma. A single pass therefore misses some of the consecutive commas:

$str =~ s{,(,|\n)}{,N/A$1}g;

Therefore, I used a loop that repeats the substitution until nothing is left to replace. Adjusting pos $str between passes is unnecessary, because s///g without \G always starts scanning at the beginning of the string.

Now, as @ysth shows:

$str =~ s!,(?=[,\n])!,N/A!g;

would make the while unnecessary.
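
To check that the loop and the lookahead really agree, the two can be run side by side. This is a sketch under stated assumptions: `fill_with_loop` and `fill_with_lookahead` are hypothetical names, `$input` is made-up sample data, and the loop is written as `1 while ...`, which assumes that simply rerunning the substitution from the start of the string is all the loop needs to do.

```perl
use strict;
use warnings;

my $input = "a,,b\nc,,,,\n";    # made-up sample data

# Repeat the consuming substitution until a pass changes nothing.
sub fill_with_loop {
    my ($s) = @_;
    1 while $s =~ s{,(,|\n)}{,N/A$1}g;
    return $s;
}

# Single pass: the lookahead leaves the following character unconsumed.
sub fill_with_lookahead {
    my ($s) = @_;
    $s =~ s{,(?=[,\n])}{,N/A}g;
    return $s;
}

print fill_with_loop($input);
print fill_with_lookahead($input);
```

Both calls should print the same two rows, with every empty field replaced by N/A.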


陌伤ぢ 2024-08-16 13:36:37

I couldn't quite make out what you were trying to do in your lookbehind example, but I suspect you are suffering from a precedence error there: everything after the lookbehind should be enclosed in (?: ... ), so that the lookbehind applies to both sides of the | rather than only to the first alternative.

Starting from scratch, what you are trying to do sounds pretty simple: place N/A after a comma if it is followed by another comma or a newline:

s!,(?=[,\n])!,N/A!g;

Example:

my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";

use Data::Dumper;
$Data::Dumper::Useqq = $Data::Dumper::Terse = 1;
print Dumper($rawData);
$rawData =~ s!,(?=[,\n])!,N/A!g;
print Dumper($rawData);

Output:

"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n"
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A\n"


丶情人眼里出诗心の 2024-08-16 13:36:37

You could search for

(?<=,)(?=,|$)

and replace that with N/A.

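This regex matches the (empty) space between two commas or between a comma and end of line. For a string that holds several lines, use the /m modifier so that $ matches at the end of every line rather than only at the end of the string.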
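
As a rough illustration of this zero-width approach in Perl, the sketch below applies the pattern with and without /m. The variable names and the sample data are made up for the example; only the pattern itself comes from the answer.

```perl
use strict;
use warnings;

my $data = "x,,y,\nz,,\n";    # made-up data; the FIRST line also ends in an empty field

# With /m, $ matches before every newline, so each line's last field is handled.
(my $with_m = $data) =~ s/(?<=,)(?=,|$)/N\/A/mg;

# Without /m, $ only matches at the very end of the string (or before a final newline).
(my $without_m = $data) =~ s/(?<=,)(?=,|$)/N\/A/g;

print $with_m;
print $without_m;
```

With /m both lines are filled completely; without it, the empty field at the end of the first line is left untouched. Note that an empty field at the very start of a line is not covered either way, because the lookbehind requires a preceding comma.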


岛徒 2024-08-16 13:36:37

The quick and dirty hack version:

my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
while ($rawData =~ s/,([,\n])/,N\/A$1/g) {};   # [,\n] also catches an empty last field before the newline
print $rawData;

Not the fastest code, but the shortest. The substitution should succeed at most twice; one further pass finds nothing and ends the loop.


李不 2024-08-16 13:36:37

Not a regex, but not too complicated either:

$string = join ",", map{$_ eq "" ? "N/A" : $_} split (/,/, $string,-1);

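The ,-1 is needed at the end to force split to include any empty fields at the end of the string. Note that this works on one line at a time with the trailing newline removed (for example with chomp); otherwise the last field contains "\n" and is not treated as empty.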
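
The effect of the -1 limit is easy to see on a single row. In this minimal sketch, `fill_empty_fields` is a hypothetical wrapper around the one-liner above and `$line` is a made-up row that has already been chomped.

```perl
use strict;
use warnings;

# Hypothetical helper: fill the empty fields of ONE line (newline already removed).
sub fill_empty_fields {
    my ($line) = @_;
    return join ",", map { $_ eq "" ? "N/A" : $_ } split /,/, $line, -1;
}

my $line = "10.4,,,,";    # one value followed by four empty fields

my @kept    = split /,/, $line, -1;    # negative limit: trailing empty fields are kept
my @dropped = split /,/, $line;        # default: trailing empty fields are discarded

print scalar(@kept), " fields with -1, ", scalar(@dropped), " without\n";
print fill_empty_fields($line), "\n";
```

With the limit, split returns five fields; without it only the first one survives, so the trailing empty fields could never be filled.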
