How can I use Perl to intersperse characters between consecutive matches with a regex substitution?
The following lines of comma-separated values contain several consecutive empty fields:
$rawData =
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n"
I want to replace these empty fields with 'N/A' values, which is why I decided to do it via a regex substitution.
I tried this first of all:
$rawData =~ s/,([,\n])/,N\/A$1/g; # RELABEL UNAVAILABLE DATA AS 'N/A'
which returned
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,,N/A,\n
Not what I wanted. The problem occurs when more than two consecutive commas occur. The regex gobbles up two commas at a time, so it starts at the third comma rather than the second when it rescans the string.
I thought this could have something to do with lookahead vs. lookbehind assertions, so I tried the following regex:
$rawData =~ s/(?<=,)([,\n])|,([,\n])$/,N\/A$1/g; # RELABEL UNAVAILABLE DATA AS 'N/A'
which resulted in:
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,N/A,,N/A,,N/A,,N/A\n
That didn't work either. It just shifted the comma-pairings by one.
I know that washing this string through the same regex twice will do it, but that seems crude. Surely, there must be a way to get a single regex substitution to do the job. Any suggestions?
The final string should look like this:
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A\n
Comments (5)
EDIT: Note that you could open a filehandle to the data string and let readline deal with the line endings.
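A minimal sketch of the in-memory filehandle idea, assuming a hypothetical helper named fill_rows and a per-row lookahead substitution chosen here purely for illustration (it is not necessarily what the answer itself used). Opening a filehandle on a reference to the scalar lets readline split the rows, so the pattern only has to deal with one chomped row at a time.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the rows of $data with empty fields filled in.
sub fill_rows {
    my ($data) = @_;

    # A filehandle opened on a scalar reference reads from the string itself.
    open my $fh, '<', \$data or die "Cannot open in-memory filehandle: $!";

    my @rows;
    while (defined(my $row = readline $fh)) {
        chomp $row;                    # readline has already split on "\n"
        $row =~ s{,(?=,|\z)}{,N/A}g;   # comma followed by a comma or the end of the row
        push @rows, $row;
    }
    close $fh;
    return @rows;
}

my $raw_data = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n"
             . "2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";

print "$_\n" for fill_rows($raw_data);
```

Because each row is chomped first, the end of the row can be tested with \z instead of matching the newline explicitly.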
You can also use a loop that rewinds pos $str after each substitution.
Explanation: When s/// finds a ,, and replaces it with ,N/A, it has already moved to the character after the last comma. So it will miss some consecutive commas if you rely on a single global substitution. Therefore, I used a loop to move pos $str back by one character after each successful substitution.
Now, as @ysth shows, the while can be made unnecessary.
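The idea of stepping pos back by one character after each replacement can be sketched as follows. This is a variation written for illustration, not the answer's own code: it assumes a hypothetical helper fill_by_rewinding, uses a m//g loop with substr to do the insertion, and then sets pos explicitly because modifying the string resets it.

```perl
use strict;
use warnings;

# Hypothetical helper: fills empty fields by rewinding the match position.
sub fill_by_rewinding {
    my ($str) = @_;

    while ($str =~ /,([,\n])/g) {
        my $at = $-[1];               # offset of the delimiter after the comma
        substr($str, $at, 0, 'N/A');  # insert N/A just before that delimiter

        # Resume the search AT the delimiter rather than after it, so a comma
        # that closed one empty field can also open the next one.
        pos($str) = $at + length 'N/A';
    }
    return $str;
}

print fill_by_rewinding("2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n");
```

Without the pos assignment the search would continue past the second comma of each pair, which is exactly the skipping behaviour described in the question.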
I couldn't quite make out what you were trying to do in your lookbehind example, but I suspect you are suffering from a precedence error there, and that everything after the lookbehind should be enclosed in a (?: ... ) so that the | doesn't bypass the lookbehind.
Starting from scratch, what you are trying to do sounds pretty simple: place N/A after a comma if it is followed by another comma or a newline.
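One way to express "a comma followed by another comma or a newline" without consuming the following character is a lookahead. The sketch below assumes a hypothetical helper fill_with_lookahead and is offered as an illustration of that description rather than as the answer's exact code.

```perl
use strict;
use warnings;

# Hypothetical helper: a single global substitution using a lookahead.
sub fill_with_lookahead {
    my ($text) = @_;

    # (?=[,\n]) checks the next character but leaves it unconsumed, so it is
    # still available as the start of the following match.
    $text =~ s{,(?=[,\n])}{,N/A}g;
    return $text;
}

my $raw_data = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n"
             . "2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";

print fill_with_lookahead($raw_data);
```

Since the second comma is never part of the match, no capture group or $1 is needed in the replacement.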
You could search for the (empty) space between two commas, or between a comma and the end of the line, and replace it with N/A.
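A zero-width match of that kind can be written with a lookbehind and a lookahead together. This is a sketch under assumptions: the helper name fill_empty_positions is hypothetical, and the pattern is one plausible rendering of the description, not a quotation of the answer.

```perl
use strict;
use warnings;

# Hypothetical helper: matches the empty position itself, not the commas.
sub fill_empty_positions {
    my ($text) = @_;

    # (?<=,)  the position is preceded by a comma
    # (?=,|$) and followed by a comma or the end of a line (/m makes $ match
    #         before every newline, not just at the end of the string)
    $text =~ s/(?<=,)(?=,|$)/N\/A/mg;
    return $text;
}

print fill_empty_positions("2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n");
```

Note that this pattern does not treat an empty first field (a line that starts with a comma) as empty, because nothing precedes it for the lookbehind to see.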
The quick and dirty hack version loops over the substitution.
Not the fastest code, but the shortest. It should loop at most twice.
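A looping version along those lines might look like the sketch below, assuming a hypothetical helper fill_by_repeating built around the capture-group substitution from the question. The first pass fills every other gap in a run of empty fields and the second pass fills the rest; a final pass finds nothing and ends the loop.

```perl
use strict;
use warnings;

# Hypothetical helper: repeats the substitution until it stops matching.
sub fill_by_repeating {
    my ($text) = @_;

    # s///g returns the number of replacements, which is false once no
    # empty field is left, so the loop then terminates.
    1 while $text =~ s/,([,\n])/,N\/A$1/g;
    return $text;
}

print fill_by_repeating("2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n");
```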
Not a regex, but not too complicated either: use split.
The ,-1 is needed at the end to force split to include any empty fields at the end of the string.
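A split-based version could be sketched like this, assuming a hypothetical helper fill_with_split that handles one row at a time. The map and join steps are illustrative choices; only the split with a limit of -1 comes from the answer.

```perl
use strict;
use warnings;

# Hypothetical helper: fills the empty fields of a single row.
sub fill_with_split {
    my ($row) = @_;
    chomp $row;

    # A negative limit keeps trailing empty fields, which split would
    # otherwise silently discard.
    my @fields = split /,/, $row, -1;

    return join ',', map { length($_) ? $_ : 'N/A' } @fields;
}

print fill_with_split("2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n"), "\n";
```

Testing with length rather than plain truthiness keeps a legitimate field value of 0 from being relabelled as N/A.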