How to remove certain newlines from a file
I have a file containing about 70,000 records, structured roughly like this:
01499 1000642 4520101000900000
...more numbers...
104000900169
+Fieldname1
-Content
+Fieldname2
-Content
-Content
-Content
+Fieldname3
-Content
-Content
+Fieldname4
-Content
+Fieldname5
-Content
-Content
-Content
-Content
-Content
-Content

01473 1000642 4520101000900000
...more numbers...
Every record thus starts with a column of numbers and ends with a blank line. Before this blank line, most records have a +Fieldname5 line followed by one or more -Content lines.
What I would like to do is merge every multi-line entry into a single line, replacing the leading minus character of each continuation line with a space, except for the lines belonging to the last field (i.e., Fieldname5 in this case).
It should look like this:
01499 1000642 4520101000900000
...more numbers...
104000900169
+Fieldname1
-Content
+Fieldname2
-Content Content Content
+Fieldname3
-Content Content
+Fieldname4
-Content
+Fieldname5
-Content
-Content
-Content
-Content
-Content
-Content

01473 1000642 4520101000900000
...more numbers...
What I have now is this (adapted from this answer):
use strict;
use warnings;
our $input = "export.txt";
our $output = "export2.txt";
open our $in, "<$input" or die "$!\n";
open our $out, ">$output" or die "$!\n";
my $this_line = "";
my $new = "";
while(<$in>) {
    my $last_line = $this_line;
    $this_line = $_;
    # If both $last_line and $this_line start with a "-" do the following:
    if ($last_line =~ /^-.+/ && $this_line =~ /^-.+/) {
        # Remove \n from $last_line
        chomp $last_line;
        # Remove leading "-" from $this_line
        $this_line =~ s/^-//;
        # Join both lines and print them to the file
        $new = join(' ', $last_line, $this_line);
        print $out $new;
    } else {
        print $out $last_line;
    }
}
close ($in);
close ($out);
But there are two problems with this:
- It correctly prints the joined line, but then still prints the second line as well, e.g.:
+Fieldname2
-Content Content
Content
-Content
So how can I make the script only output the joined line?
- It only works on two lines at a time, while some of the multi-line entries have up to forty lines.
How can I do the following?
- Read in a file line by line and write it to an output file
- When a multi-line section appears, read and process it in one go, replacing \n- with a space, unless it belongs to a given field name (e.g., Fieldname5)
- Return to reading and writing line by line until another multi-line block appears
Update: the code from the answer worked! I just added another conditional at the beginning of the loop to handle the Fieldname5 exception:
use strict;
use warnings;
our $input  = "export.txt";
our $output = "export2.txt";
open our $in,  "<$input"  or die "Cannot open '$input': $!\n";
open our $out, ">$output" or die "Cannot create '$output': $!\n";
my $insideMultiline = 0;
my $multilineBuffer = "";
my $exception = 0;    # True while inside the block that must stay unmerged
LINE:
while (<$in>) {
    if (/^\+Fieldname5/) {    # A +Fieldname5 line starts the exception block
        $exception = 1;
    }
    elsif (/^\s/) {           # A line starting with whitespace (including an
                              # empty line) ends the exception block
        $exception = 0;
    }
    if ($exception == 0 && /^-/) {    # Mergeable "-" line
        chomp;
        if ($insideMultiline) {
            s/^-/ /;                  # Continuation: replace "-" with a space
            $multilineBuffer .= $_;
        }
        else {
            $insideMultiline = 1;     # First line of the block keeps its "-"
            $multilineBuffer = $_;
        }
        next LINE;
    }
    else {
        if ($insideMultiline) {       # Block ended: write the merged line
            print $out "$multilineBuffer\n";
            $insideMultiline = 0;
            $multilineBuffer = "";
        }
        print $out $_;
    }
}
# Flush a block that is still buffered when the input ends
print $out "$multilineBuffer\n" if $insideMultiline;
close ($in);
close ($out);
Comments (1)
Assuming the only lines which begin with "-" are these multi-line sections, you could do this...
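A minimal sketch of that buffering idea, not necessarily the answerer's exact code. The helper name merge_dash_lines is hypothetical; the variable names $insideMultiline and $multilineBuffer are borrowed from the asker's follow-up script. It assumes every run of consecutive lines starting with "-" should be joined, with no special case for the last field:

```perl
use strict;
use warnings;

# Hypothetical helper: joins each run of consecutive "-" lines into one line.
# Takes a list of lines (with line endings) and returns the merged lines.
sub merge_dash_lines {
    my @lines = @_;
    my @out;
    my $insideMultiline = 0;
    my $multilineBuffer = "";
    for my $line (@lines) {
        if ($line =~ /^-/) {
            (my $text = $line) =~ s/\R\z//;    # drop the line ending
            if ($insideMultiline) {
                $text =~ s/^-/ /;              # continuation: "-" becomes a space
                $multilineBuffer .= $text;
            }
            else {
                $insideMultiline = 1;          # first line of a block keeps its "-"
                $multilineBuffer = $text;
            }
            next;
        }
        if ($insideMultiline) {                # block ended: flush the buffer
            push @out, "$multilineBuffer\n";
            $insideMultiline = 0;
            $multilineBuffer = "";
        }
        push @out, $line;
    }
    # A block may still be buffered when the input ends
    push @out, "$multilineBuffer\n" if $insideMultiline;
    return @out;
}

# Usage on a few lines shaped like the sample in the question
print merge_dash_lines(
    "+Fieldname2\n", "-Content\n", "-Content\n", "-Content\n",
    "+Fieldname3\n", "-Content\n",
);
```

Buffering the whole block before printing avoids both problems from the question: nothing is written until the block is complete, and the block can be any number of lines long.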
As to the embedded subquestion ("except those pertaining to the last field"), I'd need more detail on the file format to be able to do that. It looks like a blank line separates the sets of fields and contents from one another, but that's not 100% clear in the description. The code above should handle the requirements you laid out at the bottom, though.
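If records really are separated by blank lines, one possible way to handle the sub-question is to read the file one record at a time using Perl's paragraph mode ($/ = "") and skip the merge for the field that must stay untouched. This is only a sketch under that assumption: the helper name merge_record and the in-memory sample data are made up for illustration, and the field name Fieldname5 is taken from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: merges "-" continuation lines within one record,
# but leaves the lines that follow "+$keep" untouched.
sub merge_record {
    my ($record, $keep) = @_;
    my @out;
    my $field    = "";   # name of the most recent "+Field" line
    my $in_block = 0;    # true while the last output line is a mergeable "-" line
    for my $line (split /\n/, $record) {
        if ($line =~ /^\+(.*)/) {
            $field    = $1;
            $in_block = 0;
            push @out, $line;
        }
        elsif ($line =~ /^-/ && $field ne $keep) {
            if ($in_block) {
                $out[-1] .= " " . substr($line, 1);   # "-" becomes a space
            }
            else {
                push @out, $line;                     # first line keeps its "-"
                $in_block = 1;
            }
        }
        else {
            $in_block = 0;
            push @out, $line;
        }
    }
    return join("\n", @out) . "\n";
}

# Made-up sample data, read through an in-memory filehandle
my $data = "104000900169\n+Fieldname2\n-Content\n-Content\n"
         . "+Fieldname5\n-Content\n-Content\n\n"
         . "01473 1000642\n+Fieldname1\n-Content\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
{
    local $/ = "";    # paragraph mode: each read returns one blank-line-separated record
    while (my $record = <$fh>) {
        print merge_record($record, "Fieldname5"), "\n";
    }
}
close $fh;
```

Working per record means the exception state resets naturally at each record boundary, so no separate flag has to be cleared when the blank line is reached.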