How to remove certain line breaks in a file

Posted on 2024-11-25 13:36:44

I have a file containing about 70,000 records, structured roughly like this:

01499     1000642   4520101000900000
...more numbers...
104000900169
+Fieldname1
-Content
+Fieldname2
-Content
-Content
-Content
+Fieldname3
-Content
-Content
+Fieldname4
-Content
+Fieldname5
-Content
-Content
-Content
-Content
-Content
-Content

01473     1000642   4520101000900000
...more numbers...

Every record thus starts with a column of numbers and ends with a blank line. Before this blank line, most records have a +Fieldname5 line followed by one or more -Content lines.

What I would like to do is merge each multi-line entry into a single line, replacing the leading minus character of every continuation line with a space, except for the lines belonging to the last field (Fieldname5 in this case).

It should look like this:

01499     1000642   4520101000900000
...more numbers...
104000900169
+Fieldname1
-Content
+Fieldname2
-Content Content Content
+Fieldname3
-Content Content
+Fieldname4
-Content
+Fieldname5
-Content
-Content
-Content
-Content
-Content
-Content

01473     1000642   4520101000900000
...more numbers...

What I have now is this (adapted from this answer):

use strict;
use warnings;

our $input = "export.txt";
our $output = "export2.txt";

open our $in, "<$input" or die "$!\n";
open our $out, ">$output" or die "$!\n";

my $this_line = "";
my $new = "";

while(<$in>) {
    my $last_line = $this_line;
    $this_line = $_;

    # If both $last_line and $this_line start with a "-" do the following:
    if ($last_line =~ /^-.+/ && $this_line =~ /^-.+/) {

        # Remove \n from $last_line
        chomp $last_line;

        # Remove leading "-" from $this_line
        $this_line =~ s/^-//;

        # Join both lines and print them to the file
        $new = join(' ', $last_line, $this_line);
        print $out $new;
    } else {
        print $out $last_line;
    }
}
close ($in);
close ($out);

But there are two problems with this:

  • It correctly prints out the joined line, but then still prints out the second line, e.g.,

    +Fieldname2
    -Content Content
    Content
    -Content

So how can I make the script only output the joined line?

  • It only works on two lines at a time, while some of the multi-line entries have up to forty lines.

How can I do the following?

  1. Read in a file line by line and write it to an output file
  2. When a multi-line section appears, read and process it in one go, replacing each \n- with a space, except if the section belongs to a given field name (e.g., Fieldname5).
  3. Return to reading and writing each line again until another multi-line block appears

Update: the approach from the answer below worked! I just added another conditional at the beginning:

use strict;
use warnings;

my $input  = "export.txt";
my $output = "export2.txt";

open(my $in,  '<', $input)  or die "Kann '$input' nicht finden: $!\n";
open(my $out, '>', $output) or die "Kann '$output' nicht erstellen: $!\n";

my $insideMultiline = 0;
my $multilineBuffer = "";
my $exception = 0;  # True while the current multi-line block belongs
                    # to the field that must be left untouched

LINE:
while (<$in>) {
    if (/^\+Fieldname5/) {  # +Fieldname5 starts the protected block
        $exception = 1;
    }
    elsif (/^\s/) {         # A line starting with whitespace (including
                            # the blank separator line) ends it
        $exception = 0;
    }
    if ($exception == 0 && /^-/) {  # Mergeable "-" line
        chomp;
        if ($insideMultiline) {
            s/^-/ /;
            $multilineBuffer .= $_;
        }
        else {
            $insideMultiline = 1;
            $multilineBuffer = $_;
        }
        next LINE;
    }
    else {
        if ($insideMultiline) {
            print $out "$multilineBuffer\n";
            $insideMultiline = 0;
            $multilineBuffer = "";
        }
        print $out $_;
    }
}

# Flush a merged block that runs up to the end of the file
print $out "$multilineBuffer\n" if $insideMultiline;

close($in);
close($out);
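
As an aside, step 2 of the list above ("replacing \n- by a space") can also be read literally as a single substitution over the whole file contents. The following is only a minimal sketch under that reading: `join_dash_runs` is a hypothetical helper name, it assumes the file is small enough to hold in memory as one string, and it deliberately ignores the Fieldname5 exception.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): joins every run of
# consecutive lines starting with "-" into one line. It does NOT handle
# the Fieldname5 exception.
sub join_dash_runs {
    my ($text) = @_;
    # With /m, ^ matches at each line start; the pattern captures a "-"
    # line plus at least one directly following "-" line.
    $text =~ s{^(-.*(?:\n-.*)+)}{
        my $run = $1;
        $run =~ s/\n-/ /g;
        $run;
    }gme;
    return $text;
}
```

A run of one "-" line is left alone because the pattern requires at least one following "-" line. The line-by-line version above remains the better fit when the last field has to be protected.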


Comments (1)

遗失的美好 2024-12-02 13:36:44

Assuming the only lines which begin with "-" are these multi-line sections, you could do this...

# Open $in and $out as in your original code...

my $insideMultiline = 0;
my $multilineBuffer = "";

LINE:
while (<$in>) {
    if (/^-/) {
        chomp;
        if ($insideMultiline) {
            s/^-/ /;
            $multilineBuffer .= $_;
        }
        else {
            $insideMultiline = 1;
            $multilineBuffer = $_;
        }
        next LINE;
    }
    else {
        if ($insideMultiline) {
            print $out "$multilineBuffer\n";
            $insideMultiline = 0;
            $multilineBuffer = "";
        }
        print $out $_;
    }
}

# Flush a block that runs up to the end of the file
print $out "$multilineBuffer\n" if $insideMultiline;

As to the embedded subquestion ("except those pertaining to the last field"), I'd need more detail on the file format to be able to do that. It looks like a blank line separates the sets of fields and contents from one another, but that's not 100% clear in the description. The code above should handle the requirements you laid out at the bottom, though.
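
If records really are separated by blank lines, as the sample suggests, the subquestion could be handled one record at a time. The sketch below rests on that assumption; `merge_record` and its `$keep_field` parameter are hypothetical names introduced for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: merges consecutive "-" lines inside one record,
# but leaves the lines after "+$keep_field" untouched.
sub merge_record {
    my ($record, $keep_field) = @_;
    my @out;
    my $protected = 0;  # true while inside the field to keep as-is
    my $prev_dash = 0;  # true if the last emitted line starts with "-"
    for my $line (split /\n/, $record) {
        if ($line =~ /^\+/) {
            # A "+" line opens a new field; check whether it is protected
            $protected = ($line =~ /^\+\Q$keep_field\E/) ? 1 : 0;
        }
        if (!$protected && $prev_dash && $line =~ /^-(.*)/) {
            $out[-1] .= " $1";  # append the content without its "-"
            next;
        }
        $prev_dash = ($line =~ /^-/) ? 1 : 0;
        push @out, $line;
    }
    return join("\n", @out) . "\n";
}
```

Setting `$/` to the empty string puts Perl's readline into paragraph mode, so each read returns one blank-line-terminated record that can be passed to the helper. The caller then prints the result followed by an extra "\n" to restore the separator line.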
