Strange behavior of "chomp" when processing a file line by line in Perl

Posted on 2024-12-10 19:06:51

I am using the following Perl script to do some simple processing:

use strict;
my $file = "data-text";
open(FILE, "<$file") or die "Can't open $file: $!\n";
my @lines = <FILE>;
close FILE;
my @arrayA = (); my @arrayB=();
my $i = 0;
while($i < @lines) {
    print $lines[$i], "\t", $lines[$i+1], "\n";
    chomp($lines[$i]); chomp($lines[$i+1]); #The problem is here...
    push @arrayA, \$lines[$i];
    push @arrayB, \$lines[$i+1];
    print $lines[$i], "\t", $lines[$i+1], "\n";
    $i+=2;
}

As I indicated in the script, the problem is at the line chomp($lines[$i]); chomp($lines[$i+1]);. If I include this line, the printed lines come out garbled.

What is wrong? Why is this?

怪异←思 2024-12-17 19:06:51

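By default, chomp deletes a single trailing \n from the end of a string (more precisely, it removes the current value of $/, the input record separator, which defaults to "\n").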

If the string ends with \r\n (the Windows-style line ending), chomp will leave the \r in place. This would likely result in symptoms similar to what you're seeing.
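To make this concrete, here is a minimal sketch. It assumes a Unix-like system where no CRLF translation happens on input, and the sample string is made up for illustration:

```perl
use strict;
use warnings;

# A line as it would arrive from a Windows-format file read on a Unix-like system.
my $line = "first line\r\n";

chomp $line;    # removes only the trailing "\n" (the default value of $/)

# The carriage return is still there, so the string is one character longer
# than the visible text.
printf "length after chomp: %d\n", length $line;
print  "still ends with CR\n" if $line =~ /\r\z/;
```

When such a string is printed followed by a tab and more text, the terminal returns to the start of the line at the \r, so later text can overwrite earlier text.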

EDIT:

Some background. Unix-like systems (including Linux) use a single line-feed character ('\n') to mark the end of each line in a text file. Windows (and its predecessor MS-DOS) uses two characters, a carriage return and a line feed (\r\n).

Many of Perl's features are designed to work with text. So, quite reasonably, Perl assumes by default that any text file it's reading uses the native end-of-line representation of the underlying operating system.

A feature Perl inherited from C is that, when reading a line of text, the native end-of-line sequence, whatever it is, is translated to a single '\n' character. (The reverse translation is done on output.) This frees most programs from having to worry about how text is represented; it's translated to and from a canonical internal form on input and output. (That form happens to match the Unix format, for historical reasons.)

But that doesn't help much if you need to deal with non-native text files. If you're running in a Unix-like environment but reading Windows-format text files, the \r characters are going to look like part of the line. In particular, chomp won't do anything special with them. And when you print a \r character, it typically causes the cursor to move to the beginning of the current line without advancing to the next line, so later output overwrites earlier output. It's a mess. (Cygwin is a rich source of such confusion; it's a Unix-like environment, using Unix-style text files by default, but it runs under Windows with full visibility into the Windows file system. Are you using Cygwin?)

See @BillRupert's comment; he's running under Windows with a Windows native implementation of Perl, so he doesn't see the problem you're having.

If you want to deal with non-native text files, you'll need to do a little extra work. For example, when reading a line of text, rather than just

chomp $line;

you might write:

chomp $line;
$line =~ s/\r$//;
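The two steps can also be folded into a single substitution. The following is only a sketch; strip_eol is a hypothetical helper name, not something from the question or from a library:

```perl
use strict;
use warnings;

# Hypothetical helper: strip one trailing LF or CRLF, whichever is present.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;    # optional CR, then LF, anchored at the very end
    return $line;
}

my @cleaned = map { strip_eol($_) } ("unix\n", "windows\r\n", "no newline");
print join("|", @cleaned), "\n";
```

Unlike the chomp-then-substitute pair, this version leaves a lone trailing \r untouched when there is no \n after it, which may or may not be what you want for a final line that lacks a newline.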

And when writing text, you can do this:

$line =~ s/$/\r/;

But first you'll need to decide whether you want to write the output with Windows-style or Unix-style line endings. It's tricky.

(There's probably a Perl module that makes this easier; anyone who knows of one, please mention it in a comment.)
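One option that needs no extra module is Perl's built-in :crlf I/O layer, which converts CRLF pairs to "\n" on input. The sketch below assumes a made-up two-line sample written to a temporary file; the file contents and variable names are illustrative only:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small CRLF-terminated sample file (hypothetical data).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;                       # write the bytes exactly as given
print {$out} "alpha\r\nbeta\r\n";
close $out or die "Can't close $path: $!\n";

# Read it back through the :crlf layer, which maps CRLF to "\n" on input.
open(my $in, '<:crlf', $path) or die "Can't open $path: $!\n";
my @lines = <$in>;
close $in;

chomp @lines;                       # a plain chomp is now enough
print join("|", @lines), "\n";
```

The same layer on an output handle ('>:crlf') writes each "\n" as CRLF, so the decision about output line endings can be made once, at open time.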

Incidentally, the output you're seeing isn't the output your program is producing. If you filter your output through something that shows non-printable characters in printable form, you'll see \r or ^M in your output. Use ... | cat -A or ... | cat -v if your system has the cat command.
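If cat is not available, a similar check can be done from inside the script. This is a rough sketch; show_hidden is a hypothetical helper, and it only handles CR, LF and TAB:

```perl
use strict;
use warnings;

# Hypothetical helper: render CR, LF and TAB as visible escape sequences.
sub show_hidden {
    my ($text) = @_;
    my %name = ("\r" => '\r', "\n" => '\n', "\t" => '\t');
    $text =~ s/([\r\n\t])/$name{$1}/g;
    return $text;
}

my $shown = show_hidden("one\r\ttwo\r\n");
print $shown, "\n";
```

Printing each line through such a helper before and after chomp shows directly whether a \r is being left behind.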

If possible, you might consider translating your input before trying to read it.

绿光 2024-12-17 19:06:51

Since I don't have your data file I cannot tell for sure, but first of all, let's switch to the three-argument open with a lexical filehandle, enable warnings, and perhaps just chomp the whole array:

use strict;
use warnings;

## If line endings are the problem, try for example:
#local $/ = "\r\n";

my $file="data-text";

my @lines;
{
    open(my $fh, "<", $file) or die "Can't open $file: $!\n";
    @lines = <$fh>;
}

chomp @lines;

my @arrayA;
my @arrayB;
my $i = 0;
while ($i < @lines) {
    print $lines[$i],"\t",$lines[$i+1],"\n";
    push @arrayA, \$lines[$i];
    push @arrayB, \$lines[$i+1];

    ## The following line is now no different from the above, commented out
    #print $lines[$i],"\t",$lines[$i+1],"\n";
    $i+=2;
}

See if that behaves more like what you expect. If you show us (a portion of) the file, maybe we can spot something more.
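As for the commented-out local $/ line in the script: chomp removes whatever $/ currently holds, so setting it to "\r\n" makes both reading and chomping CRLF-aware. A minimal sketch, assuming a Unix-like system and a made-up in-memory sample instead of a real file:

```perl
use strict;
use warnings;

my $data = "a1\r\nb1\r\na2\r\nb2\r\n";   # hypothetical CRLF input

my @lines;
{
    local $/ = "\r\n";                   # the record separator is now CRLF
    open(my $fh, '<', \$data) or die "Can't open in-memory file: $!\n";
    @lines = <$fh>;
    chomp @lines;                        # removes the current $/, i.e. "\r\n"
}
print join("|", @lines), "\n";
```

Note that chomp has to run while $/ is still "\r\n"; once the block ends and $/ reverts to "\n", chomp would no longer remove the carriage return.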

Also, if all you are trying to do is split alternating lines into two arrays, you might do:

while (@lines) {
    my $line1 = shift @lines;
    my $line2 = shift(@lines) // '';   # defined-or (Perl 5.10+), so a line that is just "0" is kept
    print $line1,"\t",$line2,"\n";
    push @arrayA, $line1;
    push @arrayB, $line2;
}

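This uses less memory, because @lines is emptied as it is consumed. Note that the two arrays then hold the strings themselves rather than references to them.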
