How can I read, analyze, and then "unread" and re-read the beginning of an input stream in Perl?

Posted 2024-09-29 21:37:00

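I am reading and processing an input stream from Perl's ARGV filehandle (i.e. the while(<>) construct), which may turn out to be STDIN rather than a regular file. However, I need to analyze a sizable portion of the input to detect which of four different but extremely similar formats it is encoded in (different ASCII encodings of FASTQ quality scores). Once I have decided which format the data is in, I need to go back and parse those lines a second time to actually read the data.

So I need to read the first 500 or so lines of the stream twice. Or, to look at it another way, I need to read the first 500 lines and then "put them back" so that I can read them again. Since I may be reading from STDIN, I can't just seek back to the beginning. And the files are huge, so I can't read everything into memory (although reading the first 500 lines into memory is fine). What is the best way to do this?

Alternatively, can I somehow duplicate the input stream?

Edit: Wait a minute. I just realized that I can no longer process the input as one big stream, because I have to detect the format of each file independently. So I can't use ARGV. The rest of the question still stands, though.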

忆沫 2024-10-06 21:37:00

As you said, if the filehandle might be STDIN, you can't use seek to rewind it. But it's still pretty simple. I wouldn't bother with a module:

my @lines;

while (<$file>) {
  push @lines, $_;
  last if @lines == 500;
}

... # examine @lines to determine format

while (defined( $_ = @lines ? shift @lines : <$file> )) {
  ... # process line
}

Remember that you need an explicit defined in this case, because the special case that adds an implicit defined to some while loops doesn't apply to this more complex expression.
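To see why the explicit defined matters, here is a minimal sketch that is not part of the answer itself. It assumes an in-memory input whose last line is the string "0" with no trailing newline; the names count_lines and $data are made up for illustration.

```perl
use strict;
use warnings;

# Assumed sample input: the final line is "0" and has no trailing newline.
my $data = "a\nb\n0";

# Count the lines seen by the combined "buffer first, then handle" loop.
# $use_defined selects whether the loop condition is wrapped in defined().
sub count_lines {
    my ($use_defined) = @_;
    open my $fh, '<', \$data or die "cannot open in-memory file: $!";
    my @lines = scalar <$fh>;    # pre-read one line, as the detection pass would
    my $count = 0;
    if ($use_defined) {
        while (defined(my $line = @lines ? shift @lines : <$fh>)) { $count++ }
    }
    else {
        # No implicit defined() is added here, so the false string "0"
        # ends the loop before end of file.
        while (my $line = @lines ? shift @lines : <$fh>) { $count++ }
    }
    return $count;
}

print count_lines(1), "\n";    # prints 3
print count_lines(0), "\n";    # prints 2
```

The second call stops one line early because "0" is false in boolean context, which is exactly the case the explicit defined guards against.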

饭团 2024-10-06 21:37:00

There is a CPAN module that provides an unread method for the IO::Handle class. However, its warnings make one somewhat cautious. I would evaluate its suitability carefully.

If you really only need to save away 500 lines, each reasonably short, that module might suffice; its example does use STDIN.

However, I'm nervous about magic ARGV. If your <> operator causes several distinct files to be opened and read, then I don't know that you're going to be able to back up to a different file than the one currently open.

So you might end up just writing the pushback logic yourself. Either that, or imposing some sort of restriction on ARGV processing related to multiple input files and/or the nature of STDIN.
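One way hand-written pushback logic could look is sketched below. This is only an illustration under stated assumptions: peek_lines is a hypothetical helper, not something from a CPAN module, and the demo reads from an in-memory handle with made-up data. A real program would open each file named in @ARGV itself (or use \*STDIN) and call the helper once per handle, so that each file's format is detected independently.

```perl
use strict;
use warnings;

# Hypothetical helper: read up to $limit lines ahead, then return them along
# with an iterator that replays those lines before continuing from the handle.
sub peek_lines {
    my ($fh, $limit) = @_;
    my @buffer;
    while (@buffer < $limit) {
        my $line = <$fh>;
        last unless defined $line;    # input shorter than $limit
        push @buffer, $line;
    }
    my @peeked = @buffer;    # copy for inspection; the iterator consumes @buffer
    my $next = sub { return @buffer ? shift @buffer : scalar <$fh> };
    return (\@peeked, $next);
}

# Demo on assumed sample data held in memory.
my $data = join '', map { "line $_\n" } 1 .. 5;
open my $fh, '<', \$data or die "cannot open in-memory file: $!";

my ($peeked, $next) = peek_lines($fh, 3);
print "peeked: ", scalar(@$peeked), "\n";    # prints "peeked: 3"

# Examine @$peeked here to decide on the format, then read everything once more.
my $total = 0;
while (defined(my $line = $next->())) {
    $total++;
}
print "processed: $total\n";                 # prints "processed: 5"
```

Because the buffer lives in a closure per handle, nothing needs to be "put back" into the handle, which keeps the approach usable on STDIN and pipes.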

Most of my programs with magic ARGV processing have guards at their start that read something like:

if (@ARGV == 0 && -t STDIN) {
    # select one or the other of the next two lines:

    # opt 1: emit warning 
    warn "$0: reading stdin from /dev/tty\n";

    # opt 2: populate @ARGV
    @ARGV = grep { -f && -T } <*>;  # glob plain textfiles

}

In the second case above, where it defaults to all the plain textfiles in the current directory, one should also decide what to do if grep produces the empty list.

For some programs that expect or at least admit directory arguments, I'll occasionally initialize an empty @ARGV to "." instead, so that the program defaults to the process's current working directory.
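A minimal sketch of that last default follows. The directory listing loop is only an assumed stand-in for whatever the program actually does with its directory arguments.

```perl
use strict;
use warnings;

# Default to the current working directory when no arguments are given.
@ARGV = ('.') unless @ARGV;

for my $dir (@ARGV) {
    opendir my $dh, $dir or die "$0: cannot open directory $dir: $!\n";
    # Skip the "." and ".." entries that readdir always returns.
    my @entries = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;
    printf "%s: %d entries\n", $dir, scalar @entries;
}
```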
