How can I read, analyze, then "unread" and reread the beginning of an input stream in Perl?
I'm reading and processing a stream of input from a regular filehandle in Perl, which may be STDIN (I originally used the ARGV filehandle, i.e. the `while(<>)` construct). However, I need to analyze a significant portion of the input in order to detect which of four different but extremely similar formats it is encoded in (different ASCII encodings of FASTQ quality scores). Once I've decided which format the data is in, I need to go back and parse those lines a second time to actually read the data.
So I need to read the first 500 lines or so of the stream twice. Or, to look at it another way, I need to read the first 500 lines, and then "put them back" so I can read them again. Since I may be reading from STDIN, I can't just seek back to the beginning. And the files are huge, so I can't just read everything into memory (although reading those first 500 lines into memory is ok). What's the best way to do this?
Alternatively, can I duplicate the input stream somehow?
Edit: Wait a minute. I just realized that I can't process the input as one big stream anymore, because I have to detect each file's format independently. So I can't use ARGV. The rest of the question still stands, though.
Comments (2)
As you said, if the filehandle might be STDIN, you can't use `seek` to rewind it. But it's still pretty simple. I wouldn't bother with a module: keep the lines you have already read in an array, and take lines from that array before reading anything more from the filehandle.
Remember that the `while` loop that does this needs an explicit `defined`, because the special case that adds an implicit `defined` to some `while` loops doesn't apply to this more complex expression.
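
A minimal sketch of that buffering approach follows. The helper names `peek_lines` and `next_line` are hypothetical, the in-memory filehandle stands in for a real file or STDIN, and the sample size of 3 stands in for the 500 lines mentioned in the question; none of this is taken verbatim from the answer.

```perl
use strict;
use warnings;

# Read up to $max lines from $fh and return them as an array reference.
# The count is checked right after each push, so no extra line is consumed.
sub peek_lines {
    my ($fh, $max) = @_;
    my @lines;
    while (defined(my $line = <$fh>)) {
        push @lines, $line;
        last if @lines >= $max;
    }
    return \@lines;
}

# Return the next line: buffered lines first, then the rest of the handle.
# Returns undef once both the buffer and the handle are exhausted.
sub next_line {
    my ($buffer, $fh) = @_;
    return @$buffer ? shift @$buffer : scalar <$fh>;
}

# Demonstration on an in-memory handle (stands in for a file or STDIN).
my $data = join '', map { "line $_\n" } 1 .. 5;
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

my $buffer  = peek_lines($fh, 3);   # first pass: sample for format detection
my $sampled = scalar @$buffer;

my @seen;
# Explicit defined(): a final line such as "0" is false but still valid data.
while (defined(my $line = next_line($buffer, $fh))) {
    push @seen, $line;
}
print "sampled $sampled lines, then processed ", scalar(@seen), " lines\n";
```

Format detection would run on `@$buffer` between the two steps. Because the buffer is drained before the handle is read again, every line is processed exactly once in its original order.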

There is a CPAN module that provides an `unread` method for the `IO::Handle` class. However, its warnings make one somewhat cautious, and I would evaluate its suitability carefully. If you really only need to save away 500 lines, each reasonably short, that module might suffice; its example does use `STDIN`.
However, I'm nervous about magic ARGV. If your `<>` operator causes several distinct files to be opened and read, then I don't know that you're going to be able to back up to a different file than the one currently open.
So you might end up just writing the pushback logic yourself. Either that, or imposing some sort of restriction on ARGV processing related to multiple input files and/or the nature of `STDIN`.
Most of my programs with magic ARGV processing have guards at their start that read something like:
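
As an illustration only, one such guard might look like the sketch below. It assumes that piped or redirected input should be read from STDIN (by putting `-` into `@ARGV`) and that an interactive run should default to the plain text files in the current directory. The helper name `default_args` and the exact conditions are assumptions, not the answerer's actual code.

```perl
use strict;
use warnings;

# Hypothetical helper: choose what the magic <> operator should read
# when the program was started without arguments.
#   $stdin_is_tty  true when STDIN is a terminal (normally -t STDIN)
#   $dir           directory scanned for default input files
sub default_args {
    my ($stdin_is_tty, $dir) = @_;

    # First case: input is piped or redirected, so read STDIN ("-").
    return ('-') unless $stdin_is_tty;

    # Second case: interactive run, so default to plain text files in $dir.
    opendir my $dh, $dir or die "$0: cannot open directory $dir: $!\n";
    my @files = grep { -f $_ && -T $_ } map { "$dir/$_" } readdir $dh;
    closedir $dh;
    @files = sort @files;
    return @files;    # may be empty; the caller decides what to do then
}

# Guard at program start: it only applies when no arguments were given.
unless (@ARGV) {
    my $tty = (-t STDIN) ? 1 : 0;
    @ARGV = default_args($tty, '.');
    # This sketch only warns; a real program might print usage and exit.
    warn "$0: no plain text files found in the current directory\n"
        unless @ARGV;
}
```

Keeping the decision in a function that takes the terminal flag and directory as parameters is a design choice made for this sketch so that both cases can be exercised without a real terminal.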
In the second case above, where it defaults to all the plain text files in the current directory, one should also decide what to do if `grep` produces the empty list.
For some programs that expect or at least admit directory arguments, I'll occasionally initialize an empty `@ARGV` to `"."` instead, so that the program defaults to the process's current working directory.