I need to stream-process, using Perl, a 1 GB text file encoded in UTF-16 little-endian with Unix-style line endings (i.e. only 0x000A, with no 0x000D in the stream) and an LE BOM at the beginning. The file is processed on Windows (a Unix solution is needed as well). By stream-process I mean line-by-line reading and writing using while (<>).
It would be nice to have a command-line one-liner like:
perl -pe "BEGIN { SOME_PREPARATION }; s/SRC/DST/g;" infile.txt > outfile.txt
Hex dump of the test input (two lines, containing the letters "a" and "b" respectively):
FF FE 61 00 0A 00 62 00 0A 00
Processing like s/b/c/g should give this output ("b" replaced with "c"):
FF FE 61 00 0A 00 63 00 0A 00
PS. In all my attempts so far, either the output gets CRLF line endings (the bytes 0D 0A are written, producing an invalid Unicode character, whereas I need only 0A 00 with no 0D 00, to preserve the Unix style), or the byte order flips between LE and BE on every line, i.e. the same "a" comes out as 61 00 on odd lines and 00 61 on even lines.
The best I've come up with is this:
But note that I had to use <infile.txt instead of infile.txt so that the file would be on STDIN. Theoretically, the open pragma should control the encoding used by the magic ARGV filehandle, but I can't get it to work correctly in this case.

The difference between <infile.txt and infile.txt is in how and when the files are opened. With <infile.txt, the file is connected to standard input and opened before Perl begins running. When you binmode STDIN in a BEGIN block, the file is already open, and you can change the encoding.

When you use infile.txt, the filename is passed as a command-line argument and placed in the @ARGV array. When the BEGIN block executes, the file is not open yet, so you can't set its encoding. Theoretically, you ought to be able to declare the encoding with the open pragma and have the magic <ARGV> processing apply it. But I haven't been able to get that to work right in this case.