Stream-processing a UTF-16 file with a BOM and Unix line endings in Perl on Windows

Posted 2025-01-08 20:13:45

I need to stream-process, in Perl, a 1 GB text file encoded as UTF-16 little-endian with Unix-style line endings (only 0x000A in the stream, never 0x000D) and a little-endian BOM at the start. The file is processed on Windows, but a Unix solution is needed as well. By stream-processing I mean reading and writing line by line with while (<>).
A command-line one-liner would be ideal, something like:
perl -pe "BEGIN { SOME_PREPARATION }; s/SRC/DST/g;" infile.txt > outfile.txt

Hex dump of a test input (two lines, containing the letters "a" and "b" respectively):
FF FE 61 00 0A 00 62 00 0A 00

Processing it with s/b/c/g should produce this output ("b" replaced with "c"):
FF FE 61 00 0A 00 63 00 0A 00
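
To experiment with this, the test input can be created byte for byte with a short script. This is only a sketch: the file name infile.txt is borrowed from the one-liner above, and the hex round trip at the end is just a sanity check, not part of the original question.

```perl
use strict;
use warnings;

# File name borrowed from the one-liner in the question.
my $file = 'infile.txt';

# FF FE (little-endian BOM), then "a\n" and "b\n" in UTF-16LE.
my $bytes = pack 'H*', 'fffe61000a0062000a00';

# ':raw' stops Windows from turning 0x0A into 0x0D 0x0A on output.
open my $out, '>:raw', $file or die "Cannot write $file: $!";
print {$out} $bytes;
close $out or die "Cannot close $file: $!";

# Read the file back and print it as space-separated hex bytes.
open my $in, '<:raw', $file or die "Cannot read $file: $!";
my $data = do { local $/; <$in> };
close $in;

my $hex = uc(join ' ', unpack '(H2)*', $data);
print "$hex\n";
```

The printed line should match the hex dump of the input given above.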

PS. In every attempt so far, one of two things goes wrong. Either the output gets CRLF line endings (the bytes 0D 0A are written, which yields an invalid Unicode character; I need 0A 00 alone, without 0D 00, to keep the Unix style), or the byte order flips between LE and BE on every new line, so the same "a" comes out as 61 00 on odd lines and 00 61 on even lines.
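
A likely explanation for the 0D 0A symptom, which the post itself does not state, is layer ordering. On Windows the :crlf layer is part of the default layer stack, and an encoding layer pushed on top of it hands :crlf bytes that are already UTF-16, so each 0A byte gets a single 0D byte inserted before it. The sketch below forces :crlf explicitly so the effect can be seen on any platform; bytes_written is a hypothetical helper introduced only for this illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write "a\n" through the given output layers
# and return the resulting file content as space-separated hex bytes.
sub bytes_written {
    my ($layers) = @_;
    my ($tmp_fh, $path) = tempfile(UNLINK => 1);
    close $tmp_fh;

    open my $out, ">$layers", $path or die "Cannot write $path: $!";
    print {$out} "a\n";
    close $out or die "Cannot close $path: $!";

    open my $in, '<:raw', $path or die "Cannot read $path: $!";
    my $bytes = do { local $/; <$in> };
    close $in;
    return uc(join ' ', unpack '(H2)*', $bytes);
}

# :crlf sits below the encoding layer, so it sees encoded bytes.
my $broken = bytes_written(':crlf:encoding(UTF-16LE)');

# :raw removes CRLF translation before the encoding layer is added.
my $clean = bytes_written(':raw:encoding(UTF-16LE)');

print "with :crlf: $broken\n";
print "with :raw:  $clean\n";
```

The stray 0D byte makes the line an odd number of bytes long, which shifts every following 16-bit unit by one byte. That may also account for the apparent LE/BE flipping on alternate lines.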

Comments (1)

唔猫 2025-01-15 20:13:45

The best I've come up with is this:

perl -pe "BEGIN { binmode $_, ':raw:encoding(UTF-16LE)' for *STDIN, *STDOUT }; s/b/c/g;" <infile.txt >outfile.txt

Note that I had to use <infile.txt rather than infile.txt, so that the file arrives on STDIN. In theory, the open pragma should control the encoding used by the magic ARGV filehandle, but I couldn't get it to work correctly in this case.

The difference between <infile.txt and infile.txt lies in how and when the file is opened. With <infile.txt, the file is opened and connected to standard input before Perl starts running. By the time the BEGIN block calls binmode on STDIN, the file is already open, so its encoding can be changed.
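
The same idea, re-layering handles that are already open, can be written as a script and checked against the test input from the question. This is a minimal sketch: filter_utf16le is a hypothetical helper, and temporary files stand in for the redirected STDIN and STDOUT.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: change the layers of two handles that are
# ALREADY open, then copy line by line with the substitution applied.
sub filter_utf16le {
    my ($in, $out) = @_;
    # ':raw' drops any CRLF translation; the encoding layer is then
    # pushed on top, so "\n" is written as 0A 00 and nothing else.
    binmode $_, ':raw:encoding(UTF-16LE)' for $in, $out;
    while (my $line = <$in>) {
        $line =~ s/b/c/g;
        print {$out} $line;
    }
}

# Put the test input from the question into a temporary file.
my ($src_fh, $src) = tempfile(UNLINK => 1);
binmode $src_fh, ':raw';
print {$src_fh} pack 'H*', 'fffe61000a0062000a00';
close $src_fh or die "Cannot close $src: $!";

# Open input and output with default layers first, as happens with
# redirection, and only then hand the open handles to the helper.
my ($dst_fh, $dst) = tempfile(UNLINK => 1);
open my $in, '<', $src or die "Cannot read $src: $!";
filter_utf16le($in, $dst_fh);
close $in;
close $dst_fh or die "Cannot close $dst: $!";

# Dump the result as hex for comparison with the expected output.
open my $chk, '<:raw', $dst or die "Cannot read $dst: $!";
my $bytes = do { local $/; <$chk> };
close $chk;
my $hex = uc(join ' ', unpack '(H2)*', $bytes);
print "$hex\n";
```

With UTF-16LE the BOM is not treated specially: it is read as the character U+FEFF at the start of the first line and written back out as FF FE. In a real script the call would presumably be filter_utf16le(\*STDIN, \*STDOUT), run with the same redirections as the one-liner.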

With infile.txt, the file name is passed as a command-line argument and placed in the @ARGV array. When the BEGIN block executes, the file has not been opened yet, so there is no handle whose encoding you could set. In theory, you ought to be able to say:

use open qw(:std IO :raw:encoding(UTF-16LE));

and have the magic <ARGV> processing apply the right encoding, but I haven't been able to get that to work in this case.
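
One way to sidestep the question of when ARGV gets opened is to open each named file explicitly, passing the layers to a three-argument open. This is not the approach described above, only a sketch of an alternative; process_files is a hypothetical helper, and the temporary files exist purely to make the example self-checking.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: open every named file with the UTF-16LE layers
# directly, instead of relying on the magic ARGV filehandle.
sub process_files {
    my ($out, @paths) = @_;
    binmode $out, ':raw:encoding(UTF-16LE)';
    for my $path (@paths) {
        open my $in, '<:raw:encoding(UTF-16LE)', $path
            or die "Cannot read $path: $!";
        while (my $line = <$in>) {
            $line =~ s/b/c/g;
            print {$out} $line;
        }
        close $in;
    }
}

# Demo on the test input from the question.
my ($src_fh, $src) = tempfile(UNLINK => 1);
binmode $src_fh, ':raw';
print {$src_fh} pack 'H*', 'fffe61000a0062000a00';
close $src_fh or die "Cannot close $src: $!";

my ($dst_fh, $dst) = tempfile(UNLINK => 1);
process_files($dst_fh, $src);
close $dst_fh or die "Cannot close $dst: $!";

# Dump the result as hex for comparison with the expected output.
open my $chk, '<:raw', $dst or die "Cannot read $dst: $!";
my $bytes = do { local $/; <$chk> };
close $chk;
my $hex = uc(join ' ', unpack '(H2)*', $bytes);
print "$hex\n";
```

In a script this could be called as process_files(\*STDOUT, @ARGV), which allows perl script.pl infile.txt >outfile.txt without the input redirection. If several input files are given, each file's BOM is copied into the output where that file begins.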
