How do I process a Unicode-like file with Perl?

Posted on 2024-11-05 09:56:05

I have a legacy program that generates a log file when it runs. Now I need to analyze this log file.

But the file format is very strange. As shown below, when I open it in vi it looks like a Unicode file, but it does not start with FFFE. After opening it in Notepad, saving it, and opening it again, I found that Notepad had added FFFE. I can then use the command 'type log.txt > log1.txt' to convert the whole file to ANSI format, and afterwards use /TDD/ in Perl to search for what I need.

But now, I can't deal with this file format.

Any comments or ideas would be much appreciated.

0000000: 5400 4400 4400 3e00 2000 4c00 6f00 6100  T.D.D.>. .L.o.a.

After Notepad saves it:

0000000: fffe 5400 4400 4400 3e00 2000 4c00 6f00  ..T.D.D.>. .L.o.

open STDIN, "< log.txt";
while(<>)
{
  if (/TDD/)
  {
    # Add my logic.
  }
}

I have read the thread "How can I open a Unicode file with Perl?", which is very useful, but I still can't resolve my problem.

I can't add an answer, so I am editing my question.

Thanks Michael,
I tried your script but got the following error. My Perl version is 5.1 and the OS is Windows 2008.

* ascii
* ascii-ctrl
* iso-8859-1
* null
* utf-8-strict
* utf8
UTF-16:Unrecognised BOM 5400 at test.pl line 12.

Update

I tried UTF-16LE with the command:

perl.exe open.pl utf-16le utf-16 <my log file>.txt

but I still got this error:

UTF-16LE:Partial character at open.pl line 18, <$fh> line 1824.

I also tried utf-16be and got the same error.

If I use utf-16, I get this error:

UTF-16:Unrecognised BOM 5400 at open.pl line 18.

Line 18 of open.pl is "print while <$fh>;".

Any ideas?

Updated: 5/11/2011.
Thank you all for your help. I resolved the problem.
I found that the data in the log file is not UTF-16 after all. So I had to write a .NET project in Visual Studio that reads the log file as UTF-16 and writes a new file as UTF-8. I then used a Perl script to parse that file and generate the result data. It works now.

So, if any of you know how to use Perl to read a file containing a lot of garbage data, please tell me. Thank you very much.

For example, a sample of the garbage data:

tests.cpp:34)
਍吀䐀䐀㸀 䰀漀愀搀椀渀最 挀挀洀挀漀爀攀⸀搀氀

Opened in a hex viewer:

0000070: a88d e590 80e4 9080 e490 80e3 b880 e280  ................
0000080: 80e4 b080 e6bc 80e6 8480 e690 80e6 a480  ................
0000090: e6b8 80e6 9c80 e280 80e6 8c80 e68c 80e6  ................
00000a0: b480 e68c 80e6 bc80 e788 80e6 9480 e2b8  ................

Answer by 哑, 2024-11-12 09:56:05

Your file seems to be encoded in UTF-16LE. The bytes Notepad adds are called a "byte order mark", or BOM for short.

Here's how you can read your file using Perl:

use strict;
use warnings;
use Encode;
# list loaded encodings
print STDERR map "* $_\n", Encode->encodings;
# read arguments
my $enc = shift || 'utf16';
die "no files :-(\n" unless @ARGV;
# process files
for ( @ARGV ) {
    open my $fh, "<:encoding($enc)", $_ or die "open $_: $!";
    print <$fh>;
    close $fh;
}
# loaded more encodings now
print STDERR map "* $_\n", Encode->encodings;

Proceed like this, taking care to supply the correct encoding for your file:

perl open.pl utf16 open.utf16be.txt
perl open.pl utf16 open.utf16le.txt
perl open.pl utf16le open.utf16le.nobom.txt
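
The three invocations above differ only in whether the file carries a BOM. As a minimal sketch of making that choice automatically, the snippet below inspects the first two raw bytes. The helper name pick_utf16_encoding is hypothetical, and falling back to UTF-16LE when no BOM is present is an assumption based on the hex dump in the question (54 00 44 00 ...), not something Perl can detect for you.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: choose an Encode name from the first raw bytes of a file.
# With a BOM, plain "UTF-16" lets Encode read the byte order from the BOM.
# Without one, UTF-16LE is only an assumption that fits the question's hex dump.
sub pick_utf16_encoding {
    my ($head) = @_;
    return 'UTF-16' if $head =~ /\A(?:\xFF\xFE|\xFE\xFF)/;   # BOM present
    return 'UTF-16LE';                                        # no BOM: assume LE
}

# Three in-memory samples: LE with BOM, BE with BOM, LE without BOM.
for my $bytes ("\xFF\xFET\0D\0D\0", "\xFE\xFF\0T\0D\0D", "T\0D\0D\0") {
    my $enc  = pick_utf16_encoding($bytes);
    my $text = decode($enc, $bytes);
    print "$enc: $text\n";
}
```

For a real file, the first two bytes could be read from a handle opened with the ":raw" layer, and the chosen name then passed to the script as its encoding argument.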

Here's the revised version following tchrist's suggestions:

use strict;
use warnings;
use Encode;

# read arguments
my $enc_in  = shift || die 'pass file encoding as first parameter';
my $enc_out = shift || die 'pass STDOUT encoding as second parameter';
print STDERR "going to read files as encoded in: $enc_in\n";
print STDERR "going to write to standard output in: $enc_out\n";
die "no files :-(\n" unless @ARGV;

binmode STDOUT, ":encoding($enc_out)"; # latin1, cp1252, utf8, UTF-8

print STDERR map "* $_\n", Encode->encodings; # list loaded encodings

for ( @ARGV ) { # process files
    open my $fh, "<:encoding($enc_in)", $_ or die "open $_: $!";
    print while <$fh>;
    close $fh;
}

print STDERR map "* $_\n", Encode->encodings; # more encodings now
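
If the log mixes correctly aligned UTF-16 with segments that start on an odd byte offset, as the garbage sample in the question suggests, a strict decoder will keep failing with errors such as "Partial character". One crude workaround, sketched below, is to skip decoding entirely: read the raw bytes and delete every NUL. This is only a sketch under a strong assumption, namely that the log text is ASCII-only, so that every NUL byte is UTF-16 padding. Any non-ASCII character would be mangled. The helper name ascii_lines_from_raw and the sample bytes are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn raw UTF-16-looking log bytes into plain ASCII lines.
# Assumes the log text is ASCII-only, so every NUL byte is padding.
sub ascii_lines_from_raw {
    my ($bytes) = @_;
    $bytes =~ tr/\0//d;              # drop NULs, whatever the byte alignment
    return split /\r?\n/, $bytes;    # split on LF or CRLF
}

# Made-up sample shaped like the question's garbage data: an ASCII line ending
# in a bare CR LF, followed by text whose 16-bit units are shifted by one byte.
my $sample = "tests.cpp:34)\r\n\0T\0D\0D\0>\0 \0L\0o\0a\0d\0\r\0\n\0";

print "$_\n" for grep { /TDD/ } ascii_lines_from_raw($sample);
```

To apply this to a real file, the bytes would be slurped from a handle opened with the ":raw" layer (so that no newline translation or decoding happens first) and passed to the helper in place of the sample string.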
