Sending binary-safe data over the network in Perl

Posted on 2024-12-10 03:07:05

I'm implementing a network client that sends messages to a server. The messages are streams of bytes, and the protocol requires that I send the length of each stream beforehand.

If the message that I am given (by the code using my module) is a byte string, then the length is given easily enough by length $string. But if it's a string of characters, I'll need to massage it to get the raw bytes. What I'm doing now is basically this:

use Encode ();

my $msg = shift;   # some message from calling code
my $bytes;
if ( utf8::is_utf8( $msg ) ) {
    $bytes = Encode::encode( 'utf-8', $msg );
} else {
    $bytes = $msg;
}

my $length = length $bytes;

Is this the correct way to handle this? It seems to work so far, but I haven't done any serious testing yet. What potential pitfalls are there with this approach?

Thanks

Comments (3)

梦里泪两行 2024-12-17 03:07:05

You shouldn't really be guessing at what your input is. Define your code to accept either byte strings or Unicode character strings, and leave it to the caller to convert the input to the proper format (or provide some way for the caller to specify which kind of strings they're providing).

If you define your code to accept byte strings, then any characters above \xFF are an error.

If you define your code to accept Unicode character strings, then you can convert them to bytes with Encode::encode_utf8() (and should do so regardless of how they're internally represented by Perl).
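A minimal sketch of those two contracts, assuming the module exposes one entry point per kind of string. The names frame_bytes and frame_text, and the 32-bit big-endian length prefix, are illustrative assumptions rather than anything the question's protocol specifies:

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 );

# Hypothetical entry point for callers that already hold raw bytes.
# A code point above 0xFF cannot be a byte, so it is rejected.
sub frame_bytes {
    my ($bytes) = @_;
    die "Wide character in byte string\n" if $bytes =~ /[^\x00-\xFF]/;
    return pack('N/a*', $bytes);
}

# Hypothetical entry point for callers that hold decoded text.
# The text is always encoded, whatever Perl's internal representation is.
sub frame_text {
    my ($text) = @_;
    return frame_bytes(encode_utf8($text));
}

my $from_bytes = frame_bytes("\x00\xFF");
my $from_text  = frame_text("\x{263A}");      # U+263A, three bytes in UTF-8
my $rejected   = !eval { frame_bytes("\x{263A}"); 1 };

print unpack('H*', $from_bytes), "\n";        # length prefix, then 2 payload bytes
print unpack('H*', $from_text), "\n";         # length prefix, then 3 payload bytes
print "wide character rejected\n" if $rejected;
```

Neither function looks at how the string is stored; the caller's choice of entry point is what says whether the data is bytes or text.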

In any case, calling utf8::is_utf8() is usually a mistake — your program should not care about the internal representation of strings, only about the actual data (a sequence of characters) they contain. Whether some of those characters (in particular, those in the range \x80 to \xFF) are internally represented by one or two bytes should not matter.
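To make the pitfall concrete, here is a small sketch. The helper guess_bytes is a hypothetical wrapper around the logic from the question, and the sample string is made up for the demonstration:

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 );

# The question's approach, wrapped in a hypothetical helper for comparison.
sub guess_bytes {
    my ($msg) = @_;
    return utf8::is_utf8($msg) ? encode_utf8($msg) : $msg;
}

my $plain    = "caf\xE9";      # 4 characters, stored one byte per character
my $upgraded = $plain;
utf8::upgrade($upgraded);      # same 4 characters, stored internally as UTF-8

my $same_text   = $plain eq $upgraded ? 1 : 0;
my $len_plain   = length(guess_bytes($plain));
my $len_upgrade = length(guess_bytes($upgraded));
my $len_encoded = length(encode_utf8($plain));

print "strings compare equal: $same_text\n";
print "guessed byte lengths: $len_plain vs $len_upgrade\n";
print "encode_utf8 byte length for either: $len_encoded\n";
```

The two strings are equal as far as Perl code can tell, yet the flag-based guess puts different bytes on the wire for them. Encoding unconditionally gives the same result for both.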

P.S. Reading perldoc Encode may help to clarify issues with bytes and characters in Perl.

短叹 2024-12-17 03:07:05

The sender:

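use Encode qw( encode_utf8 );

# Encode text as UTF-8 and prefix it with its byte length
# as a 32-bit big-endian unsigned integer.
sub pack_text {
   my ($text) = @_;
   my $bytes = encode_utf8($text);
   die "Text too long" if length($bytes) > 4294967295;   # must fit in 32 bits
   return pack('N/a*', $bytes);
}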

The receiver:

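use Encode qw( decode_utf8 );

# Read exactly $to_read bytes from $fh. read() may return fewer bytes
# than requested, so keep appending to $buf until the count is met.
sub read_bytes {
   my ($fh, $to_read) = @_;
   my $buf = '';
   while ($to_read > 0) {
      my $bytes_read = read($fh, $buf, $to_read, length($buf));
      die $! if !defined($bytes_read);       # I/O error
      die "Premature EOF" if !$bytes_read;   # peer closed mid-message
      $to_read -= $bytes_read;
   }
   return $buf;
}

# Read the 32-bit big-endian length prefix.
sub read_uint32 {
   my ($fh) = @_;
   return unpack('N', read_bytes($fh, 4));
}

# Read one length-prefixed message and decode it from UTF-8 to characters.
sub read_text {
   my ($fh) = @_;
   return decode_utf8(read_bytes($fh, read_uint32($fh)));
}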
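
The two halves can be exercised together without a network. The sketch below repeats the subs so that it runs on its own and, as an assumption made purely for the demo, uses an in-memory filehandle in place of the socket. The sample text is made up:

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 decode_utf8 );

sub pack_text {
    my ($text) = @_;
    my $bytes = encode_utf8($text);
    die "Text too long" if length($bytes) > 4294967295;
    return pack('N/a*', $bytes);
}

sub read_bytes {
    my ($fh, $to_read) = @_;
    my $buf = '';
    while ($to_read > 0) {
        my $bytes_read = read($fh, $buf, $to_read, length($buf));
        die $! if !defined($bytes_read);
        die "Premature EOF" if !$bytes_read;
        $to_read -= $bytes_read;
    }
    return $buf;
}

sub read_text {
    my ($fh) = @_;
    my $length = unpack('N', read_bytes($fh, 4));
    return decode_utf8(read_bytes($fh, $length));
}

my $text = "na\xEFve \x{263A}";    # 7 characters, 10 bytes once encoded
my $wire = pack_text($text);

# In-memory handle standing in for the socket (demo assumption only).
open(my $fh, '<:raw', \$wire) or die "Cannot open in-memory handle: $!";
my $received = read_text($fh);
close($fh);

print "wire length: ", length($wire), "\n";
print "length prefix (hex): ", unpack('H8', $wire), "\n";
print "round trip ok\n" if $received eq $text;
```

The length prefix counts encoded bytes, not characters, which is why it differs from length($text). On a real socket the handle should likewise carry no encoding layer, so that read() returns raw bytes.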

帝王念 2024-12-17 03:07:05

perldoc -f length used to say, back in v5.8,

... you will get the number of characters, not the number of bytes.
To get the length in bytes, use "do { use bytes; length(EXPR) }",
see bytes.

The modern docs for length don't mention bytes:

length() normally deals in
logical characters, not physical bytes. For how many bytes a
string encoded as UTF-8 would take up, use
"length(Encode::encode_utf8(EXPR))" (you'll have to "use
Encode" first). See Encode and perlunicode.

but I don't think that deprecates the do { use bytes; ... } solution.
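For comparison, a small sketch of how the two measurements behave on the same text. The sample string is made up, and the output is only meant to show what each form counts:

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 );

my $plain    = "r\xE9sum\xE9";   # 6 characters, stored one byte per character
my $upgraded = $plain;
utf8::upgrade($upgraded);        # same characters, stored internally as UTF-8

my $chars         = length($upgraded);
my $raw_plain     = do { use bytes; length($plain) };
my $raw_upgraded  = do { use bytes; length($upgraded) };
my $utf8_plain    = length(encode_utf8($plain));
my $utf8_upgraded = length(encode_utf8($upgraded));

print "characters: $chars\n";
print "use bytes:   $raw_plain (plain) vs $raw_upgraded (upgraded)\n";
print "encode_utf8: $utf8_plain (plain) vs $utf8_upgraded (upgraded)\n";
```

The use bytes form reports the size of Perl's internal buffer, so it can differ between two strings that compare equal. The encode_utf8 form reports the size of what would actually be sent as UTF-8, which is the number a length prefix needs.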
