What is a Protobuf message?

Published 2025-02-12 09:01:01 · 423 characters · 1 view · 0 comments


I'm learning how to use tf.records and in the official tutorial they mention you can print a tf.train.Example message (which is a primitive of the protobuf protocol if I get it right).

I understand that tf.records are used to serialize the data, and that they use the protobuf protocol in this case. I also understand that using tf.train.Feature, tf.train.Features and tf.train.Example one can convert the data into the right format.

My question is what does it mean to print a messege in this context? (the tutorial shows how to print an tf.train.Example message)


Answers (1)

记忆之渊 · answered 2025-02-19 09:01:01


A message is classically thought of as a collection of bytes that are conveyed from one process/thread to another process/thread. Typically (but not necessarily), the collection of bytes means something to the sender and receiver, e.g. it's an object that has been serialised somehow (perhaps using Google Protocol Buffers). So, an object can become a message by serialising it and placing the bytes into an array that one might term a "message".

It's not necessarily the case that the processes handling the collection of bytes will deserialise them. For example, a process that is simply going to pass them onwards down another connection need not actually deserialise them if it already knows where the bytes are supposed to be sent.

The means by which a message is conveyed is typically some sort of queue / pipe / socket / stream / etc. Where it gets interesting is that most data transports of this sort are stream connections: whatever bytes you push in one end come out the other. So, then, how to use those for sending messages?

The answer is that there has to be some way of demarcating between messages. There's lots of ways of doing that, but these days it makes far more sense to use something like ZeroMQ, which takes care of all that for you (and more besides). ZeroMQ is a library / protocol that allows a program to transfer a collection of bytes from one process/thread to another via stream connections, and ensure that the receiving program gets the collection in one nice and complete buffer. Those bytes could be objects serialised by Google Protocol Buffer, or serialised in some other way (there's lots). HTTP is also used as a way of moving objects around, e.g. a page of HTML.

So the pattern is object -> serialisation -> message buffer -> some sort of byte transport that demarcates one message from another -> message buffer -> deserialisation -> object.
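The pattern above can be sketched in a few lines of Python. This is a minimal illustration, not ZeroMQ's actual wire format: `pickle` stands in for whatever serialiser you choose (Protocol Buffers, JSON, ...), and a 4-byte length prefix does the demarcation between messages on the stream.

```python
import io
import pickle
import struct

def send_message(stream, obj):
    payload = pickle.dumps(obj)                    # object -> serialisation
    stream.write(struct.pack(">I", len(payload)))  # 4-byte length header demarcates the message
    stream.write(payload)                          # message buffer -> byte transport

def recv_message(stream):
    (length,) = struct.unpack(">I", stream.read(4))  # read the demarcating header
    payload = stream.read(length)                    # exactly one message's bytes
    return pickle.loads(payload)                     # deserialisation -> object

# A BytesIO stands in for the socket/pipe: push two messages in one end,
# and they come out the other as two distinct, complete objects.
wire = io.BytesIO()
send_message(wire, {"id": 1, "text": "hello"})
send_message(wire, [1, 2, 3])
wire.seek(0)
print(recv_message(wire))  # {'id': 1, 'text': 'hello'}
print(recv_message(wire))  # [1, 2, 3]
```

Without the length prefix, the receiver would just see one undifferentiated run of bytes, which is exactly the problem libraries like ZeroMQ solve for you.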

An advantage of serialisations like Protocol Buffers is that the sender and receiver need not be written in the same language, or share anything at all except for the .proto file. Other approaches to serialisation often involve marking up class definitions in the program source code, which then makes it difficult to deserialise data in another language.
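A minimal, hypothetical .proto sketch of such a shared contract (the message and field names here are invented for illustration); a Python sender and, say, a Java receiver can each generate their own native classes from this one file:

```proto
// Hypothetical example.proto: the only artefact sender and receiver share.
syntax = "proto3";

message SensorReading {
  string sensor_id = 1;   // field numbers, not names, go on the wire
  double value     = 2;
  int64  timestamp = 3;
}
```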

Also in languages like C/C++ one might get away with simply copying the bytes at the object's address from one place to another. This can be a total disaster if the destination is a different machine; endianness etc. can matter a lot. There are serialisation standards that get close to this, specifically Cap'n Proto (see this).
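A small sketch of why raw byte copies break across machines: the same 32-bit integer has a different byte layout depending on the machine's endianness, which Python's `struct` module can make visible directly.

```python
import struct

# The same integer, 0x01020304, laid out in memory two different ways.
value = 0x01020304
little = struct.pack("<I", value)  # little-endian, as on x86
big    = struct.pack(">I", value)  # big-endian, a.k.a. "network byte order"

print(little.hex())  # 04030201
print(big.hex())     # 01020304

# A receiver that naively copied the little-endian bytes and read them
# back as big-endian would see a completely different number:
(wrong,) = struct.unpack(">I", little)
print(hex(wrong))  # 0x4030201
```

A proper serialisation defines the byte order once, in the wire format, so neither end has to care what the other's CPU does.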

There are variations. Within a process, "passing a message" can simply mean passing ownership of an object around. Ownership can be by convention, i.e. if I've just written the object pointer to a message queue, I won't mutate the object any more. I think in Rust it's even expressed by the language syntax, in that once object ownership has been given up the language won't let you mutate the object (worked out at compile time, part of what makes Rust so good). The net result looks like message transfer, but in fact all that's happened is that a pointer (typically 64 bits) has been copied from A to B, not the entire data in the object. This is a lot faster.
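A minimal Python sketch of the "pass ownership, not bytes" idea: putting an object on an in-process queue hands over a reference, so the receiver gets the very same object, and by convention the sender stops touching it afterwards (Python won't enforce that convention the way Rust does).

```python
import queue
import threading

# Ownership-by-convention sketch: only a reference crosses the queue,
# not a copy of the (possibly large) object behind it.
inbox = queue.Queue()
payload = {"big": list(range(1_000_000))}

def receiver(results):
    obj = inbox.get()          # receives the same object, not a copy
    results.append(obj)

results = []
t = threading.Thread(target=receiver, args=(results,))
t.start()
inbox.put(payload)             # by convention, the sender no longer mutates payload
t.join()

print(results[0] is payload)   # True: same object identity, zero payload bytes copied
```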

EDIT

So, How Does a Message Transport Protocol Work?

It's worth digging into how something like ZeroMQ works. For it to be able to pass whole application messages across a stream connection, it needs to operate some sort of protocol. That protocol is itself going to involve objects (Protocol Data Units) being "serialised" (well, converted to an agreed wire format), pushed through the stream connection, deserialised, and understood by the ZeroMQ library on the receiving end. And, when one gets down to it, ZeroMQ is using TCP (over a network), and that too is a protocol built on IP. And that goes on down to Ethernet frames.

So, there's protocols running atop protocols, running atop other protocols (in fact, this is the Layer Model of how computer interconnectedness works).

Why That Matters, and What Can Go Wrong

It's useful to bear this protocol layering in mind. Sometimes, one might have a requirement (for example) to take very strong measures against buffer overflows, perhaps to prevent remote exploitation. That might be a reason to pick a serialisation technology that helps guard against such things - e.g. Protocol Buffers. However, when picking such a technology, one has to realise that the requirement is met only if all of the protocol layers are equally robust. There's no point using, say, Protocol Buffers and declaring oneself safe against buffer overflows if the OS's IP stack is broken and exploitable.

This is well illustrated by the Heartbleed bug in OpenSSL (see here). It was caused, in effect, by a weakly specified protocol (see RFC 6520); the protocol is defined in English prose, and requires the programmer to read it, code up the protocol by hand, and pay attention to all the strictures written in the document. The associated RFC 5246 even says:

This document deals with the formatting of data in an external representation. The following very basic and somewhat casually defined presentation syntax will be used. The syntax draws from several sources in its structure. Although it resembles the programming language "C" in its syntax and XDR [XDR] in both its syntax and intent, it would be risky to draw too many parallels. The purpose of this presentation language is to document TLS only; it has no general application beyond that particular goal.

The Heartbleed bug in OpenSSL was the result of this English-language spec being coded up wrongly, and given that caveat perhaps it's no great surprise. Applications that were using OpenSSL were wide, wide open to exploitation, even though the applications themselves (e.g. web servers) were very well written implementations of, say, HTTPS.

Now, had the designers of TLS chosen to use a decent and strict serialisation technology - perhaps even Google Protocol Buffers (plus some message demarcation) - to define the PDUs in TLS, it would have been far more likely that Heartbleed wouldn't have happened. Specifically, the payload_length field in a request / response would have been taken care of inside Google Protocol Buffers, thereby removing responsibility for handling the length of the payload from the developer.
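An illustrative sketch of the Heartbleed class of bug and the check a decent framing library performs for you (this is a toy format, not OpenSSL's code or the real TLS record layout): a sender-declared payload_length must be validated against the bytes actually received before anything is echoed back.

```python
import struct

def parse_heartbeat(message: bytes) -> bytes:
    """Toy parser for a heartbeat-like record: a 2-byte declared payload
    length followed by the payload itself (padding omitted for brevity)."""
    (declared_len,) = struct.unpack(">H", message[:2])
    payload = message[2:]
    # The check a Heartbleed-style bug omits: never trust the sender's length.
    if declared_len > len(payload):
        raise ValueError("declared payload_length exceeds actual payload")
    return payload[:declared_len]

# A well-formed request: declares 4 bytes, sends 4 bytes, gets "ping" echoed.
ok = struct.pack(">H", 4) + b"ping"
print(parse_heartbeat(ok))  # b'ping'

# A malicious request claims 65535 bytes but sends 4; a naive parser that
# trusted declared_len would read (and echo) adjacent memory instead.
evil = struct.pack(">H", 65535) + b"ping"
try:
    parse_heartbeat(evil)
except ValueError as e:
    print("rejected:", e)
```

With Protocol Buffers the length bookkeeping lives inside the generated parsing code, so there is no hand-written version of this check to get wrong.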

What's interesting is to compare protocol specifications as written in RFCs with those found in the world of telephony (regulated by the International Telecommunication Union). The ITU's specifications and tools are very "comprehensive" (that ought to be an acceptably neutral way of describing them). A lot of telephony uses ASN.1, which is not dissimilar to (and substantially pre-dates) Google Protocol Buffers. It allows for very strict definitions of messages, requires pretty comprehensive tools to do it right, and is bang up to date (it even has JSON as a wire format these days).

"But", one points out, "what if the ASN.1 tools (or Google Protocol Buffers) have a bug?". Well indeed that is a problem, and it has indeed happened to ASN.1 (in one of the commercial ASN.1 tool vendors' products; I can't remember which). But the point is that if there's one library that is widely used for defining lots of interfaces, then there's a greater chance of bugs being identified (I myself have found and reported bugs in commercial ASN.1 tools). Whereas if a messaging protocol is defined in, say, English prose, there will only ever be a very few sets of eyes on how well the developer has coded up the meaning of that English.

Not Everyone Has Got the Message

What I find disappointing is that, across a large portion of the software world, there's still resistance to using tools like Google Protocol Buffers or ASN.1. There are also projects that, having identified the need for such things, go and invent their own.

One such example is dBus, which to be fair is pretty good. However, they did go and invent their own serialisation technology for specifying dBus messages; I'm not sure what they gained over using something mature and off the shelf.

Google themselves, when they first announced Google Protocol Buffers to the world, were asked "Why didn't you use ASN.1?", and the Googler on stage had to admit to never having heard of it. So Googlers at Google hadn't used Google to Google for "binary serialisation technologies"; they just went ahead and wrote their own, and GPB is missing a ton of useful features. Oh, the irony. They wouldn't even have had to write a toolset from scratch; they could simply have adopted and improved on one of the open-source ASN.1 implementations.

Transliteration Problem

This fragmentation and proliferation causes problems. Say, for example, in your project you want to be able to transfer some of your messages into a dBus service on Linux. To do that, you've got a .proto defining your messages, which is great for communicating in and out of TensorFlow, but fundamentally useless for dBus, which speaks its own format. You'd end up with something like

MyProtoMsg ipMsg;
MyEquivalentDBusMsg opMsg;

opMsg.field1 = ipMsg.field1;
opMsg.field2 = ipMsg.field2;
opMsg.field3 = ipMsg.field3;

and so on. Very laborious, very unmaintainable, and needlessly consumes resources. The other option would be simply to wrap up your GPB encoded messages in a byte array in a dBus message, but one feels that's missing the point (it bypasses any opportunity for dBus to assert that messages it's passing are correctly formed and within specifications).
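If one is stuck doing this transliteration, the field-by-field copy can at least be automated when the field names happen to line up. A hypothetical sketch, using Python dataclasses as stand-ins for the generated GPB and dBus message classes (the class and field names are invented for illustration):

```python
from dataclasses import dataclass, fields

# Hypothetical stand-ins for a generated protobuf class and its dBus twin.
@dataclass
class MyProtoMsg:
    field1: int
    field2: str
    field3: float

@dataclass
class MyEquivalentDBusMsg:
    field1: int = 0
    field2: str = ""
    field3: float = 0.0

def transliterate(src, dst):
    """Copy every same-named field from src to dst: the laborious
    opMsg.fieldN = ipMsg.fieldN lines, done in one loop."""
    for f in fields(dst):
        if hasattr(src, f.name):
            setattr(dst, f.name, getattr(src, f.name))
    return dst

ip_msg = MyProtoMsg(field1=42, field2="hello", field3=3.14)
op_msg = transliterate(ip_msg, MyEquivalentDBusMsg())
print(op_msg)  # MyEquivalentDBusMsg(field1=42, field2='hello', field3=3.14)
```

This removes the tedium but not the underlying objection: the two schemas still evolve independently, and nothing checks that "same name" means "same meaning".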

If the world agreed on the One True Serialisation technology then the flexibility in object / message exchange would be fantastic.
