What is the best approach for working with decimals and datetimes in Protocol Buffers?

Posted on 2024-08-06 09:09:51

I would like to find the best way to store some common data types that are not in the list of types supported by Protocol Buffers.

  • datetime (seconds precision)
  • datetime (milliseconds precision)
  • decimals with fixed precision
  • decimals with variable precision
  • lots of bool values (if you have lots of them, it looks like you'll have 1-2 bytes of overhead for each of them due to their tags)

The idea is also to map them easily to the corresponding C++/Python/Java data types.

Comments (4)

君勿笑 2024-08-13 09:09:51

The protobuf design rationale is most likely to keep data type support as "native" as possible, so that it's easy to adopt new languages in the future. I suppose they could provide built-in message types, but where do you draw the line?

My solution was to create two message types:

DateTime
TimeSpan

This is only because I come from a C# background, where these types are taken for granted.

In retrospect, TimeSpan and DateTime may have been overkill, but it was a "cheap" way of avoiding conversion from h/m/s to s and vice versa; that said, it would have been simple to just implement a utility function such as:

int TimeUtility::ToSeconds(int h, int m, int s)
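For illustration, here is a rough Python sketch of what such a utility could look like. The names `to_seconds` and `from_seconds` are made up for this example and are not part of protobuf or of the code described above.

```python
def to_seconds(hours: int, minutes: int, seconds: int) -> int:
    """Collapse an h/m/s triple into a single count of seconds."""
    return hours * 3600 + minutes * 60 + seconds


def from_seconds(total: int) -> tuple[int, int, int]:
    """Split a non-negative count of seconds back into (hours, minutes, seconds)."""
    minutes, seconds = divmod(total, 60)
    hours, minutes = divmod(minutes, 60)
    return hours, minutes, seconds
```

With helpers like these, the message only needs to carry one integer field, and the h/m/s view is rebuilt on each side.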

Bklyn pointed out that heap memory is used for nested messages; in some cases this is clearly a valid concern, and we should always be aware of how memory is used. In other cases it matters less, because we're more worried about ease of implementation (this is the Java/C# philosophy, I suppose).

There's also a small disadvantage to using non-intrinsic types with the protobuf TextFormat::Printer; you cannot specify the format in which it is displayed, so it'll look something like:

my_datetime {
    seconds: 10
    minutes: 25
    hours: 12
}

... which is too verbose for some. That said, it would be harder to read if it were represented in seconds.

To conclude, I'd say:

  • If you're worried about memory/parsing efficiency, use seconds/milliseconds.
  • However, if ease of implementation is the objective, use nested messages (DateTime, etc).
一身软味 2024-08-13 09:09:51

Here are some ideas based on my experience with a wire protocol similar to Protocol Buffers.

datetime (seconds precision)

datetime (milliseconds precision)

I think the answer to these two would be the same; you would just typically be dealing with a smaller range of numbers in the case of seconds precision.

Use a sint64/sfixed64 to store the offset in seconds/milliseconds from some well-known epoch like midnight GMT 1/1/1970. This is how Date objects are represented internally in Java (as milliseconds since that epoch). I'm sure there are analogs in Python and C++.

If you need time zone information, pass around your date/times in terms of UTC and model the pertinent time zone as a separate string field. For that, you can use the identifiers from the Olson Zoneinfo database (now maintained as the IANA time zone database), since that has become somewhat standard.

This way you have a canonical representation for date/time, but you can also localize to whatever time zone is pertinent.
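A minimal Python sketch of that idea, assuming the wire field is a signed 64-bit count of milliseconds since the Unix epoch and the zone is sent separately as an Olson identifier string. The function names are invented for this example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(dt: datetime) -> int:
    """Convert an aware datetime to whole milliseconds since the Unix epoch."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # Integer floor division avoids floating-point rounding;
    # sub-millisecond precision is dropped.
    return (dt - EPOCH) // timedelta(milliseconds=1)


def from_epoch_millis(millis: int) -> datetime:
    """Rebuild a UTC datetime from milliseconds since the Unix epoch."""
    return EPOCH + timedelta(milliseconds=millis)


def localize(millis: int, tz_name: str) -> datetime:
    """View the same instant in a named zone, e.g. "Europe/Berlin"."""
    # zoneinfo is in the standard library from Python 3.9 and needs
    # time zone data to be available on the system.
    from zoneinfo import ZoneInfo
    return from_epoch_millis(millis).astimezone(ZoneInfo(tz_name))
```

The integer is the canonical value on the wire; `localize` is only applied at the edges, when a human-facing time is needed.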

decimals with fixed precision

My first thought is to use a string similar to how one constructs Decimal objects from Python's decimal package. I suppose that could be inefficient relative to some numerical representation.

There may be better solutions depending on what domain you're working with. For example, if you're modeling a monetary value, maybe you can get away with using a uint32/64 (or a signed integer type if amounts can be negative) to communicate the value in cents as opposed to fractional dollar amounts.
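As a sketch of the cents idea in Python, assuming both sides agree out of band on a fixed scale of two decimal places. `SCALE`, `to_scaled_int` and `from_scaled_int` are names invented for this example.

```python
from decimal import Decimal

SCALE = 2  # assumed number of decimal places agreed on by both sides


def to_scaled_int(value: Decimal, scale: int = SCALE) -> int:
    """Turn a decimal such as 19.99 into an integer count of minor units (1999)."""
    scaled = value.scaleb(scale)
    if scaled != scaled.to_integral_value():
        # Refuse to silently round values with more digits than the scale allows.
        raise ValueError(f"{value} has more than {scale} decimal places")
    return int(scaled)


def from_scaled_int(units: int, scale: int = SCALE) -> Decimal:
    """Turn an integer count of minor units back into a decimal."""
    return Decimal(units).scaleb(-scale)
```

The integer then goes into an ordinary int64 (or uint64) field, and the scale lives in the schema documentation rather than on the wire.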

There are also some useful suggestions in this thread.

decimals with variable precision

Doesn't Protocol Buffers already support this with float/double scalar types? Maybe I've misunderstood this bullet point.

Anyway, if you had a need to go around those scalar types, you can encode using IEEE-754 to uint32 or uint64 (float vs double respectively). For example, Java allows you to extract the IEEE-754 representation and vice versa from Float/Double objects. There are analogous mechanisms in C++/Python.
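In Python, one way to get at the IEEE-754 bit pattern is the standard `struct` module. This is only a sketch with invented helper names; it reinterprets the same bytes as an unsigned integer and back.

```python
import struct


def double_to_bits(value: float) -> int:
    """Return the 64-bit IEEE-754 pattern of a double as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def bits_to_double(bits: int) -> float:
    """Rebuild a double from its 64-bit IEEE-754 pattern."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def float_to_bits(value: float) -> int:
    """Return the 32-bit IEEE-754 pattern of a single-precision float."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def bits_to_float(bits: int) -> float:
    """Rebuild a single-precision float from its 32-bit IEEE-754 pattern."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

Note that this preserves binary floating-point values exactly; it does not make them decimal, so a value like 0.1 is still only an approximation.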

lots of bool values (if you have lots of them it looks like you'll have 1-2 bytes overhead for each of them due to their tags)

If you are concerned about wasted bytes on the wire, you could use bit-masking techniques to compress many booleans into a single uint32 or uint64.
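A small Python sketch of the bit-masking approach, assuming both sides agree on which bit position means what. The helper names are invented for this example.

```python
def pack_flags(flags: list[bool]) -> int:
    """Pack up to 64 booleans into one integer; flags[0] becomes the lowest bit."""
    if len(flags) > 64:
        raise ValueError("at most 64 flags fit in a uint64")
    mask = 0
    for position, flag in enumerate(flags):
        if flag:
            mask |= 1 << position
    return mask


def unpack_flags(mask: int, count: int) -> list[bool]:
    """Read the lowest `count` bits of the mask back into booleans."""
    return [bool((mask >> position) & 1) for position in range(count)]
```

The trade-off is that the meaning of each bit is no longer visible in the schema, so the bit layout has to be documented somewhere both sides can see.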

Because there isn't first-class support in Protocol Buffers, all of these techniques require a bit of a gentlemen's agreement between agents. Perhaps using a naming convention on your fields like "_dttm" or "_mask" would help communicate when a given field has additional encoding semantics above and beyond the default behavior of Protocol Buffers.

假装不在乎 2024-08-13 09:09:51

Sorry, not a complete answer, but a "me too".

I think this is a great question, one I'd love an answer to myself. The inability to natively describe fundamental types like datetimes and (for financial applications) fixed-point decimals, or to map them to language-specified or user-defined types, is a real killer for me. It's more or less prevented me from being able to use the library, which I otherwise think is fantastic.

Declaring your own "DateTime" or "FixedPoint" message in the proto grammar isn't really a solution, because you'll still need to convert your platform's representation to/from the generated objects manually, which is error prone. Additionally, these nested messages get stored as pointers to heap-allocated objects in C++, which is wildly inefficient when the underlying type is basically just a 64-bit integer.

Specifically, I'd want to be able to write something like this in my proto files:

message Something {
   required fixed64 time = 1 [cpp_type="boost::posix_time::ptime"];
   required int64 price = 2 [cpp_type="fixed_point<int64_t, 4>"];
   ...
 };

And I would be required to provide whatever glue was necessary to convert these types to/from fixed64 and int64 so that the serialization would work. Maybe through something like adobe::promote?

旧城空念 2024-08-13 09:09:51

For datetime with millisecond resolution I used an int64 that has the datetime as YYYYMMDDHHMMSSmmm. This makes it both concise and readable, and surprisingly, will last a very long time.
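As a rough illustration of that layout in Python (the helper names are invented here, and time zone handling is left out on the assumption that both sides agree on one zone):

```python
from datetime import datetime


def encode_readable(dt: datetime) -> int:
    """Pack a datetime into a decimal-digit integer laid out as YYYYMMDDHHMMSSmmm."""
    value = dt.year
    for part in (dt.month, dt.day, dt.hour, dt.minute, dt.second):
        value = value * 100 + part
    return value * 1000 + dt.microsecond // 1000


def decode_readable(value: int) -> datetime:
    """Unpack a YYYYMMDDHHMMSSmmm integer back into a naive datetime."""
    value, millis = divmod(value, 1000)
    value, second = divmod(value, 100)
    value, minute = divmod(value, 100)
    value, hour = divmod(value, 100)
    value, day = divmod(value, 100)
    year, month = divmod(value, 100)
    return datetime(year, month, day, hour, minute, second, millis * 1000)
```

Even the largest four-digit year stays well below the signed 64-bit limit, which is why the format lasts so long. The cost is that the numbers are not evenly spaced, so differences between two values are not durations.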

For decimals, I used byte[], knowing that there's no better representation that won't be lossy.
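The answer doesn't say what goes into the bytes. One lossless possibility, sketched here in Python purely as an assumption, is to send the decimal's text form as ASCII bytes, which keeps the exact digits, including trailing zeros.

```python
from decimal import Decimal


def decimal_to_bytes(value: Decimal) -> bytes:
    """Serialize a decimal as the ASCII bytes of its text form, e.g. b"1.2300"."""
    return str(value).encode("ascii")


def bytes_to_decimal(data: bytes) -> Decimal:
    """Parse the ASCII bytes back into a decimal with the same digits."""
    return Decimal(data.decode("ascii"))
```

The result would go into a protobuf `bytes` field. It is larger on the wire than a scaled integer, but it needs no agreed scale.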
