A binary version of iostream

Posted 2024-07-26 23:12:32

I've been writing a binary version of iostreams. It essentially allows you to write binary files, but gives you much more control over the format of the file. Example usage:

my_file << binary::u32le << my_int << binary::u16le << my_string;

This would write my_int as an unsigned 32-bit integer, and my_string as a length-prefixed string (where the prefix is u16le). To read the file back, you would flip the arrows. Works great. However, I hit a bump in the design, and I'm still on the fence about it. So, time to ask SO. (We make a couple of assumptions at the moment, such as 8-bit bytes, 2's-complement ints, and IEEE floats.)

iostreams, under the hood, use streambufs. It's a fantastic design really -- iostreams code the serialization of an 'int' into text, and let the underlying streambuf handle the rest. Thus, you get cout, fstreams, stringstreams, etc. All of these, both the iostreams and the streambufs, are templated, usually on char, but sometimes on wchar_t. My data, however, is a byte stream, which is best represented by 'unsigned char'.

My first attempts were to template the classes based on unsigned char. std::basic_string templates well enough, but streambuf does not. I ran into several problems with a class named codecvt, which I could never get to follow the unsigned char theme. This raises two questions:

1) Why is a streambuf responsible for such things? It seems like code-conversions lie way out of a streambuf's responsibility -- streambufs should take a stream, and buffer data to/from it. Nothing more. Something as high level as code conversions feels like it should belong in iostreams.

2) Since I couldn't get the templated streambufs to work with unsigned char, I went back to char, and merely cast data between char and unsigned char. I tried to minimize the number of casts, for obvious reasons. Most of the data basically winds up in a read() or write() function, which then invokes the underlying streambuf (and uses a cast in the process). The read function is basically:

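size_t read(unsigned char *buffer, size_t size)
{
    // sgetn() takes and returns std::streamsize, so convert at the boundary
    std::streamsize ret = stream()->sgetn(reinterpret_cast<char *>(buffer),
                                          static_cast<std::streamsize>(size));
    // deal with ret for return size, eof, errors, etc.
    ...
}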

Good solution, bad solution?


The first two questions indicate that more info is needed. First, I looked at projects such as boost::serialization, but they exist at a higher level, in that they define their own binary format. This is more for reading and writing at a lower level, where you want to define the format yourself, the format is already defined, or the bulk metadata is not required or desired.

Second, some have asked about the binary::u32le modifier. It is an instance of a class that holds the desired endianness and width at the moment, and perhaps signedness in the future. The stream holds a copy of the last-passed instance of that class, and uses it in serialization. This was a bit of a workaround; I originally tried overloading the << operator like this:

bostream &operator << (uint8_t n);
bostream &operator << (uint16_t n);
bostream &operator << (uint32_t n);
bostream &operator << (uint64_t n);

However, at the time, this didn't seem to work. I had several problems with ambiguous function calls. This was especially true of constants, although you could, as one poster suggested, cast the constant or declare it as a const <type>. I seem to remember that there was some other, larger problem, however.

Comments (5)

谎言 2024-08-02 23:12:32

I agree with legalize. I needed to do almost exactly what you're doing, and looked at overloading << / >>, but came to the conclusion that iostream was just not designed to accommodate it. For one thing, I didn't want to have to subclass the stream classes to be able to define my overloads.

My solution (which only needed to serialize data temporarily on a single machine, and therefore did not need to address endianness) was based on this pattern:

// deducible template argument read
template <class T>
void read_raw(std::istream& stream, T& value,
    typename boost::enable_if< boost::is_pod<T> >::type* dummy = 0)
{
    stream.read(reinterpret_cast<char*>(&value), sizeof(value));
}

// explicit template argument read
template <class T>
T read_raw(std::istream& stream)
{
    T value;
    read_raw(stream, value);
    return value;
}

template <class T>
void write_raw(std::ostream& stream, const T& value,
    typename boost::enable_if< boost::is_pod<T> >::type* dummy = 0)
{
    stream.write(reinterpret_cast<const char*>(&value), sizeof(value));
}

I then further overloaded read_raw/write_raw for any non-POD types (e.g. strings). Note that only the first version of read_raw need be overloaded; if you use ADL correctly, the second (1-arg) version can call 2-arg overloads defined later and in other namespaces.

Write example:

int32_t x = 0;
int64_t y = 0;
int8_t z = 0;
write_raw(os, x); // os is a std::ostream
write_raw(os, y);
write_raw<int16_t>(os, z); // explicitly write int8_t as int16_t

Read example:

int32_t x = read_raw<int32_t>(is); // explicit form
int64_t y;
read_raw(is, y); // implicit form
int8_t z = boost::numeric_cast<int8_t>(read_raw<int16_t>(is)); // narrowing is range-checked

It's not as sexy as overloaded operators, and things don't fit on one line as easily (which I tend to avoid anyway, since debug breakpoints are line-oriented), but I think it turned out simpler, more obvious, and not much more verbose.

流绪微梦 2024-08-02 23:12:32

As I understand it, the stream properties that you're using to specify types would be more appropriate for specifying endianness, packing, or other "metadata" values. The handling of types themselves should be done by the compiler. At least, that's the way the STL seems to be designed.

If you use overloads to separate the types automatically, you would need to specify the type only when it was different from the declared type of the variable:

Stream& operator<<(int8_t);
Stream& operator<<(uint8_t);
Stream& operator<<(int16_t);
Stream& operator<<(uint16_t);
// etc.

uint32_t x;
stream << x << (uint16_t)x;

Reading types other than the declared type would be a little messier. In general, though, I think reading into or writing from variables of a type different from the output type should be avoided.

I believe the default version of std::codecvt does nothing, returning "noconv" for everything. It only really does anything when using the "wide" character streams. Can't you set up a similar definition for codecvt? If, for some reason, it's impractical to define a no-op codecvt for your stream, then I don't see any problem with your casting solution, especially since it's isolated to one location.
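The claim about the default facet can be checked for plain char with a few lines. This is only a sketch for the char-to-char case: the helper name is made up, and it says nothing about unsigned char, for which the standard library is not required to supply a codecvt facet, so you would most likely have to provide your own.

```cpp
#include <cwchar>
#include <locale>

// Hypothetical helper: reports whether the char-to-char conversion facet
// of the classic "C" locale declares itself a no-op.
inline bool char_codecvt_is_noop()
{
    typedef std::codecvt<char, char, std::mbstate_t> facet_type;
    return std::use_facet<facet_type>(std::locale::classic()).always_noconv();
}
```

For the char specialization this returns true, which is what lets a file buffer skip the conversion step entirely.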

Finally, are you sure you wouldn't be better off using some standard serialization code, like Boost, rather than rolling your own?

捶死心动 2024-08-02 23:12:32

We needed to do something similar to what you are doing, but we followed another path. I am interested in how you have defined your interface. One part I don't see how you can handle is the manipulators you have defined (binary::u32le, binary::u16le).

With basic_streams, a manipulator controls how all the following elements will be read or written, but in your case that probably does not make sense, as the size (part of your manipulator information) is affected by the variable passed in or out.

binary_istream in;
int i;
int i2;
short s;
in >> binary::u16le >> i >> binary::u32le >> i2 >> s;

In the code above, it can make sense to say that, although the i variable is 32 bits (assuming int is 32 bits), you want to extract only 16 bits from the serialized stream, while you want to extract the full 32 bits into i2. After that, either the user is forced to introduce a manipulator for each and every other value that is passed in, or else the manipulator remains in effect, so that when the short is passed in, 32 bits are read with a possible overflow. Either way, the user will probably get unexpected results.

Size does not seem to belong (in my opinion) to manipulators.

Just as a side note, in our case we had other constraints, such as runtime definition of types, so we ended up building our own meta-type system to build types at runtime (a kind of variant). We then implemented serialization and deserialization for those types (Boost style), so our serializers don't work with basic C++ types, but rather with serialization/data pairs.

半衬遮猫 2024-08-02 23:12:32

I wouldn't use operator<<, as it's too intimately associated with formatted text I/O.

I wouldn't use an operator overload at all for this, actually. I'd find another idiom.

往日 2024-08-02 23:12:32

In modern C++, you can use << with binary data by using string_view, because it does not rely on null termination and is explicitly sized.

char buf[] = "this buffer can hold binary data, including null characters";
// sizeof(buf) also counts the terminating '\0'
std::cout << std::string_view(buf, sizeof(buf));
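A small sketch that makes the effect observable, assuming C++17. The helper name stream_bytes is made up for illustration, and an ostringstream stands in for cout so the result can be inspected.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <string_view>

// Hypothetical helper: streams `size` bytes starting at `data`, embedded
// '\0' bytes included, and returns what the stream received.
inline std::string stream_bytes(const char* data, std::size_t size)
{
    std::ostringstream out;
    out << std::string_view(data, size);
    return out.str();
}
```

Note that this is still formatted output: a non-zero width() on the stream would pad the data, so for strictly raw bytes write() may be the safer call.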