Google Protocol Buffers and the use of std::string for arbitrary binary data

Posted on 2025-01-07 04:27:13

Related Question: vector <unsigned char> vs string for binary data.

My code uses vector<unsigned char> for arbitrary binary data. However, a lot of my code has to interface with Google's protocol buffers code, which uses std::string for arbitrary binary data. This makes for a lot of ugly allocate/copy/free cycles just to move data between protocol buffers and my code. It also leads to many cases where I need two constructors (one taking a vector and one taking a string) or two functions to convert an object to binary wire format.

The code deals with raw structures a lot internally because structures are content-addressable (stored and retrieved by hash), signed, and so on. So it's not just a matter of the interface to Google's protocol buffers. Objects are handled in raw forms in other parts of the code as well.

One thing I could do is just cut all my code over to use std::string for arbitrary binary data. Another thing I could do is try to work out more efficient ways to store and retrieve my vectors into Google protocol buffer objects. I guess my other choice would be to create standard, simple, but slow conversion functions to strings and always use them. This would avoid the rampant code duplication, but would be worst from a performance standpoint.

Any suggestions? Any better choices I'm missing?

This is what I'm trying to avoid:

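if(SomeCase)
{
    // Allocate a second buffer the same size as the std::string returned by objectdata()...
    std::vector<unsigned char> rawObject(objectdata().size());
    // ...and copy every byte into it, only so the callee receives a vector.
    memcpy(rawObject.data(), objectdata().data(), objectdata().size());
    DoSomethingWith(rawObject);
}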

The allocate, copy, process, free is completely senseless when the raw data is already sitting there.

Comments (2)

你是我的挚爱i 2025-01-14 04:27:14

There are two ways to avoid copying that I know of and have seen in use.

The traditional way is indeed to pass a pointer or reference to a known entity. While this works fine with a minimum of fuss, it ties you to a given representation, which forces conversions (as you experienced) whenever another representation is needed.

The other way I discovered with LLVM, in its StringRef and ArrayRef<T> classes.

The idea is amazingly simple: both hold a T* pointing to the start of an array of T and a size_t indicating the number of elements.

What is magical is that they completely hide the actual storage, be it a string, a vector, a dynamically or statically allocated C-array... it does not matter. The interface presented is completely uniform and no copy is involved.

The only caveat is that they do not take ownership of the memory (Ref!) so subtle bugs might creep in if you do not take care. Still, it is usually fine if you only use them in transient operations (within a function, for example) and do not store them for later use.

I have found them incredibly handy in buffer manipulations, especially thanks to the free slicing operations. Ranges are just so much easier to manipulate than pairs of iterators.
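
To make the pointer-plus-length idea concrete, here is a minimal sketch of such a non-owning view. The names ByteView, slice and sumBytes are hypothetical and invented for illustration; this is not LLVM's actual class, only an assumption about what a stripped-down equivalent could look like.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical non-owning view over contiguous bytes: a pointer and a length.
// It never allocates, copies, or frees; the viewed storage must outlive it.
class ByteView {
public:
    ByteView(const unsigned char* data, std::size_t size)
        : data_(data), size_(size) {}

    // View the bytes of a std::string (for example a protobuf bytes field).
    ByteView(const std::string& s)
        : data_(reinterpret_cast<const unsigned char*>(s.data())),
          size_(s.size()) {}

    // View the bytes of a vector<unsigned char>.
    ByteView(const std::vector<unsigned char>& v)
        : data_(v.data()), size_(v.size()) {}

    const unsigned char* data() const { return data_; }
    std::size_t size() const { return size_; }
    const unsigned char* begin() const { return data_; }
    const unsigned char* end() const { return data_ + size_; }

    // Slicing only adjusts the pointer and the length; no bytes are copied.
    ByteView slice(std::size_t offset, std::size_t count) const {
        assert(offset <= size_ && count <= size_ - offset);
        return ByteView(data_ + offset, count);
    }

private:
    const unsigned char* data_;
    std::size_t size_;
};

// Hypothetical consumer: one signature serves both representations.
inline unsigned long sumBytes(ByteView bytes) {
    unsigned long total = 0;
    for (unsigned char b : bytes) total += b;
    return total;
}
```

A function written against such a view can be called with either a std::string or a vector<unsigned char> without a second overload. If the toolchain allows it, the standard library offers similar non-owning types (std::string_view since C++17, std::span since C++20), which may be preferable to a hand-rolled class.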


There is also a third way I have experienced, but never used in serious code up until now. The idea is that a vector<unsigned char> is a very low-level representation. By raising the abstraction level and using, say, a Buffer class, you can completely encapsulate the exact way the memory is stored, so that it becomes a non-issue as far as your code is concerned.

And then, feel free to choose the memory representation that requires the fewest conversions.
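
As a rough illustration of that third approach, the sketch below wraps the storage in a small class. Everything here is hypothetical: the Buffer name, its member functions, and the choice of std::string as the backing store (picked only on the assumption that most of the data ends up crossing the protocol buffers boundary).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical Buffer: callers see bytes, the storage type is a private detail.
// Assumption: std::string is used because it needs no conversion at the
// protocol buffers boundary; another store could be swapped in later.
class Buffer {
public:
    Buffer() = default;

    // Takes over an existing string (move it in to avoid a copy).
    explicit Buffer(std::string bytes) : storage_(std::move(bytes)) {}

    // Converting from a vector copies once, at construction time only.
    explicit Buffer(const std::vector<unsigned char>& bytes)
        : storage_(bytes.begin(), bytes.end()) {}

    const unsigned char* data() const {
        return reinterpret_cast<const unsigned char*>(storage_.data());
    }
    std::size_t size() const { return storage_.size(); }

    void append(const unsigned char* bytes, std::size_t count) {
        assert(bytes != nullptr || count == 0);
        storage_.append(reinterpret_cast<const char*>(bytes), count);
    }

    // Read-only access for APIs that expect a std::string.
    const std::string& asString() const { return storage_; }

    // Hands the storage to the caller and leaves this Buffer empty.
    std::string release() {
        std::string out;
        out.swap(storage_);
        return out;
    }

private:
    std::string storage_;
};
```

The rest of the code base would then only deal with data() and size(), so switching the backing store later means touching this one class rather than every call site.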

晚风撩人 2025-01-14 04:27:14

To avoid this code (which you present),

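if(SomeCase)
{
    // Allocate a second buffer the same size as the std::string returned by objectdata()...
    std::vector<unsigned char> rawObject(objectdata().size());
    // ...and copy every byte into it, only so the callee receives a vector.
    memcpy(rawObject.data(), objectdata().data(), objectdata().size());
    DoSomethingWith(rawObject);
}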

where presumably objectdata() returns a std::string, consider

typedef unsigned char      Byte;
typedef std::vector<Byte>  ByteVector;

and then e.g.

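if( someCase )
{
    // Bind the string once, then build the vector from its iterator range.
    // This still copies the bytes once, but without the manual memcpy.
    auto const& s = objectdata();
    doSomethingWith( ByteVector( s.begin(), s.end() ) );
}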
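
For completeness, the range-constructor idea can be packaged as a pair of small helpers covering both directions. The function names toByteVector, toString and checkRoundTrip are hypothetical and not part of the answer above; each conversion still copies the bytes once.

```cpp
#include <cassert>
#include <string>
#include <vector>

typedef unsigned char      Byte;
typedef std::vector<Byte>  ByteVector;

// std::string -> ByteVector via the iterator-range constructor (one copy).
inline ByteVector toByteVector(const std::string& s) {
    return ByteVector(s.begin(), s.end());
}

// ByteVector -> std::string, the reverse direction (also one copy).
inline std::string toString(const ByteVector& v) {
    return std::string(v.begin(), v.end());
}

// Sanity check: converting there and back preserves every byte,
// including embedded NULs, because both types track an explicit length.
inline void checkRoundTrip(const std::string& s) {
    assert(toString(toByteVector(s)) == s);
}
```

Centralising the conversion like this removes the duplicated memcpy boilerplate, at the cost the question already mentions: the copy itself remains.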