Google protocol buffers and use of std::string for arbitrary binary data
Related Question: vector<unsigned char> vs string for binary data.
My code uses vector<unsigned char> for arbitrary binary data. However, a lot of my code has to interface to Google's protocol buffers code. Protocol buffers uses std::string for arbitrary binary data. This makes for a lot of ugly allocate/copy/free cycles just to move data between Google protocol buffers and my code. It also makes for a lot of cases where I need two constructors (one which takes a vector and one a string) or two functions to convert a function to binary wire format.
The code deals with raw structures a lot internally because structures are content-addressable (stored and retrieved by hash), signed, and so on. So it's not just a matter of the interface to Google's protocol buffers. Objects are handled in raw forms in other parts of the code as well.
One thing I could do is just cut all my code over to use std::string for arbitrary binary data. Another thing I could do is try to work out more efficient ways to store and retrieve my vectors into Google protocol buffer objects. I guess my other choice would be to create standard, simple, but slow conversion functions to strings and always use them. This would avoid the rampant code duplication, but would be worst from a performance standpoint.
Any suggestions? Any better choices I'm missing?
This is what I'm trying to avoid:
if(SomeCase)
{
    std::vector<unsigned char> rawObject(objectdata().size());
    memcpy(&rawObject.front(), objectdata().data(), objectdata().size());
    DoSometingWith(rawObject);
}
The allocate, copy, process, free is completely senseless when the raw data is already sitting there.
Comments (2)
There are two ways to avoid copying that I know of and have seen in use.
The traditional way is indeed to pass a pointer/reference to a known entity. While this works fine and with a minimum of fuss, the issue is that it ties you up to a given representation, which entails conversions (as you experienced) when necessary.
The other way I discovered with LLVM:
The idea is amazingly simple: both hold a T* pointing to the start of an array of T and a size_t indicating the number of elements.
What is magical is that they completely hide the actual storage, be it a string, a vector, a dynamically or statically allocated C-array... it does not matter. The interface presented is completely uniform and no copy is involved.
The only caveat is that they do not take ownership of the memory (Ref!) so subtle bugs might creep in if you do not take care. Still, it is usually fine if you only use them in transient operations (within a function, for example) and do not store them for later use.
I have found them incredibly handy in buffer manipulations, especially thanks to the free slicing operations. Ranges are just so much easier to manipulate than pairs of iterators.
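To make the pointer-plus-length idea concrete, here is a minimal sketch of such a non-owning view. The description above appears to match LLVM's llvm::ArrayRef<T> and llvm::StringRef, but that identification is an assumption; the ByteRef class and the Checksum function below are hypothetical names written only for illustration and are not LLVM's API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical non-owning view over a contiguous run of bytes.
// It stores only a pointer and a length; it never allocates or copies.
class ByteRef {
public:
    ByteRef() : data_(nullptr), size_(0) {}
    ByteRef(const unsigned char* data, std::size_t size)
        : data_(data), size_(size) {}

    // View the bytes held by a std::string (e.g. a protocol buffers bytes field).
    ByteRef(const std::string& s)
        : data_(reinterpret_cast<const unsigned char*>(s.data())),
          size_(s.size()) {}

    // View the bytes held by a std::vector<unsigned char>.
    ByteRef(const std::vector<unsigned char>& v)
        : data_(v.data()), size_(v.size()) {}

    // View a statically sized C array.
    template <std::size_t N>
    ByteRef(const unsigned char (&a)[N]) : data_(a), size_(N) {}

    const unsigned char* data() const { return data_; }
    std::size_t size() const { return size_; }
    bool empty() const { return size_ == 0; }
    const unsigned char* begin() const { return data_; }
    const unsigned char* end() const { return data_ + size_; }

    unsigned char operator[](std::size_t i) const {
        assert(i < size_);
        return data_[i];
    }

    // Slicing is just pointer arithmetic: the sub-view shares the same bytes.
    ByteRef slice(std::size_t offset, std::size_t count) const {
        assert(offset <= size_ && count <= size_ - offset);
        return ByteRef(data_ + offset, count);
    }

private:
    const unsigned char* data_;
    std::size_t size_;
};

// Example consumer: one signature serves every storage type.
unsigned int Checksum(ByteRef bytes) {
    unsigned int sum = 0;
    for (unsigned char b : bytes) sum += b;
    return sum;
}
```

With this in place, Checksum can be called with a std::string, a std::vector<unsigned char>, or an unsigned char array, and none of those calls copies the bytes. The view is only valid while the underlying container is alive and unmodified, which is the ownership caveat mentioned above.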
There is also a third way I have experienced, but never used in serious code up until now. The idea is that a vector<unsigned char> is a very low-level representation. By raising the abstraction layer and using, say, a Buffer class, you can completely encapsulate the exact way the memory is stored so that it becomes a non-issue, as far as your code is concerned.
And then, feel free to choose the one memory representation that requires the fewest conversions.
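As a rough illustration of that third approach, a Buffer class might look like the sketch below. Everything here is hypothetical: the member names, and the choice of std::string as the hidden storage, are assumptions made because std::string is what the question says protocol buffers uses for binary data.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical owning byte buffer. The storage type is a private detail,
// so callers never depend on whether it is a string or a vector.
class Buffer {
public:
    Buffer() = default;

    // Copy in raw bytes from any source.
    Buffer(const void* data, std::size_t size) {
        assert(data != nullptr || size == 0);
        if (size != 0) storage_.assign(static_cast<const char*>(data), size);
    }

    // Adopt an existing string by move instead of copying byte by byte.
    explicit Buffer(std::string bytes) : storage_(std::move(bytes)) {}

    const unsigned char* data() const {
        return reinterpret_cast<const unsigned char*>(storage_.data());
    }
    std::size_t size() const { return storage_.size(); }
    bool empty() const { return storage_.empty(); }

    void append(const void* data, std::size_t size) {
        assert(data != nullptr || size == 0);
        if (size != 0) storage_.append(static_cast<const char*>(data), size);
    }

    // Read-only access in the representation the serialization layer expects.
    const std::string& str() const { return storage_; }

    // Hand the bytes back to the caller, leaving this buffer empty.
    std::string release() {
        std::string out;
        out.swap(storage_);
        return out;
    }

private:
    std::string storage_;
};
```

If the rest of the code only ever sees Buffer, switching the private member to std::vector<unsigned char> later would touch this one class rather than every constructor and conversion function.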
To avoid this code (which you present),
where presumably objectData is a std::string, consider
and then e.g.
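The snippets this answer refers to are not shown here, so the following is only one possible reading and not the answer author's code. It assumes the processing routine can be written against a raw pointer and a length, with thin overloads that forward from std::string and std::vector<unsigned char> without allocating. The name DoSomethingWith and its byte-counting body are hypothetical stand-ins.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Core routine (hypothetical): works directly on the bytes wherever they live.
// As a stand-in for real processing, it counts the non-zero bytes.
std::size_t DoSomethingWith(const unsigned char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    std::size_t nonZero = 0;
    for (std::size_t i = 0; i < size; ++i) {
        if (data[i] != 0) ++nonZero;
    }
    return nonZero;
}

// Forwarding overload for std::string: reinterprets the existing bytes, no copy.
inline std::size_t DoSomethingWith(const std::string& s) {
    return DoSomethingWith(
        reinterpret_cast<const unsigned char*>(s.data()), s.size());
}

// Forwarding overload for std::vector<unsigned char>: no copy either.
inline std::size_t DoSomethingWith(const std::vector<unsigned char>& v) {
    return DoSomethingWith(v.data(), v.size());
}
```

Under these assumptions, the branch in the question would shrink to a single call that passes the std::string straight through, with no temporary vector and no memcpy.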