寻找有关“Group varint 编码/解码”的更多详细信息在杰夫的幻灯片中呈现
我注意到 Jeff 的幻灯片“构建大规模信息检索系统的挑战”,也可以在此处下载:http://research.google.com/people/jeff/WSDM09-keynote.pdf,提到了一种称为“group varint编码”的整数压缩方法。据说比每字节 7 位整数编码快得多(快 2 倍)。我对此非常感兴趣,并正在寻找它的实现,或者任何可以帮助我自己实现它的更多细节。
我不是专业人士,也不是新手,欢迎任何帮助!
I noticed that in Jeff's slides "Challenges in Building Large-Scale Information Retrieval Systems", which can also be downloaded here: http://research.google.com/people/jeff/WSDM09-keynote.pdf, a method of integers compression called "group varint encoding" was mentioned. It was said much faster than 7 bits per byte integer encoding (2X more). I am very interested in this and looking for an implementation of this, or any more details that could help me implement this by myself.
I am not a pro and new to this, and any help is welcome!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
这指的是“可变整数编码”,其中序列化时用于存储整数的位数不固定为 4 个字节。 protocol buffer 文档中对 varint 进行了详细说明。
它用于编码 Google 的协议缓冲区,您可以浏览 协议缓冲区源代码。
CodedOutputStream
包含确切的编码函数WriteVarint32FallbackToArrayInline:级联
if<如果
value
的大小保证了这些额外的字节, /code>s 只会在target
数组的末尾添加额外的字节。0x80
屏蔽正在写入的字节,并且值
向下移动。据我所知,0x7f
掩码使其表示“编码的最后一个字节”。 (当对0x80
进行 OR 运算时,最高位将始终为1
,然后最后一个字节清除最高位(通过对0x7f
进行 AND 运算)因此,在读取 varint 时,您会一直读取最高位为零的字节。抱歉,该代码是关于基本 VarInt 编码的(仍然比 7-更快)。 )。不幸的是,它不是用来在协议缓冲区中存储 64 位数字的,但如果该代码是开源的,我不会感到惊讶。
位 来自
varint
的想法和幻灯片中的“Group varint”图表,自己制作应该不会太难:)这是另一页描述 Group VarInt 压缩,包含解码代码。不幸的是,他们提到了公开可用的实现,但没有提供参考。
That's referring to "variable integer encoding", where the number of bits used to store an integer when serialized is not fixed at 4 bytes. There is a good description of varint in the protocol buffer documentation.
It is used in encoding Google's protocol buffers, and you can browse the protocol buffer source code.
The
CodedOutputStream
contains the exact encoding function WriteVarint32FallbackToArrayInline:The cascading
if
s will only add additional bytes onto the end of thetarget
array if the magnitude ofvalue
warrants those extra bytes. The0x80
masks the byte being written, and thevalue
is shifted down. From what I can tell, the0x7f
mask causes it to signify the "last byte of encoding". (When OR'ing0x80
, the highest bit will always be1
, then the last byte clears the highest bit (by AND'ing0x7f
). So, when reading varints you read until you get a byte with a zero in the highest bit.I just realized you asked about "Group VarInt encoding" specifically. Sorry, that code was about basic VarInt encoding (still faster than 7-bit). The basic idea looks to be similar. Unfortunately, it's not what's being used to store 64bit numbers in protocol buffers. I wouldn't be surprised if that code was open sourced somewhere though.
Using the ideas from
varint
and the diagrams of "Group varint" from the slides, it shouldn't be too too hard to cook up your own :)Here is another page describing Group VarInt compression, which contains decoding code. Unfortunately they allude to publicly available implementations, but they don't provide references.
我一直在寻找同样的东西,并发现了这个 Java 的 GitHub 项目:
https://github.com/stuhood/gvi/
看起来很有前途!
I was looking for the same thing and found this GitHub project in Java:
https://github.com/stuhood/gvi/
Looks promising !
在 c/c++ 中,您可以使用与第一个字节中的值相对应的预定义结构,而不是使用位掩码进行解码。使用此结构的完整示例: http://www.oschina.net/code/snippet_12_5083
Instead of decoding with bitmask, in c/c++ you could use predefined structures that corresponds to the value in the first byte.. complete example that uses this: http://www.oschina.net/code/snippet_12_5083
groupvarint 的另一个 Java 实现: https://github.com/catenamatteo/groupvarint
但我怀疑 Java 中非常大的 switch 有一些缺点
Another Java implementation for groupvarint: https://github.com/catenamatteo/groupvarint
But I suspect the very large switch has some drawback in Java