What is VInt in Lucene?
I want to know what VInt is in Lucene.
I read this article, but I don't understand what it is or where Lucene uses it.
Why doesn't Lucene use a simple integer or a big integer?
Thanks.
Comments (3)
VInt is extremely space efficient. Compared with a fixed 4-byte integer, it can theoretically save up to 75% of the space, which is the case when a value fits in a single byte.
In Lucene, many of the structures are lists of integers: for example, the list of documents for a given term, and the positions (and offsets) of the terms in documents, among others. These lists form the bulk of the Lucene data.
Think of Lucene indices for millions of documents that need tens of GBs of space. Shrinking the data by more than half reduces disk space requirements. Saving disk space may not be a big win in itself, given that disk space is cheap; the real gain comes from reduced disk IO. Reading VInt data requires less disk IO than reading fixed-width integers, which translates directly into better performance.
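To make the space argument concrete, here is a minimal sketch that counts how many bytes a list of small integers would take as fixed 4-byte values versus a 7-bits-per-byte variable-length encoding. The class name `VIntSizeSketch`, the helper names, and the sample values are made up for illustration; this is not Lucene's own code.

```java
public class VIntSizeSketch {

    // Bytes needed for a non-negative int when every byte carries 7 payload bits.
    static int vIntSize(int value) {
        int bytes = 1;
        while ((value & ~0x7F) != 0) { // more than 7 significant bits remain
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    // Total variable-length size of a whole list of values.
    static int totalVIntBytes(int[] values) {
        int total = 0;
        for (int v : values) {
            total += vIntSize(v);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical sample: mostly small numbers, as the answer assumes.
        int[] sample = {3, 17, 120, 300, 5, 16384, 42, 9};
        System.out.println("fixed-width bytes: " + 4 * sample.length);      // 32
        System.out.println("variable-length bytes: " + totalVIntBytes(sample)); // 11
    }
}
```

The saving depends entirely on the values being small: an int with more than 28 significant bits takes 5 bytes in this scheme, one more than the fixed-width form.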
For your first question:
"A variable-length format for positive integers is defined where the high-order bit of each byte indicates whether more bytes remain to be read. The low-order seven bits are appended as increasingly more significant bits in the resulting integer value. Thus values from zero to 127 may be stored in a single byte, values from 128 to 16,383 may be stored in two bytes, and so on." https://lucene.apache.org/core/3_0_3/fileformats.html
So, to store a list of n integers as fixed-width 4-byte values, you would need 4*n bytes. With VInt, every number below 128 is stored in only 1 byte (and so on for the larger ranges), which saves a lot of space.
VInt provides a compressed representation of integers, and Shashikant's answer already explains the requirements and benefits of compression in Lucene.
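The format quoted above can be sketched as a small encoder and decoder. This is an illustrative stand-alone version, assuming non-negative input; the class and method names are hypothetical. In Lucene itself the corresponding operations are `DataOutput.writeVInt` and `DataInput.readVInt`.

```java
import java.io.ByteArrayOutputStream;

public class VIntCodecSketch {

    // Emit 7 bits per byte, least significant group first.
    // The high bit (0x80) is set on every byte except the last.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // low 7 bits plus "more bytes follow" flag
            value >>>= 7;
        }
        out.write(value); // final byte, high bit clear
        return out.toByteArray();
    }

    // Rebuild the value by placing each 7-bit group at increasing bit offsets.
    static int decode(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // high bit clear: this was the last byte
            }
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        for (int v : new int[] {0, 127, 128, 16383, 16384}) {
            byte[] enc = encode(v);
            System.out.println(v + " -> " + enc.length + " byte(s), decoded " + decode(enc));
        }
    }
}
```

With this sketch, 127 encodes to the single byte 0x7F, 128 to the two bytes 0x80 0x01, and 16,384 to the three bytes 0x80 0x80 0x01, which matches the ranges given in the quoted documentation.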
VInt refers to Lucene's variable-width integer encoding scheme. It encodes an integer in one or more bytes, using only the low seven bits of each byte for data. The high bit is set to one on every byte except the last, where it is zero, which is how the length is encoded.