Cassandra: Memory/Encoding Footprint of Keys (Hash/byte[] => Hex => UTF-16 => byte[])
I am trying to understand the implications of using an MD5 hash as a Cassandra key, in terms of "memory/storage consumption":
- MD5 hash of my content (in Java) = byte[] that is 16 bytes long. (16 bytes is from Wikipedia for generic MD5; I am not sure whether the Java implementation also returns 16 bytes.)
- Hex-encode this value to be able to print it in a human-readable format => 1 byte becomes 2 hex values
- I have to represent every hex value as a "character" in Java => result = "two string character values" (for example, "FF" is a string of length/size = 2).
- Java uses UTF-16 => so every "string character" is encoded with two bytes. "FF" would require 2x2 bytes?
- Conclusion => The MD5 hash in byte format is 16 bytes, but represented as a Java hex UTF-16 string it consumes 16x2x2 = 64 bytes (in memory)!? Is this correct?
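The arithmetic in the list above can be checked with a small Java sketch. The class name `Md5KeySize` and its helper methods are made up for illustration; only `MessageDigest` and `StandardCharsets` come from the JDK. It counts character data only and ignores object headers and any JVM-specific string storage.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class, not part of any Cassandra client.
class Md5KeySize {
    // Returns the raw MD5 digest of the content's UTF-8 bytes.
    static byte[] md5(String content) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // Standard JDKs ship an MD5 provider, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    // Encodes each byte as two lowercase hex characters.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = md5("some content");
        String hex = toHex(raw);
        System.out.println("raw digest bytes: " + raw.length);   // 16
        System.out.println("hex characters:   " + hex.length()); // 32
        // UTF_16BE is used so that no byte-order mark is added to the count.
        System.out.println("hex as UTF-16:    "
                + hex.getBytes(StandardCharsets.UTF_16BE).length); // 64
        System.out.println("hex as UTF-8:     "
                + hex.getBytes(StandardCharsets.UTF_8).length);    // 32
    }
}
```

The UTF-8 line is there only for comparison: hex digits are ASCII, so the same string takes 32 bytes when serialized as UTF-8. Which encoding a given client uses when it writes a string key is not stated in this thread.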
What is the storage consumption in Cassandra when using this as a row key?
If I had directly used the byte array from the hash function, I would assume it consumes 16 bytes in Cassandra?
But if I use the hex-string representation (as noted above), can Cassandra "compress" it to 16 bytes, or will it also take 64 bytes in Cassandra? I assume 64 bytes in Cassandra; is this correct?
What kind of keys do you use? Do you directly use the output of a hash function, or do you first encode it into a hex string and then use the string?
(In MySQL, whenever I used a hash key, I always used its hex-string representation, so it is directly readable in the MySQL tools and in the whole application. But I now realize it wastes storage?)
Maybe my thinking is completely incorrect; in that case it would be kind of you to explain where I am wrong.
Thanks very much!
jens
Comments (1)
Correct on both counts: the byte[] would be 16 bytes, and the hex string as UTF-16 would be 64.
In 0.8, Cassandra has key metadata, so you can tell it "this key is a byte[]" and it will display in hex in the CLI.
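If the key is stored as the raw 16 bytes and hex is used only for display, the application needs a way to go back and forth between the two forms. A minimal sketch of such a round trip follows; `HexKeyCodec` and its methods are hypothetical names, not a Cassandra or client-library API.

```java
// Hypothetical helper for converting between a raw key and its hex display form.
class HexKeyCodec {
    // Raw key bytes -> lowercase hex string (two characters per byte).
    static String toHex(byte[] key) {
        StringBuilder sb = new StringBuilder(key.length * 2);
        for (byte b : key) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }

    // Hex string (upper or lower case) -> raw key bytes.
    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd-length hex string");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit");
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] key = fromHex("5d41402abc4b2a76b9719d911017c592");
        System.out.println(key.length + " bytes, shown as " + toHex(key));
    }
}
```

With helpers like these, the stored key stays at 16 bytes while logs and tools can still show the familiar 32-character hex form.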