实现三重存储原子
我正在尝试在 SQL 数据库之上实现我自己的三重存储(是的,我知道那里已经完成了项目),并且我正在尝试确定实现符号“原子”的最佳方法。
在简单的设计中,我们可以通过创建一个包含三个 varchar 列(称为主语、谓词、对象)的“三重”表来在 SQL 中实现三元组存储。为了节省空间,我将创建一个“原子”表,该表将存储任何主语/谓语/宾语字段中使用的唯一文本,并将这些字段更改为链接回包含其文本的原子的外键。
然而,我看到了几种实现 Atom 表的方法。
将文本存储为 varchar。
- 优点:易于索引并强制文本的唯一性。
- 缺点:它无法存储任意大的文本。
将文本存储为文本 blob,以及在查询和强制唯一性时使用的文本哈希。
- 优点:可以存储任意大的文本。
- 缺点:有点复杂。尽管很少见,但可能会发生哈希冲突,具体取决于哈希算法(md5、sha 等)。
就性能、长期可靠性和存储任何类型数据的能力而言,哪种方法更好?如果我使用哈希,是否存在对冲突的有效担忧?即使冲突很少发生,也只需要发生一次就会损坏三重存储。
I'm trying to implement my own triplestore ontop of a SQL database (yes I know there are finished projects out there) and I'm trying to decide on the best way to implement a symbolic "atom".
In a naive design, we might implement a triplestore in SQL by creating a single "triple" table with three varchar columns called subject, predicate, object. To save space I was going to create an "atom" table, that would store the unique text used in any subject/predicate/object field, and change those fields to foreign keys linking back to the atoms that contain their text.
However, I see a couple ways to implement the Atom table.
Store the text as a varchar.
- Pros: Simple to index and enforce uniqueness of the text.
- Cons: It could not store arbitrarily large text.
Store the text as a text blob, as well as a hash of the text to use when querying and enforcing uniqueness.
- Pros: Could store arbitrarily large text.
- Cons: A little more complicated. Possibly, albeit rare, hash collisions depending on the hashing algorithm (md5, sha, etc).
Which is the better approach in terms of performance, long-term reliability, and ability to store any type of data? If I use a hash, is there a valid concern about collisions? Even if collisions are rare, it would only have to happen once to corrupt triplestore.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
在您能够证明这是一个瓶颈并且是需要修复的最重要的事情之前,不要浪费任何时间尝试优化它。
“为了节省空间……”不要。空间几乎是免费的。除非您拥有超过 1 TB 的数据,否则您无需担心太多。您很容易在存储上浪费更多的时间,而不是存储的价值。
varchar 解决方案可以正常工作并扩展。 “字符串池”或“原子表”的想法实际上是一个很好的想法,因为您将拥有对同一底层对象的大量引用。为什么要重复varchar?为什么不直接重复一个索引号呢?
“任意大的文本”是一个奇怪的要求。何苦呢?
Blob 通常会比较慢。哈希冲突——虽然只是一个理论上的问题——可以通过两种方式处理。首先,使用超过 32 位的哈希值。其次,碰撞不会破坏任何东西,除非你(愚蠢地)未能检查实际的斑点以查看它们是否实际上相同。如果您想避免比较整个 blob 来确认不存在冲突,请通过不同的算法保留两个哈希值。
Don't waste any time trying to optimize this until you can prove that it's a bottleneck and is the most important thing to fix.
"To save space..." don't. Space is almost free. Unless you have over a terabyte of data, you don't have much to worry about. You can easily waste more time thinking about storage than the storage is worth.
The varchar solution will work and scale fine. The idea of a "string pool" or "atom table" is actually a good one because you'll have lots of references to the same underlying object. Why repeat the varchar? Why not just repeat an index number?
"Arbitrarily large text" is a strange requirement. Why bother?
A blob will be generally be slower. The hash collision -- while little more than a theoretical concern -- is something you handle two ways. First, use a hash with more than 32 bits. Second, a collision won't corrupt anything unless you (foolishly) fail check the actual blobs to see if they're actually the same. If you want to avoid comparing the entire blob to confirm that there's no collisions, keep two hashes by different algorithms.