How can I store data for approximate queries?
I'm trying to find a way to store my data so that lookups are fast (better than O(n)).
My database consists of 4096-byte strings, each representing some information about an item.
The problem is that the query is never exact: I get one item and then need to find the closest match using a function F(a,b).
Just an example:
1234
3456
6466
F(a,b) = return % of similar digits
GetClosest(1233,F) = 1234
The problem is that F(a,b) is a complicated algorithm (not a proper metric).
What I do now is simply go over the whole database to search for the best match.
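For reference, the linear scan described above can be sketched in Python as follows. The real F is not shown in the question, so `digit_similarity` is only a toy stand-in based on the "% of similar digits" example, and the names `get_closest` and `database` are illustrative assumptions.

```python
from typing import Callable, Sequence


def digit_similarity(a: str, b: str) -> float:
    """Toy stand-in for F: percentage of positions where a and b agree."""
    if len(a) != len(b):
        raise ValueError("this toy F assumes equal-length strings")
    if not a:
        return 100.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def get_closest(query: str, database: Sequence[str],
                f: Callable[[str, str], float]) -> str:
    """Linear scan: evaluate f against every stored item and keep the best."""
    if not database:
        raise ValueError("database is empty")
    return max(database, key=lambda item: f(query, item))


database = ["1234", "3456", "6466"]
print(get_closest("1233", database, digit_similarity))  # prints 1234
```

Every query costs one evaluation of F per stored item, which is the O(n) behaviour the question wants to beat.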
Is there a kind of tree or other clustering data structure that would give me better search complexity?
More information:
F returns a similarity value as a percentage, where 100% is a perfect match.
Comments (2)
Sorry, the answer is "probably not" unless there is some more structure to your problem that you haven't described. With 4096-byte strings you're suffering from the curse of dimensionality.
If you had shorter strings and enough data that the nearest match was highly likely to be identical to the query over a large chunk of the string, you could store your data in multiple tree-like structures, each indexed over a different chunk of the string. With high likelihood the nearest match would be close enough that you could prove it was nearest based only on nearby elements in those trees. However, given the size of your strings and the limited amount of data a computer can store, this is not going to work.
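The chunk idea can be illustrated with a small Python sketch. It rests on an assumption the question does not grant for the real F: that closeness means "few differing positions". Under that assumption, an item that differs from the query in at most k positions must agree with it exactly on at least one of k + 1 disjoint chunks (pigeonhole). Plain dictionaries stand in for the "tree-like structures", and `ChunkIndex`, `mismatches` and `nearest_within` are hypothetical names.

```python
from collections import defaultdict


def mismatches(a: str, b: str) -> int:
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


class ChunkIndex:
    """Index equal-length strings by each of `num_chunks` disjoint chunks."""

    def __init__(self, length: int, num_chunks: int):
        if not 1 <= num_chunks <= length:
            raise ValueError("num_chunks must be between 1 and length")
        self.length = length
        # Chunk boundaries that cover the string as evenly as possible.
        self.bounds = [
            (i * length // num_chunks, (i + 1) * length // num_chunks)
            for i in range(num_chunks)
        ]
        # One table per chunk: chunk contents -> items containing that chunk.
        self.tables = [defaultdict(list) for _ in self.bounds]

    def add(self, item: str) -> None:
        if len(item) != self.length:
            raise ValueError("item has the wrong length")
        for (lo, hi), table in zip(self.bounds, self.tables):
            table[item[lo:hi]].append(item)

    def candidates(self, query: str) -> set:
        """Items that agree with the query exactly on at least one chunk."""
        found = set()
        for (lo, hi), table in zip(self.bounds, self.tables):
            found.update(table.get(query[lo:hi], ()))
        return found


def nearest_within(index: ChunkIndex, query: str, k: int):
    """Return the provably nearest item if it has at most k mismatches, else None."""
    if len(index.bounds) < k + 1:
        raise ValueError("need at least k + 1 chunks for the guarantee")
    best = min(index.candidates(query),
               key=lambda item: mismatches(query, item), default=None)
    if best is None or mismatches(query, best) > k:
        return None  # nothing provably nearest; fall back to a full scan
    return best


index = ChunkIndex(length=4, num_chunks=2)
for item in ["1234", "3456", "6466"]:
    index.add(item)
print(nearest_within(index, "1233", k=1))  # prints 1234
```

With 4096-byte strings the chunks would have to be long for the tables to be selective, and a long chunk rarely matches exactly, which is the scaling problem the answer points out.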
That said, do you need the exact closest match, or only a reasonably close one? If a likely close one is enough, you could index the data by several random sparse samples of bits. In a search you would then check only the elements that match the query exactly on at least one of those samples. This greatly reduces the search space while rejecting relatively few of the close neighbors, and may produce reasonable (though frequently wrong) answers.
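A minimal Python sketch of the random-sample index might look like this. It assumes that items which score high under F also tend to share most of their bits, which the question does not guarantee. `BitSampleIndex`, its parameters and `byte_similarity` (a placeholder for the real F) are hypothetical names chosen for illustration.

```python
import random
from collections import defaultdict


def byte_similarity(a: bytes, b: bytes) -> float:
    """Placeholder for F: percentage of byte positions that are equal."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


class BitSampleIndex:
    """Several hash tables, each keyed by a random sparse sample of bits."""

    def __init__(self, length: int, num_tables: int = 8,
                 bits_per_sample: int = 4, seed: int = 0):
        if length < 1:
            raise ValueError("length must be positive")
        rng = random.Random(seed)  # fixed seed keeps the sketch repeatable
        self.length = length
        # Each table looks at its own random subset of bit positions.
        self.samples = [rng.sample(range(length * 8), bits_per_sample)
                        for _ in range(num_tables)]
        self.tables = [defaultdict(list) for _ in self.samples]

    @staticmethod
    def _key(item: bytes, positions) -> tuple:
        # Extract the sampled bits of the item as a hashable tuple.
        return tuple((item[p // 8] >> (p % 8)) & 1 for p in positions)

    def add(self, item: bytes) -> None:
        if len(item) != self.length:
            raise ValueError("item has the wrong length")
        for positions, table in zip(self.samples, self.tables):
            table[self._key(item, positions)].append(item)

    def query(self, item: bytes, f):
        """Rank with f only the items sharing at least one sampled key."""
        candidates = set()
        for positions, table in zip(self.samples, self.tables):
            candidates.update(table.get(self._key(item, positions), ()))
        if not candidates:
            return None  # no bucket matched; the caller may fall back to a scan
        return max(candidates, key=lambda c: f(item, c))


index = BitSampleIndex(length=4)
for item in [b"1234", b"3456", b"6466"]:
    index.add(item)
print(index.query(b"1233", byte_similarity))
```

More tables reduce the chance of missing a close neighbor, and more bits per sample shrink the candidate set; both would need tuning against real data, and the result can still differ from what a full scan returns.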
Is there some way you could assign a 'score' to each datum?
You could then index or sort the data by score.
When you search, you assign a score to your search criteria and look for the item with the closest score.
Whether this will work depends very much on your data and your definition of "difference".
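As a rough Python sketch of the score idea: keep the items sorted by score, binary-search for the query's score, and let F rank only a few neighbours. The score function here (the numeric value of the string) and the `window` parameter are assumptions made for the toy example; whether a useful score exists for the real 4096-byte data is exactly the open question.

```python
import bisect


def digit_similarity(a: str, b: str) -> float:
    """Toy stand-in for F: percentage of positions where a and b agree."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


class ScoreIndex:
    """Items kept sorted by a one-dimensional score for O(log n) lookup."""

    def __init__(self, items, score):
        self.score = score
        pairs = sorted((score(item), item) for item in items)
        self.scores = [s for s, _ in pairs]
        self.items = [item for _, item in pairs]

    def closest(self, query, f, window: int = 2):
        """Binary-search the query's score, then rank nearby items with f."""
        if not self.items:
            raise ValueError("index is empty")
        if window < 1:
            raise ValueError("window must be at least 1")
        pos = bisect.bisect_left(self.scores, self.score(query))
        lo = max(0, pos - window)
        hi = min(len(self.items), pos + window)
        return max(self.items[lo:hi], key=lambda item: f(query, item))


# Hypothetical score: the numeric value of the digit string.
index = ScoreIndex(["1234", "3456", "6466"], score=int)
print(index.closest("1233", digit_similarity))  # prints 1234
```

The lookup is only as good as the score: if two items can be very similar under F while having distant scores, the true best match may fall outside the window.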