在数据集中查找相似记录
我有一个包含 25 个整数字段和 40k 条记录的数据集,例如
1:
field1: 0
field2: 3
field3: 1
field4: 2
[...]
field25: 1
2:
field1: 2
field2: 1
field3: 4
field4: 0
[...]
field25: 2
等。
我正在使用 MySQL 进行测试,但与它无关。
给定一条记录,我需要检索与其最相似的记录;类似于字段的最低平均差异。我开始查看以下内容,但我不知道如何将其映射到在大型数据集中搜索相似性的问题。
I have a dataset of 25 integer fields and 40k records, e.g.
1:
field1: 0
field2: 3
field3: 1
field4: 2
[...]
field25: 1
2:
field1: 2
field2: 1
field3: 4
field4: 0
[...]
field25: 2
etc.
I'm testing with MySQL but am not tied to it.
Given a single record, I need to retrieve the records most similar to it; something like the lowest average difference of the fields. I started looking at the following, but I don't know how to map this onto the problem of searching for similarities in a large dataset.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
我知道这是一篇旧文章,但对于任何寻求类似算法的人来说,余弦相似度是特别有效的算法。找到一种方法来矢量化您的记录,然后寻找它们之间角度最小的矢量。如果对记录进行矢量化并不简单,那么您可以通过某种已知算法对它们之间的相似性进行矢量化,然后查看相似性向量与完美匹配向量的余弦相似度(假设完美匹配不是目标,因为它们很容易无论如何找到)。通过这种匹配,我得到了巨大的结果,甚至可以比较诸如不同国家从事特定项目的人员名单以及对该项目的各种贡献之类的信息。矢量化意味着查看国家/地区匹配的数量、国家/地区不匹配、两个数据集之间匹配国家/地区的人口比例等。我使用字符串编辑距离函数(如 Levenshtein 距离)从字符串差异中获取数值,但可以使用语音匹配等。只要目标数字不为0(向量[0 0 ... 0]是ANY向量的子空间,因此它的角度将是未定义的。有时为了逃避问题,例如编辑的情况距离,我给完美匹配 (ed 0) 一个负权重,这样真正强调了完美匹配 -1 和 1 比 1 和 2 更远,这很有意义 - 完美匹配比任何带有 Even 的匹配都要好。 1 拼写错误。
Cos(theta) = (A dot B) / (Norm(A)*Norm(B)) 其中 dot 是点积,Norm 是向量的欧几里德幅度
祝你好运!
I know it's an old post, but for anyone who comes by it seeking similar algorithms, one that works particularly well is Cosine Similarity. Find a way to vectorize your records, then look for vectors with minimum angle between them. If vectorizing a record is not trivial, then you can vectorize similarity between them via some known algorithm, and then look at cosine similarity of the similarity vectors to the perfect match vector (assuming perfect matches aren't the goal since they're easy to find anyway). I get tremendous results with this matching even comparing things like lists of people in various countries working on a particular project with various contributions to the project. Vectorization implies looking at number of country matches, country mismatches, ratio of people in a matching country between two datasets, etc etc etc. I use string edit distance functions like Levenshtein distance for getting numeric value from string dissimilarities, but one could use phonetic matching, etc. As long as the target number is not 0 (vector [0 0 ... 0] is the subspace of ANY vector and thus its angle would be undefined. Sometimes to get away from the problem, such as the case of edit distance, I give a perfect match (e.d. 0) a negative weight, so that perfect matches are really emphasized. -1 and 1 are farther away than 1 and 2, which makes a lot of sense - perfect match is better than anything with even 1 misspelling.
Cos(theta) = (A dot B) / (Norm(A)*Norm(B)) where dot is the dot-product, and Norm is the Euclidian magnitude of the vector.
Good luck!
这是每个字段之间的直线平均距离的可能性(每个减号之后的值来自需要匹配的给定记录):
Here's a possibility with straight average distance between each of the fields (the value after each minus is from the given record needing a match):