在数据集中查找相似记录

发布于 2024-10-18 08:53:49 字数 812 浏览 3 评论 0原文

我有一个包含 25 个整数字段和 40k 条记录的数据集，例如

1:
  field1: 0
  field2: 3
  field3: 1
  field4: 2
  [...]
  field25: 1
2:
  field1: 2
  field2: 1
  field3: 4
  field4: 0
  [...]
  field25: 2

等。

我正在使用 MySQL 进行测试，但与它无关。

给定一条记录，我需要检索与其最相似的记录；类似于字段的最低平均差异。我开始查看以下内容，但我不知道如何将其映射到在大型数据集中搜索相似性的问题。

原文

I have a dataset of 25 integer fields and 40k records, e.g.

1:
  field1: 0
  field2: 3
  field3: 1
  field4: 2
  [...]
  field25: 1
2:
  field1: 2
  field2: 1
  field3: 4
  field4: 0
  [...]
  field25: 2

etc.

I'm testing with MySQL but am not tied to it.

Given a single record, I need to retrieve the records most similar to it; something like the lowest average difference of the fields. I started looking at the following, but I don't know how to map this onto the problem of searching for similarities in a large dataset.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

近箐 2024-10-25 08:53:49

我知道这是一篇旧文章，但对于任何寻求类似算法的人来说，余弦相似度是特别有效的算法。找到一种方法来矢量化您的记录，然后寻找它们之间角度最小的矢量。如果对记录进行矢量化并不简单，那么您可以通过某种已知算法对它们之间的相似性进行矢量化，然后查看相似性向量与完美匹配向量的余弦相似度（假设完美匹配不是目标，因为它们很容易无论如何找到）。通过这种匹配，我得到了巨大的结果，甚至可以比较诸如不同国家从事特定项目的人员名单以及对该项目的各种贡献之类的信息。矢量化意味着查看国家/地区匹配的数量、国家/地区不匹配、两个数据集之间匹配国家/地区的人口比例等。我使用字符串编辑距离函数（如 Levenshtein 距离）从字符串差异中获取数值，但可以使用语音匹配等。只要目标数字不为0（向量[0 0 ... 0]是ANY向量的子空间，因此它的角度将是未定义的。有时为了逃避问题，例如编辑的情况距离，我给完美匹配 (ed 0) 一个负权重，这样真正强调了完美匹配 -1 和 1 比 1 和 2 更远，这很有意义 - 完美匹配比任何带有 Even 的匹配都要好。 1 拼写错误。

Cos(theta) = (A dot B) / (Norm(A)*Norm(B)) 其中 dot 是点积，Norm 是向量的欧几里德幅度

祝你好运！

回复收藏 0 原文

執念 2024-10-25 08:53:49

这是每个字段之间的直线平均距离的可能性（每个减号之后的值来自需要匹配的给定记录）：

SELECT id,
(
  ABS(field1-2)
  + ABS(field2-2)
  + ABS(field3-3)
  + ABS(field4-1)
  + ABS(field5-0)
  + ABS(field6-3)
  + ABS(field7-2)
  + ABS(field8-0)
  + ABS(field9-1)
  + ABS(field10-0)
  + ABS(field11-2)
  + ABS(field12-2)
  + ABS(field13-3)
  + ABS(field14-2)
  + ABS(field15-0)
  + ABS(field16-1)
  + ABS(field17-0)
  + ABS(field18-2)
  + ABS(field19-3)
  + ABS(field20-1)
  + ABS(field21-0)
  + ABS(field22-1)
  + ABS(field23-3)
  + ABS(field24-2)
  + ABS(field25-2)
)/25
AS distance 
FROM mytable
ORDER BY distance ASC
LIMIT 20;

Here's a possibility with straight average distance between each of the fields (the value after each minus is from the given record needing a match):

SELECT id,
(
  ABS(field1-2)
  + ABS(field2-2)
  + ABS(field3-3)
  + ABS(field4-1)
  + ABS(field5-0)
  + ABS(field6-3)
  + ABS(field7-2)
  + ABS(field8-0)
  + ABS(field9-1)
  + ABS(field10-0)
  + ABS(field11-2)
  + ABS(field12-2)
  + ABS(field13-3)
  + ABS(field14-2)
  + ABS(field15-0)
  + ABS(field16-1)
  + ABS(field17-0)
  + ABS(field18-2)
  + ABS(field19-3)
  + ABS(field20-1)
  + ABS(field21-0)
  + ABS(field22-1)
  + ABS(field23-3)
  + ABS(field24-2)
  + ABS(field25-2)
)/25
AS distance 
FROM mytable
ORDER BY distance ASC
LIMIT 20;

回复收藏 0 原文

~没有更多了~