如何存储集合,快速找到相似的模式?

发布于 2024-07-12 00:32:51 字数 904 浏览 10 评论 0原文

(这不是家庭作业,也不是工作问题。这只是我个人的兴趣/职业,完全是虚构的。但我对好的算法或数据结构感兴趣。)

让我们假设,我将经营一个约会网站。 我的特色是单曲与电影品味相匹配。 (为什么不呢?)

在这种情况下,我需要一种方法来存储每个用户的电影评分。 (到目前为止没有问题。)我需要一个数据结构来找到最合适的用户。 两种口味模式之间的距离将是两个用户做出的所有评分之间的平均距离。

示例

movies   A B C D E F G H I J K L M ...
user Xm  9 5   1   1   5
user Ym      4 6 1         8
user Zf  9   6 4           7

距离(X,Z) = avg(abs(9-9) + abs(1-4) ) = 1.5

距离(Y,Z) = avg(abs(4-6) + abs (6-4) + abs(8-7) ) = 1.666

因此,X 先生比 Y 先生更适合 Z 女士。

我喜欢这样的解决方案...

  • ...不需要对数据库进行太多操作
  • ...不需要处理大量数据
  • ...运行速度快
  • ...提供最好的匹配
  • 好吧,也许我也会考虑好的近似值。

请记住,这也应该适用于数千个可能的电影、仅对 20-50 部电影进行评分的用户以及数千个用户。

(因为这是一个心理难题,而不是一个真正的问题,所以解决方法并没有真正的帮助。)

您的搜索算法或数据结构是什么?

(This is no homework and no work issue. It's just my personal interest/occupation and completly fictional. But I am interested in a good algorithm or data structure.)

Let's assume, that I would run a dating site. And my special feature would be that the singles were matched by movie taste. (Why not?)

In that case I would need a way to store the movie ratings for each user. (So far no problem.) And I would need a data structure to find the best fitting user. The distance between two taste patterns would be the average distance between all ratings that both users made.

Example

movies   A B C D E F G H I J K L M ...
user Xm  9 5   1   1   5
user Ym      4 6 1         8
user Zf  9   6 4           7

Distance(X,Z) = avg( abs(9-9) + abs(1-4) ) = 1.5

Distance(Y,Z) = avg( abs(4-6) + abs(6-4) + abs(8-7) ) = 1.666

So Mr. X fits slightly better to Mrs. Z, than Mr. Y does.

I like soulution that ...

  • ... don't need many operations on the database
  • ... don't need to handle a lot of data
  • ... run fast
  • ... deliver the best matching
  • Ok, maybe I would consider good approximations too.

Try to keep in mind that this should also work with thousands of possible movies, users that rate only about 20-50 movies, and thousands of users.

(Because this is a mental puzzle and not a real problem, work-arrounds are not really helping.)

What would be your search algorithm or data structure?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

紧拥背影 2024-07-19 00:32:51

听起来很像 Netflix 大奖 挑战,更具体地说是最流行方法的前半部分。 您尝试做的事情的可能实现是多种多样的。 它们都不是特别有效,并且 L1 度量对于可靠的相关性来说并不是一个特别好的选择。

Sounds a lot like the Netflix Prize challenge, more specifically the first half of the most popular approach. The possible implementations of what you are trying to do are numerous and varied. None of them are exceptionally efficient, and the L1 metric is not a particularly good option for reliable correlations.

小伙你站住 2024-07-19 00:32:51

看起来您正在寻找 最近邻居电影空间。 您的距离函数是 L1 指标。 您可能可以使用某种空间索引。 也许您可以使用协作过滤中的技术。

Looks like you are looking for the nearest neighbor in the movie space. And your distance function is the L1 metric. You can probably use a spatial index of some kind. Maybe you can use techniques from collaborative filtering.

末が日狂欢 2024-07-19 00:32:51
CREATE TABLE data (user INTEGER, movie INTEGER, rate INTEGER);

SELECT  other.user, AVG(ABS(d1.rate - d2.rate)) AS distance
FROM    data me, data other
WHERE   me.user = :user
    AND other.user <> me.user
    AND other.movie = me.movie
GROUP BY
    other.user
ORDER BY
    distance

复杂度将为 O(n1.5)),而不是 O(n2),因为将与 sqrt( 进行 n 次比较n) 电影(每对填充在一起的电影的平均值)。

CREATE TABLE data (user INTEGER, movie INTEGER, rate INTEGER);

SELECT  other.user, AVG(ABS(d1.rate - d2.rate)) AS distance
FROM    data me, data other
WHERE   me.user = :user
    AND other.user <> me.user
    AND other.movie = me.movie
GROUP BY
    other.user
ORDER BY
    distance

Complexity will be O(n1.5)) rather than O(n2), as there will be n comparisons to sqrt(n) movies (average of movies filled together by each pair).

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文