如何存储集合，快速找到相似的模式？

发布于 2024-07-12 00:32:51 字数 904 浏览 10 评论 0原文

（这不是家庭作业，也不是工作问题。这只是我个人的兴趣/职业，完全是虚构的。但我对好的算法或数据结构感兴趣。）

让我们假设，我将经营一个约会网站。我的特色是单曲与电影品味相匹配。（为什么不呢？）

在这种情况下，我需要一种方法来存储每个用户的电影评分。（到目前为止没有问题。）我需要一个数据结构来找到最合适的用户。两种口味模式之间的距离将是两个用户做出的所有评分之间的平均距离。

示例

movies   A B C D E F G H I J K L M ...
user Xm  9 5   1   1   5
user Ym      4 6 1         8
user Zf  9   6 4           7

距离(X,Z) = avg(abs(9-9) + abs(1-4) ) = 1.5

距离(Y,Z) = avg(abs(4-6) + abs (6-4) + abs(8-7) ) = 1.666

因此，X 先生比 Y 先生更适合 Z 女士。

我喜欢这样的解决方案...

...不需要对数据库进行太多操作
...不需要处理大量数据
...运行速度快
...提供最好的匹配
好吧，也许我也会考虑好的近似值。

请记住，这也应该适用于数千个可能的电影、仅对 20-50 部电影进行评分的用户以及数千个用户。

（因为这是一个心理难题，而不是一个真正的问题，所以解决方法并没有真正的帮助。）

您的搜索算法或数据结构是什么？

原文

(This is no homework and no work issue. It's just my personal interest/occupation and completly fictional. But I am interested in a good algorithm or data structure.)

Let's assume, that I would run a dating site. And my special feature would be that the singles were matched by movie taste. (Why not?)

In that case I would need a way to store the movie ratings for each user. (So far no problem.) And I would need a data structure to find the best fitting user. The distance between two taste patterns would be the average distance between all ratings that both users made.

Example

movies   A B C D E F G H I J K L M ...
user Xm  9 5   1   1   5
user Ym      4 6 1         8
user Zf  9   6 4           7

Distance(X,Z) = avg( abs(9-9) + abs(1-4) ) = 1.5

Distance(Y,Z) = avg( abs(4-6) + abs(6-4) + abs(8-7) ) = 1.666

So Mr. X fits slightly better to Mrs. Z, than Mr. Y does.

I like soulution that ...

... don't need many operations on the database
... don't need to handle a lot of data
... run fast
... deliver the best matching
Ok, maybe I would consider good approximations too.

Try to keep in mind that this should also work with thousands of possible movies, users that rate only about 20-50 movies, and thousands of users.

(Because this is a mental puzzle and not a real problem, work-arrounds are not really helping.)

What would be your search algorithm or data structure?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

紧拥背影 2024-07-19 00:32:51

听起来很像 Netflix 大奖挑战，更具体地说是最流行方法的前半部分。您尝试做的事情的可能实现是多种多样的。它们都不是特别有效，并且 L1 度量对于可靠的相关性来说并不是一个特别好的选择。

回复收藏 0 原文

小伙你站住 2024-07-19 00:32:51

看起来您正在寻找最近邻居电影空间。您的距离函数是 L1 指标。您可能可以使用某种空间索引。也许您可以使用协作过滤中的技术。

回复收藏 0 原文

末が日狂欢 2024-07-19 00:32:51

CREATE TABLE data (user INTEGER, movie INTEGER, rate INTEGER);

SELECT  other.user, AVG(ABS(d1.rate - d2.rate)) AS distance
FROM    data me, data other
WHERE   me.user = :user
    AND other.user <> me.user
    AND other.movie = me.movie
GROUP BY
    other.user
ORDER BY
    distance

复杂度将为 O(n^1.5))，而不是 O(n²)，因为将与 sqrt( 进行 n 次比较n) 电影（每对填充在一起的电影的平均值）。

CREATE TABLE data (user INTEGER, movie INTEGER, rate INTEGER);

SELECT  other.user, AVG(ABS(d1.rate - d2.rate)) AS distance
FROM    data me, data other
WHERE   me.user = :user
    AND other.user <> me.user
    AND other.movie = me.movie
GROUP BY
    other.user
ORDER BY
    distance

Complexity will be O(n^1.5)) rather than O(n²), as there will be n comparisons to sqrt(n) movies (average of movies filled together by each pair).

回复收藏 0 原文

~没有更多了~