How to prune a data set by frequency to match a paper's description

Posted 2024-09-05 05:11:25

The MovieLens data set provides a table with columns:

userid | movieid | tag | timestamp

I am having trouble reproducing the way the MovieLens data set was pruned in:

Tag Informed Collaborative Filtering, by Zhen, Li and Yeung

Section 4.1 (Data Set) of that paper says:
"For the tagging information, we only keep those tags which are added
on at least 3 distinct movies. As for the users, we only
keep those users who used at least 3 distinct tags in their
tagging history. For movies, we only keep those movies that
are annotated by at least 3 distinct tags."

I tried to query the database:

select TMP.userid, count(*) as tagnum
from (select distinct T.userid as userid, T.tag as tag from tags T) AS TMP 
group by TMP.userid
having tagnum >= 3;

This returns 1760 users who each used at least 3 distinct tags. However, some of
those tags appear on fewer than 3 distinct movies.

Any help is appreciated.

Comments (1)

っ左 2024-09-12 05:11:25

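Your query never limits the number of movies per tag. It seems like you should first discard tags that have not been used on at least three movies and by at least three users (the three-user condition goes beyond the quoted passage, so drop it to follow the paper literally). Then limit the result to users who have used at least three distinct tags.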

This query should give you the tags that are used by at least three users and on at least three movies, one row per tag:

-- one row per tag, with its distinct movie and user counts
select T1.tag,
       count( distinct T1.movieid ) as mcount,
       count( distinct T1.userid ) as ucount
from tags T1
group by T1.tag
having mcount >= 3 and ucount >= 3;
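To check the tag filter on something small before running it on the full data, here is a minimal sketch that runs a GROUP BY form of it against an in-memory SQLite table through Python's standard sqlite3 module. The toy rows are invented for illustration, and the table layout is assumed to match the userid | movieid | tag columns from the question (timestamp is left out because the filter does not use it).

```python
import sqlite3

# Toy rows (userid, movieid, tag); all values are invented for illustration.
rows = [
    (1, 10, "funny"), (2, 11, "funny"), (3, 12, "funny"),  # 3 movies, 3 users
    (1, 10, "dark"), (1, 11, "dark"), (1, 12, "dark"),     # 3 movies, 1 user
    (1, 10, "slow"), (2, 10, "slow"), (3, 10, "slow"),     # 1 movie, 3 users
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (userid INTEGER, movieid INTEGER, tag TEXT)")
conn.executemany("INSERT INTO tags VALUES (?, ?, ?)", rows)

# Keep tags used on >= 3 distinct movies and by >= 3 distinct users.
query = """
    SELECT tag
    FROM tags
    GROUP BY tag
    HAVING COUNT(DISTINCT movieid) >= 3
       AND COUNT(DISTINCT userid) >= 3
    ORDER BY tag
"""
kept_tags = [tag for (tag,) in conn.execute(query)]
print(kept_tags)  # prints ['funny']
```

Only "funny" passes both conditions: "dark" fails the user count and "slow" fails the movie count.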

If you instead query by user and use the tag filter as a subquery, you can find the users who used at least three distinct tags among those that survive the filter:

-- count each user's distinct tags, considering only tags that pass the filter
select T4.userid, count( distinct T4.tag ) as tagnum
from tags T4
where T4.tag in
 (select T1.tag
  from tags T1
  group by T1.tag
  having count( distinct T1.movieid ) >= 3
     and count( distinct T1.userid ) >= 3)
group by T4.userid
having tagnum >= 3;
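The quoted passage also has a rule for movies, and removing rows under one rule can push a tag, user, or movie below the threshold of another. The sketch below applies only the three rules from the quoted passage (without the extra three-user condition on tags) and repeats them until no row is deleted. Whether the paper's authors applied the rules once or repeatedly is not stated in the passage, so the loop is an assumption, as are the function name `prune` and the toy rows.

```python
import sqlite3

MIN_COUNT = 3  # threshold taken from the quoted passage


def prune(conn, min_count=MIN_COUNT):
    """Delete rows that violate any of the three rules until nothing changes."""
    # (column being filtered, column whose distinct values are counted).
    # These names are fixed here, never user input, so formatting them
    # into the SQL string is safe.
    rules = [
        ("tag", "movieid"),   # tags on at least min_count distinct movies
        ("userid", "tag"),    # users with at least min_count distinct tags
        ("movieid", "tag"),   # movies with at least min_count distinct tags
    ]
    while True:
        deleted = 0
        for key, counted in rules:
            cur = conn.execute(
                f"DELETE FROM tags WHERE {key} IN ("
                f"  SELECT {key} FROM tags GROUP BY {key}"
                f"  HAVING COUNT(DISTINCT {counted}) < ?)",
                (min_count,),
            )
            deleted += cur.rowcount
        if deleted == 0:
            return


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (userid INTEGER, movieid INTEGER, tag TEXT)")
# Invented data: a dense core that satisfies every rule ...
core = [(u, m, t) for u in (1, 2, 3) for m in (10, 11, 12) for t in "abc"]
# ... plus tag "d", which is on 3 movies only as long as movie 13 survives.
extra = [(1, 10, "d"), (1, 11, "d"), (1, 13, "d")]
conn.executemany("INSERT INTO tags VALUES (?, ?, ?)", core + extra)

prune(conn)
remaining_tags = sorted(t for (t,) in conn.execute("SELECT DISTINCT tag FROM tags"))
remaining_rows = conn.execute("SELECT COUNT(*) FROM tags").fetchone()[0]
print(remaining_tags, remaining_rows)  # prints ['a', 'b', 'c'] 27
```

In this toy data a single pass is not enough: tag "d" passes the tag rule at first, movie 13 is then removed for having only one tag, and only the second pass notices that "d" is now on two movies. Note that MySQL does not allow a DELETE whose subquery reads the table being deleted from, so there the same idea would need a temporary table or a derived-table wrapper.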