How to prune a data set by frequency to match a paper's description
The MovieLens data set provides a table with columns:
userid | movieid | tag | timestamp
I have trouble reproducing the way they pruned the MovieLens data set used in:
Tag Informed Collaborative Filtering, by Zhen, Li and Yeung
Section 4.1 (Data Set) of the above paper says:
"For the tagging information, we only keep those tags which are added
on at least 3 distinct movies. As for the users, we only
keep those users who used at least 3 distinct tags in their
tagging history. For movies, we only keep those movies that
are annotated by at least 3 distinct tags."
I tried to query the database:
select TMP.userid, count(*) as tagnum
from (select distinct T.userid as userid, T.tag as tag from tags T) AS TMP
group by TMP.userid
having tagnum >= 3;
I got a list of 1760 users who each used at least 3 distinct tags. However, some of
those tags were not added to at least 3 distinct movies.
Any help is appreciated.
Comments (1)
You aren't limiting the movies per tag anywhere. It seems like you should first discard tags that have not been used on at least three movies and by at least three users, then limit to users who have tagged three times.
Grouping by tag and counting distinct users and distinct movies should give you the tags that are used by three or more users and on three or more movies.
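A minimal sketch of such a tag filter, assuming the table is named `tags` with the columns from the question. The SQL is wrapped in Python's `sqlite3` only so it can be run against a toy data set; the rows and the helper name `frequent_tags` are made up for illustration. Note that the three-user condition comes from this answer, not from the passage quoted from the paper, so drop the second `HAVING` term to follow the paper's wording alone.

```python
import sqlite3

# Toy rows in the question's layout: (userid, movieid, tag, timestamp).
ROWS = [
    (1, 10, "funny", 0), (2, 11, "funny", 0), (3, 12, "funny", 0),  # 3 movies, 3 users
    (1, 10, "dark", 0), (1, 11, "dark", 0), (1, 12, "dark", 0),     # 3 movies, 1 user
    (1, 10, "slow", 0), (2, 10, "slow", 0), (3, 10, "slow", 0),     # 1 movie, 3 users
]

# Keep tags applied to at least 3 distinct movies and by at least 3 distinct users.
FREQUENT_TAGS_SQL = """
SELECT tag
FROM tags
GROUP BY tag
HAVING COUNT(DISTINCT movieid) >= 3
   AND COUNT(DISTINCT userid) >= 3
ORDER BY tag
"""


def frequent_tags(rows):
    """Load rows into an in-memory table and return the tags that pass the filter."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tags (userid INTEGER, movieid INTEGER, tag TEXT, timestamp INTEGER)"
    )
    conn.executemany("INSERT INTO tags VALUES (?, ?, ?, ?)", rows)
    result = [tag for (tag,) in conn.execute(FREQUENT_TAGS_SQL)]
    conn.close()
    return result


print(frequent_tags(ROWS))  # ['funny']
```

Using `COUNT(DISTINCT ...)` avoids the extra `SELECT DISTINCT` subquery from the question, since repeated rows for the same tag and movie are counted once.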
If you instead query by user and use the whole thing as a subquery, you should be able to find the users who have also tagged at least three times.
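The subquery idea might look like the sketch below, again assuming a table named `tags` and using Python's `sqlite3` with made-up rows so it can be run as-is. It counts distinct tags per user, as in the paper's wording, rather than raw tagging events; the helper name `users_with_frequent_tags` is hypothetical. The paper's third condition (movies annotated by at least 3 distinct tags) is not covered here.

```python
import sqlite3

# Toy rows: (userid, movieid, tag, timestamp).
# Users 1-3 each use tags a, b, c, spread over movies 10, 11, 12.
# User 4 uses three distinct tags, but only "a" is a frequent tag.
ROWS = [
    (1, 10, "a", 0), (1, 11, "b", 0), (1, 12, "c", 0),
    (2, 11, "a", 0), (2, 12, "b", 0), (2, 10, "c", 0),
    (3, 12, "a", 0), (3, 10, "b", 0), (3, 11, "c", 0),
    (4, 10, "a", 0), (4, 10, "rare1", 0), (4, 11, "rare2", 0),
]

# Inner query: tags on >= 3 distinct movies and by >= 3 distinct users.
# Outer query: users who used >= 3 distinct tags among those surviving tags.
USERS_SQL = """
SELECT userid, COUNT(DISTINCT tag) AS tagnum
FROM tags
WHERE tag IN (
    SELECT tag
    FROM tags
    GROUP BY tag
    HAVING COUNT(DISTINCT movieid) >= 3
       AND COUNT(DISTINCT userid) >= 3
)
GROUP BY userid
HAVING COUNT(DISTINCT tag) >= 3
ORDER BY userid
"""


def users_with_frequent_tags(rows):
    """Return (userid, tagnum) pairs for users with >= 3 distinct frequent tags."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tags (userid INTEGER, movieid INTEGER, tag TEXT, timestamp INTEGER)"
    )
    conn.executemany("INSERT INTO tags VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(USERS_SQL).fetchall()
    conn.close()
    return result


print(users_with_frequent_tags(ROWS))  # [(1, 3), (2, 3), (3, 3)]
```

User 4 would pass the query from the question (three distinct tags) but is excluded here, because two of those tags do not survive the tag filter.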