r group-by dataframe similarity

Computing similarity between duplicated rows using a unique identifier

Posted on 2025-01-17 16:12:26

I have a data set that looks like this, where id is supposed to be the unique identifier. Some rows are duplicates: rows 1 and 3 share the same id and year, whereas rows 1 and 6 (or 3 and 6) do not count as duplicates because the year differs. The variable dupfreq gives the number of rows in the data set with the same id and year, including the row itself.

	id	year	tlabor	rev	dupfreq
1	1419	2005	5	1072	2
2	1425	2005	42	2945	1
3	1419	2005	4	950	2
4	1443	2006	18	3900	1
5	1485	2006	118	35034	1
6	1419	2006	6	1851	1
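
For anyone who wants to reproduce the table, here is a minimal base R sketch. It assumes dupfreq is simply the number of rows sharing the same id and year, which matches the values shown above; the data frame name `df` is only illustrative.

```r
# Rebuild the example table from the question (without dupfreq)
df <- data.frame(
  id     = c(1419, 1425, 1419, 1443, 1485, 1419),
  year   = c(2005, 2005, 2005, 2006, 2006, 2006),
  tlabor = c(5, 42, 4, 18, 118, 6),
  rev    = c(1072, 2945, 950, 3900, 35034, 1851)
)

# Assumption: dupfreq counts the rows with the same id and year,
# including the row itself. ave() applies length() within each group.
df$dupfreq <- ave(seq_len(nrow(df)), df$id, df$year, FUN = length)

print(df)
```

Only rows 1 and 3 end up with dupfreq equal to 2 under this definition.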

I want to check row similarity (on tlabor and rev) for the rows with dupfreq > 1, grouped by id and year.

I was thinking of something similar to this:

	id	year	sim
1	1419	2005	0.83

Note that dupfreq can be greater than 2, but if I can only generate the new table from rows with dupfreq == 2, I am OK with that too.
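
One possible way to build such a table in base R is sketched below. The question does not say how sim is defined, so the measure here is an assumption: for each pair of rows, take the smaller value divided by the larger value in each of tlabor and rev, then average the ratios. Groups with more than two rows are handled by averaging over all row pairs. The helper names `pair_sim` and `group_sim` are made up for illustration.

```r
# Example data from the question
df <- data.frame(
  id      = c(1419, 1425, 1419, 1443, 1485, 1419),
  year    = c(2005, 2005, 2005, 2006, 2006, 2006),
  tlabor  = c(5, 42, 4, 18, 118, 6),
  rev     = c(1072, 2945, 950, 3900, 35034, 1851),
  dupfreq = c(2, 1, 2, 1, 1, 1)
)

# Assumed similarity of two rows: per column, smaller / larger, then averaged.
# Only meaningful for strictly positive values (0/0 would give NaN).
pair_sim <- function(a, b) mean(pmin(a, b) / pmax(a, b))

# Mean similarity over all row pairs of one group (needs at least 2 rows)
group_sim <- function(g) {
  m <- as.matrix(g[, c("tlabor", "rev")])
  pairs <- combn(nrow(m), 2)  # each column holds one pair of row indices
  mean(apply(pairs, 2, function(p) pair_sim(m[p[1], ], m[p[2], ])))
}

# Keep duplicated rows only, then split them by id and year
dups   <- df[df$dupfreq > 1, ]
groups <- split(dups, list(dups$id, dups$year), drop = TRUE)

# One output row per (id, year) group
result <- do.call(rbind, lapply(groups, function(g) {
  data.frame(id = g$id[1], year = g$year[1], sim = group_sim(g))
}))
rownames(result) <- NULL

print(result)
```

With this assumed measure the 1419/2005 group scores about 0.84 (the mean of 4/5 and 950/1072), which is near but not equal to the 0.83 in the sketch above, so the intended formula may differ. Swapping in another measure, for example a correlation or a distance-based score, only requires changing `pair_sim`.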

Any advice is greatly appreciated! Thanks in advance!
