Calculating similarity between duplicate variables using a unique identifier
I have a data set that looks like this, where id is supposed to be the unique identifier. There are duplicates, for example lines 1 and 3, but lines 1 and 6 or 3 and 6 are not duplicates because the year differs. The variable dupfreq shows how many rows in the data set share the same id and year, including the row itself.
line | id | year | tlabor | rev | dupfreq |
---|---|---|---|---|---|
1 | 1419 | 2005 | 5 | 1072 | 2 |
2 | 1425 | 2005 | 42 | 2945 | 1 |
3 | 1419 | 2005 | 4 | 950 | 2 |
4 | 1443 | 2006 | 18 | 3900 | 1 |
5 | 1485 | 2006 | 118 | 35034 | 1 |
6 | 1419 | 2006 | 6 | 1851 | 1 |
I want to compute the row similarity (based on tlabor and rev) for rows with dupfreq > 1, grouped by id and year.
I was thinking of something similar to this:
line | id | year | sim |
---|---|---|---|
1 | 1419 | 2005 | 0.83 |
Note that dupfreq can be greater than 2, but if the new table can only be generated from rows with dupfreq == 2, I am OK with that too.
Any advice is greatly appreciated! Thanks in advance!
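One possible way to approach this, as a minimal sketch in plain Python: the question does not say which similarity measure produced the 0.83, so the metric here (`ratio_similarity`, the mean of the min/max ratios of tlabor and rev) is purely an assumption and gives 0.84 rather than 0.83 for the sample. The function names are hypothetical. Groups with more than two duplicates are handled by averaging over all pairs.

```python
from collections import defaultdict
from itertools import combinations

# Sample rows from the question: (line, id, year, tlabor, rev, dupfreq)
rows = [
    (1, 1419, 2005, 5, 1072, 2),
    (2, 1425, 2005, 42, 2945, 1),
    (3, 1419, 2005, 4, 950, 2),
    (4, 1443, 2006, 18, 3900, 1),
    (5, 1485, 2006, 118, 35034, 1),
    (6, 1419, 2006, 6, 1851, 1),
]

def ratio_similarity(a, b):
    """Assumed metric (not from the question): mean of min/max ratios.

    Expects non-negative values; 1.0 means the two rows are identical.
    """
    ratios = [min(x, y) / max(x, y) if max(x, y) else 1.0 for x, y in zip(a, b)]
    return sum(ratios) / len(ratios)

def group_similarity(rows):
    """Average pairwise similarity of (tlabor, rev) per (id, year) group."""
    groups = defaultdict(list)
    for _line, id_, year, tlabor, rev, dupfreq in rows:
        if dupfreq > 1:  # keep only rows flagged as duplicates
            groups[(id_, year)].append((tlabor, rev))

    result = {}
    for key, members in groups.items():
        pairs = list(combinations(members, 2))
        if pairs:  # skip groups that ended up with a single row
            mean_sim = sum(ratio_similarity(a, b) for a, b in pairs) / len(pairs)
            result[key] = round(mean_sim, 2)
    return result

sim = group_similarity(rows)
print(sim)  # {(1419, 2005): 0.84}
```

Swapping in a different measure (for example a correlation or a distance-based score) only requires replacing `ratio_similarity`; the grouping by id and year stays the same.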