Flag duplicates by time difference and generate an id

Posted on 2024-12-26 20:37:03

OK, I am just starting out with R and am somewhat stuck at the moment. I have a dataset with election results, and the only identifier for a person is a string variable with his/her name. Many politicians appear more than once because they take part in more than one election.

I want to generate an id to identify each politician. However, some names are common and actually identify different persons. I want to single out these cases by looking at the time difference between occurrences, i.e. if there are more than 30 years between appearances, the same name belongs to a different person.

I have computed the difference between consecutive occurrences, and each time that difference is larger than 30 years, I want to record that all subsequent occurrences belong to a different person. I have dabbled with loops but didn't get them to work the way I wanted, and I guess there is a more idiomatic way to solve this.

Then I want to create a unique id for each person using the name variable and that record, but I guess this can simply be done using the id() function.

df <- df[order(df$name, df$year),]

# difference between consecutive occurrences, NA for the first occurrence
df$timediff <- ave(df$year, df$name, FUN = function(x) c(NA, diff(x)))

# absolute difference to the first occurrence, haven't used this so far
df$timediff.abs <- ave(df$year, df$name, FUN = function(x) x - x[1])

踏月而来 2025-01-02 20:37:03

You can reorder the data by name and year and then compare each row with the one before it. If the name is new, it is a new person. If the name is the same but the gap is more than 30 years, it is also a new person. If the name is the same and the gap is at most 30 years, it is the same person. Because the data is sorted, a negative gap in years can only occur where the name has changed, so that row is obviously a new person.

Concisely, if either the name changes or the name stays the same but the gap exceeds 30 years, you do not assume the same identity as for the previous row. (Conversely, whenever you do not assume the same identity, you increment your unique identifier.)
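To make the "increment the identifier" step concrete, here is a minimal sketch with made-up flags (the vector `starts_new` is a hypothetical name, not part of the question's data). Each TRUE marks a row that is not assumed to be the same person as the previous row, and a cumulative sum turns those flags into a running id:

```r
# TRUE  = this row starts a new person
# FALSE = this row continues the person from the previous row
starts_new <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

# cumsum() treats TRUE as 1 and FALSE as 0, so the running total
# goes up by one exactly where a new person starts.
id <- cumsum(starts_new)
print(id)  # 1 1 1 2 2
```

The first row always has to be flagged TRUE, otherwise the ids would start at 0.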

Here is an example that assigns a unique identifier, using the above rules.

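set.seed(0)
n <- 100
d <- sample(1900:2000, n, replace = TRUE)   # random election years
v <- sample(letters, n, replace = TRUE)     # random one-letter "names"
t1 <- data.frame(v, d)
t2 <- t1[order(t1$v, t1$d), ]               # sort by name, then by year
# TRUE when a row has the same name as the row before it
t2$sameName <- c(FALSE, t2$v[-1] == t2$v[-n])
t2$diffYrs <- c(0, diff(t2$d))              # gap in years to the previous row
# a gap of at most 30 years counts as the same person; a negative gap means the name changed
t2$close <- (t2$diffYrs >= 0) & (t2$diffYrs <= 30)
t2$keepPerson <- t2$sameName & t2$close
t2$identifier <- cumsum(!t2$keepPerson)     # increment whenever the person changes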
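The same idea can probably be applied directly to the `timediff` column computed in the question. The following is only a sketch: the toy data frame is invented for illustration, the column names `name` and `year` are taken from the question, and `new_person` and `id` are hypothetical names. It uses the question's threshold, where more than 30 years between appearances means a different person.

```r
# Invented toy data with the question's column names.
df <- data.frame(
  name = c("Smith", "Smith", "Smith", "Jones", "Jones", "Smith"),
  year = c(1950, 1954, 1990, 1960, 1964, 1958),
  stringsAsFactors = FALSE
)
df <- df[order(df$name, df$year), ]

# Gap to the previous appearance of the same name; NA for the first one.
df$timediff <- ave(df$year, df$name, FUN = function(x) c(NA, diff(x)))

# A row starts a new person if it is the first row for its name (NA gap)
# or if more than 30 years have passed since the previous appearance.
new_person <- is.na(df$timediff) | df$timediff > 30

# The running count of "new person" flags gives one integer id per person.
df$id <- cumsum(new_person)
print(df$id)  # 1 1 2 2 2 3
```

Because `ave()` already restarts the difference at NA for each name, no separate check for a name change is needed here; the NA does that job.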
