R: Search for SIMILAR strings and return a conditional symbol

Posted on 2024-11-17 21:26:14

My df has following entries:

A
xxx
xxx
xxx1
xx1x
yyyy
gggg

I want to add a symbol column B to my df based on the similarity of the entries in column A, under the following conditions:

My threshold is 75% similarity or more.
Column A is already sorted, so each entry only needs to be compared with the ONE above it.
If the entry above is similar, the symbol is copied from that entry's column B.
If the entry above is dissimilar, the symbol is copied from the same row's column A.

For instance, row 1 and row 2 are the same, so their symbol is the same as column A. Row 3 is 75% similar to row 1 and row 2 (3 out of 4 letters are the same and in the same sequence), so its symbol in column B is copied from the row above, i.e. xxx. As xx1x (row 4) has only 2 out of 4 letters in common with row 3, it just uses its own symbol, i.e. xx1x. Since yyyy and gggg are totally different, they keep their own symbols as in column A.

Thus, my final result should be like this:

A      B
xxx    xxx
xxx    xxx
xxx1   xxx
xx1x   xx1x
yyyy   yyyy
gggg   gggg

I worked out this similarity % by guesswork, so it does not need to be used if there is a formal method for string similarity search. If R has a formal method for checking string similarity, it would be nice to use it.

Could you show me how to add this symbol column efficiently with R?

玩世 2024-11-24 21:26:14

Set up data:

x <- c("xxx", "xxx", "xxx1", "xx1x", "yyyy", "gggg")

The code:

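# TRUE where an entry approximately matches the entry directly above it
same <- sapply(seq_len(length(x)-1),
  function(i)any(agrep(x[i+1], x[i], max.distance=0.25)))
ex <- embed(x, 2)  # column 1: current entry, column 2: entry above
cbind(A=x, B=c(x[1], ifelse(same, ex[, 2], ex[, 1])))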

The result:

     A      B     
[1,] "xxx"  "xxx" 
[2,] "xxx"  "xxx" 
[3,] "xxx1" "xxx" 
[4,] "xx1x" "xxx1"
[5,] "yyyy" "yyyy"
[6,] "gggg" "gggg"

Why does it work?

Some key concepts and really helpful functions:

Firstly, agrep provides a test for how similar strings are, using the generalized Levenshtein edit distance, which counts the single-character insertions, deletions and substitutions needed to transform one string into another. The parameter max.distance=0.25 allows an edit distance of up to 25% of the pattern length, rounded up to a whole number. Note that agrep looks for an approximate match of the pattern anywhere inside the target string, not only against the whole string.

For example, testing which of the original strings are similar to "xxx" returns the indices 1:4:

agrep("xxx", x, max.distance=0.25)
[1] 1 2 3 4
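
If a whole-string similarity score is wanted instead of agrep's match/no-match answer, one option is adist from the utils package, which returns the generalized Levenshtein distance itself. The sketch below is not part of the original answer; dividing by the length of the longer string is an assumption made here to obtain a percentage, and other normalisations are possible.

```r
x <- c("xxx", "xxx", "xxx1", "xx1x", "yyyy", "gggg")

# Pair every entry (from the second onwards) with the entry above it
curr <- tail(x, -1)
prev <- head(x, -1)

# Edit distance between each entry and the one above it
d <- mapply(function(a, b) adist(a, b)[1, 1], curr, prev, USE.NAMES = FALSE)

# Assumed normalisation: 1 minus the distance as a share of the longer string
sim <- 1 - d / pmax(nchar(curr), nchar(prev))

# Apply the 75% threshold from the question
similar_to_above <- sim >= 0.75
```

With this measure xxx1 is 75% similar to xxx, while xx1x is only 50% similar to xxx1, which matches the percentages the question arrived at by guesswork.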

Secondly, embed provides a useful way of testing lagged variables. For example, embed(x, 2) turns x into a lagged array. This makes it easy to compare x[1] to x[2], since they are now on the same row of the array:

embed(x, 2)
     [,1]   [,2]  
[1,] "xxx"  "xxx" 
[2,] "xxx1" "xxx" 
[3,] "xx1x" "xxx1"
[4,] "yyyy" "xx1x"
[5,] "gggg" "yyyy"
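
For readers who find embed opaque, the same lagged layout can be built with plain negative indexing. This is only an illustrative alternative, and the column names current and previous are made up here.

```r
x <- c("xxx", "xxx", "xxx1", "xx1x", "yyyy", "gggg")

# Column 1 holds each entry from the second onwards, column 2 the entry above it
lagged <- cbind(current = x[-1], previous = x[-length(x)])

# Check that the values agree with embed(x, 2) once the column names are dropped
same_values <- identical(unname(lagged), embed(x, 2))
```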

Finally, I use cbind and vector subsetting to stitch together the original vector and the new vector.

To make this work on a data frame rather than a vector, I turned the code into a function as follows:

df <- data.frame(A=c("xxx", "xxx", "xxx1", "xx1x", "yyyy", "gggg"))

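f <- function(x){
  x <- as.character(x)  # also handles factor columns
  # TRUE where an entry approximately matches the entry directly above it
  same <- sapply(seq_len(length(x)-1),
      function(i)any(agrep(x[i+1], x[i], max.distance=0.25)))
  ex <- embed(x, 2)  # column 1: current entry, column 2: entry above
  c(x[1], ifelse(same, ex[, 2], ex[, 1]))
}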
df$B <- f(df$A)
df

     A    B
1  xxx  xxx
2  xxx  xxx
3 xxx1  xxx
4 xx1x xxx1
5 yyyy yyyy
6 gggg gggg
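
Row 4 above comes out as xxx1 rather than the xx1x the question asked for, presumably because agrep accepts an approximate match inside the string. The result also takes the symbol from column A of the row above rather than from column B. A minimal sketch of a variant that follows the question's rules more literally is given below; add_symbol is a hypothetical helper, the whole-string normalisation by the longer string is an assumption, and the strings are assumed to be non-empty and non-NA.

```r
# Hypothetical helper, not part of the original answer
add_symbol <- function(a, threshold = 0.75) {
  a <- as.character(a)
  b <- a                                    # default: each row keeps its own symbol
  for (i in seq_along(a)[-1]) {
    d <- adist(a[i], a[i - 1])[1, 1]        # edit distance to the entry above
    sim <- 1 - d / max(nchar(a[i]), nchar(a[i - 1]))
    if (sim >= threshold) b[i] <- b[i - 1]  # inherit the symbol from column B above
  }
  b
}

df <- data.frame(A = c("xxx", "xxx", "xxx1", "xx1x", "yyyy", "gggg"),
                 stringsAsFactors = FALSE)
df$B <- add_symbol(df$A)
```

Because each row reads b[i - 1], a run of similar entries all inherit the symbol of the first entry in the run, so the loop cannot be replaced by a single vectorised ifelse without changing that behaviour.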

忆梦 2024-11-24 21:26:14

Here's a more 'basic' solution (edited to fix some problems raised in the comments):

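dat <- data.frame(A=c('xxx','xxx','xxx1','xx1x','yyyy','gggg'),
                  stringsAsFactors=FALSE)  # keep A as character in R < 4.0
dat$B <- rep(NA_character_,nrow(dat))

tmp <- strsplit(dat$A,"")  # split every entry into single characters
dat$B[1] <- dat$A[1]
for (i in seq_along(tmp)[-1]){
    n <- min(length(tmp[[i]]),length(tmp[[i-1]]))
    # share of positions matching the entry above, relative to the current entry's length
    x <- sum(tmp[[i]][seq_len(n)] == tmp[[i-1]][seq_len(n)]) / length(tmp[[i]])
    if (x >= 0.75){
        dat$B[i] <- dat$B[i-1]  # similar: copy the symbol from column B of the row above
    }
    else{ dat$B[i] <- dat$A[i] }  # dissimilar: keep the row's own entry
}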
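
The similarity used in the loop can be pulled out into a small function to see what it returns for individual pairs. The name pos_sim is made up for this illustration; the calculation mirrors the loop above.

```r
# Hypothetical helper: position-by-position similarity of cur to prev
pos_sim <- function(cur, prev) {
  a <- strsplit(cur, "")[[1]]
  b <- strsplit(prev, "")[[1]]
  n <- min(length(a), length(b))
  # matching positions, as a share of the current string's length
  sum(a[seq_len(n)] == b[seq_len(n)]) / length(a)
}

s1 <- pos_sim("xxx1", "xxx")   # 3 of 4 positions match
s2 <- pos_sim("xx1x", "xxx1")  # 2 of 4 positions match
s3 <- pos_sim("xxx", "xxx1")   # 3 of 3 positions match
```

Two properties follow from this definition. The measure is asymmetric, since it divides by the length of the current string only, so xxx against xxx1 scores higher than xxx1 against xxx. It is also purely positional, so a single inserted character early in a string shifts every later position and can drive the score close to zero, which an edit-distance measure would not do.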
