stringr :: str_detect的模糊版本用于过滤dataframe

发布于 2025-01-23 02:09:36 字数 1626 浏览 0 评论 0原文

我有一个具有免费文本字段的数据库，我想将其用于过滤器 a data.frame或tibble。我可能可以在大量工作中创建一个目前发生在数据中发生的搜索词的所有可能拼写的列表（请参阅下面一个术语的所有拼写的示例），然后我只能使用stringr ::::: str_detect如下示例代码中。但是，如果将来可能会有更多的拼写错误，这将不安全。如果我愿意接受一些限制 /做出一些假设（例如，拼写错误之间的编辑距离可能是多远，或者在某些其他区别上，人们不会使用完全不同的术语等），是否有一些简单的解决方案，用于制作str_detect的模糊版本？

据我所知，例如StringDist之类的明显软件包似乎没有直接执行此操作的函数。我想我可以编写自己的函数，该功能应用于StringDist :: afind或stringdist :: Amatch 到向量的每个元素，并后处理结果以最终返回结果true或false booleans的向量，但是我想知道此功能是否不存在某个地方（并且比我更有效地实现）。

这是一个示例，说明了如何使用str_detect我想念我想要的一行：

library(tidyverse)

search_terms = c("preclinical", "Preclincal", "Preclincial", "Preclinial", 
                 "Precllinical", "Preclilnical", "Preclinica", "Preclnical", 
                 "Peclinical", "Prclinical", "Peeclinical", "Pre clinical", 
                 "Precclinical", "Preclicnial", "Precliical", "Precliinical", 
                 "Preclinal", "Preclincail", "Preclinicgal", "Priclinical")

example_data = tibble(project=c("A111", "A123", "B112", "A224", "C149"),
                      disease_phase=c("Diabetes, Preclinical", "Lipid lowering, Perlcinical", 
                                      "Asthma, Phase I", "Phase II; Hypertension", "Phase 3"),
                      startdate = c("01DEC2018", "17-OKT-2017", "11/15/2019", "1. Dezember 2004", "2005-11-30")) 

# Finds only project A111, but not A123
example_data %>%
  filter(str_detect(tolower(disease_phase), paste0(tolower(search_terms), collapse="|")))

原文

I've got a database with free text fields that I want to use to filter a data.frame or tibble. I could perhaps with lots of work create a list of all possible misspellings of my search terms that currently occur in the data (see example of all the spellings I had of one term below) and then I could just use stringr::str_detect as in the example code below. However, this will not be safe when there might be more misspellings in the future. If I'm willing to accept some limitations / make some assumptions (e.g. how far the edit distance between the misspellings could be, or in terms of some other difference, that people won't use completely different terms etc.), is there some simple solution for doing a fuzzy version of str_detect?

As far as I could see the obvious packages like stringdist do not seem to have a function that directly does this. I guess I could write my own function that applies something like stringdist::afind or stringdist::amatch to each element of a vector and post-processes the results to eventually return a vector of TRUE or FALSE booleans, but I wonder whether this function does not exist somewhere (and is more efficiently implemented than I would do it).

Here's an example that illustrates how with str_detect I might miss one row I would want:

library(tidyverse)

search_terms = c("preclinical", "Preclincal", "Preclincial", "Preclinial", 
                 "Precllinical", "Preclilnical", "Preclinica", "Preclnical", 
                 "Peclinical", "Prclinical", "Peeclinical", "Pre clinical", 
                 "Precclinical", "Preclicnial", "Precliical", "Precliinical", 
                 "Preclinal", "Preclincail", "Preclinicgal", "Priclinical")

example_data = tibble(project=c("A111", "A123", "B112", "A224", "C149"),
                      disease_phase=c("Diabetes, Preclinical", "Lipid lowering, Perlcinical", 
                                      "Asthma, Phase I", "Phase II; Hypertension", "Phase 3"),
                      startdate = c("01DEC2018", "17-OKT-2017", "11/15/2019", "1. Dezember 2004", "2005-11-30")) 

# Finds only project A111, but not A123
example_data %>%
  filter(str_detect(tolower(disease_phase), paste0(tolower(search_terms), collapse="|")))

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

星軌x 2025-01-30 02:09:36

您可以使用conseppl进行 base 中的近似字符串匹配（模糊匹配）。

example_data[agrep(paste(search_terms, collapse = "|"),
  example_data$disease_phase, 2, ignore.case=TRUE, fixed=FALSE),]
#  project               disease_phase   startdate
#1    A111       Diabetes, Preclinical   01DEC2018
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017

或使用Repard而不是|在正则> | 。

example_data[Reduce(\(y, x) y | agrepl(x, example_data$disease_phase, 2,
  ignore.case=TRUE), search_terms, FALSE),]
#  project               disease_phase   startdate
#1    A111       Diabetes, Preclinical   01DEC2018
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017

替代方案可能是adist，也可以在 base 中计算距离矩阵 - 因此，对于较大的向量，可能不建议使用矩阵，因为矩阵可以变大。在这里，我还选择了2个字符的不匹配。

example_data[colSums(adist(unique(search_terms), example_data$disease_phase,
                           partial=TRUE) < 3) > 0,]
#  project               disease_phase   startdate
#1    A111       Diabetes, Preclinical   01DEC2018
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017

如果仅比较单词单词可能会更有效，因此在 base sive> sive_phase 分为单词 >。

. <- strsplit(example_data$disease_phase, "[ ,;]+")
. <- split(rep(seq_along(.), lengths(.)), tolower(unlist(.)))
example_data[unique(unlist(.[Reduce(\(y, x) `[<-`(y, !y, agrepl(x, names(.)[!y],
   2)), tolower(search_terms), logical(length(.)))], FALSE, FALSE)),]
#example_data[unique(unlist(.[Reduce(\(y, x) y | agrepl(x, names(.), 2),
#   tolower(search_terms), FALSE)], FALSE, FALSE)),] #Alternative
#  project               disease_phase   startdate
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017
#1    A111       Diabetes, Preclinical   01DEC2018

使用consepp的一些更简单的示例：

#Allow 1 character difference to make match
agrepl("preclinical", c("precinical", "precinicalxyz", "prelcinical"), 1)
#[1]  TRUE  TRUE FALSE

#Allow 2 character difference to make match
agrepl("preclinical", c("precinical", "precinicalxyz", "prelcinical"), 2)
#[1] TRUE TRUE TRUE

#Use boundaries to match words
agrepl("\\bpreclinical\\b", c("xyz precinical xyz", "xyzpreclinicalxyz"), 1, fixed=FALSE)
#[1]  TRUE FALSE

使用max.distance可以设置多少差异：

max.distance: Maximum distance allowed for a match.  Expressed either
          as integer, or as a fraction of the _pattern_ length times
          the maximal transformation cost (will be replaced by the
          smallest integer not less than the corresponding fraction),
          or a list with possible components

          ‘cost’: maximum number/fraction of match cost (generalized
              Levenshtein distance)

          ‘all’: maximal number/fraction of _all_ transformations
              (insertions, deletions and substitutions)

          ‘insertions’: maximum number/fraction of insertions

          ‘deletions’: maximum number/fraction of deletions

          ‘substitutions’: maximum number/fraction of substitutions

以及基于@jbgruber的基准：

system.time({  #Libraries needed for method of JBGruber
library(dplyr);
library(stringdist);
library(Rfast);
library(tidytext)
})
#       User      System verstrichen 
#      1.008       0.040       1.046 

set.seed(42)
example_large <- example_data %>% sample_n(5000, replace = TRUE)

stringdist_detect <- function(a, b, method = "osa", thres = 2) {
  Rfast::rowMins(stringdist::stringdistmatrix(a, b, method = method), value = TRUE) <= thres
}

bench::mark(check = FALSE,
  stringdist_detect = {
     example_large %>% 
      tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>% 
      filter(stringdist_detect(word, tolower(search_terms), method = "lv"))
  },
  GKi ={. <- strsplit(example_large$disease_phase, "[ ,;]+")
   . <- split(rep(seq_along(.), lengths(.)), tolower(unlist(.)))
   example_large[unique(unlist(.[Reduce(\(y, x) y | agrepl(x, names(.), 2),
     tolower(search_terms), FALSE)], FALSE, FALSE)),]
})
#  expression             min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#  <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#1 stringdist_detect  17.42ms  18.65ms      52.8    7.15MB    19.4     19     7
#2 GKi                 5.64ms   6.04ms     165.   869.08KB     6.27    79     3

当只有一个时，还可以节省很多时间，正确的书面，search_terms中感兴趣的单词的变体。

You can use agrepl for Approximate String Matching (Fuzzy Matching) which is in base.

example_data[agrep(paste(search_terms, collapse = "|"),
  example_data$disease_phase, 2, ignore.case=TRUE, fixed=FALSE),]
#  project               disease_phase   startdate
#1    A111       Diabetes, Preclinical   01DEC2018
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017

Or using Reduce instead of | in the regex.

example_data[Reduce(\(y, x) y | agrepl(x, example_data$disease_phase, 2,
  ignore.case=TRUE), search_terms, FALSE),]
#  project               disease_phase   startdate
#1    A111       Diabetes, Preclinical   01DEC2018
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017

An alternative might be adist, also in base, which calculates a distance matrix - so it might not be recommended for larger vectors, as the matrix can get large. Here I also choose that a mismatch by 2 characters will be OK.

example_data[colSums(adist(unique(search_terms), example_data$disease_phase,
                           partial=TRUE) < 3) > 0,]
#  project               disease_phase   startdate
#1    A111       Diabetes, Preclinical   01DEC2018
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017

In case only single words are compared it might be more efficient so split the disease_phase into words using strsplit also in base.

. <- strsplit(example_data$disease_phase, "[ ,;]+")
. <- split(rep(seq_along(.), lengths(.)), tolower(unlist(.)))
example_data[unique(unlist(.[Reduce(\(y, x) `[<-`(y, !y, agrepl(x, names(.)[!y],
   2)), tolower(search_terms), logical(length(.)))], FALSE, FALSE)),]
#example_data[unique(unlist(.[Reduce(\(y, x) y | agrepl(x, names(.), 2),
#   tolower(search_terms), FALSE)], FALSE, FALSE)),] #Alternative
#  project               disease_phase   startdate
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017
#1    A111       Diabetes, Preclinical   01DEC2018

Some simpler examples using agrep:

#Allow 1 character difference to make match
agrepl("preclinical", c("precinical", "precinicalxyz", "prelcinical"), 1)
#[1]  TRUE  TRUE FALSE

#Allow 2 character difference to make match
agrepl("preclinical", c("precinical", "precinicalxyz", "prelcinical"), 2)
#[1] TRUE TRUE TRUE

#Use boundaries to match words
agrepl("\\bpreclinical\\b", c("xyz precinical xyz", "xyzpreclinicalxyz"), 1, fixed=FALSE)
#[1]  TRUE FALSE

How much difference will be allowed can be set with max.distance:

max.distance: Maximum distance allowed for a match.  Expressed either
          as integer, or as a fraction of the _pattern_ length times
          the maximal transformation cost (will be replaced by the
          smallest integer not less than the corresponding fraction),
          or a list with possible components

          ‘cost’: maximum number/fraction of match cost (generalized
              Levenshtein distance)

          ‘all’: maximal number/fraction of _all_ transformations
              (insertions, deletions and substitutions)

          ‘insertions’: maximum number/fraction of insertions

          ‘deletions’: maximum number/fraction of deletions

          ‘substitutions’: maximum number/fraction of substitutions

And also a Benchmark based on @JBGruber:

system.time({  #Libraries needed for method of JBGruber
library(dplyr);
library(stringdist);
library(Rfast);
library(tidytext)
})
#       User      System verstrichen 
#      1.008       0.040       1.046 

set.seed(42)
example_large <- example_data %>% sample_n(5000, replace = TRUE)

stringdist_detect <- function(a, b, method = "osa", thres = 2) {
  Rfast::rowMins(stringdist::stringdistmatrix(a, b, method = method), value = TRUE) <= thres
}

bench::mark(check = FALSE,
  stringdist_detect = {
     example_large %>% 
      tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>% 
      filter(stringdist_detect(word, tolower(search_terms), method = "lv"))
  },
  GKi ={. <- strsplit(example_large$disease_phase, "[ ,;]+")
   . <- split(rep(seq_along(.), lengths(.)), tolower(unlist(.)))
   example_large[unique(unlist(.[Reduce(\(y, x) y | agrepl(x, names(.), 2),
     tolower(search_terms), FALSE)], FALSE, FALSE)),]
})
#  expression             min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#  <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#1 stringdist_detect  17.42ms  18.65ms      52.8    7.15MB    19.4     19     7
#2 GKi                 5.64ms   6.04ms     165.   869.08KB     6.27    79     3

Also much time could be saved when there is only one, right written, variant of the words of interest in search_terms.

回复收藏 0 原文

谎言 2025-01-30 02:09:36

我认为最有效/最快的方法是：

stringdist_detect <- function(a, b, method = "osa", thres = 2) {
  Rfast::rowMins(stringdist::stringdistmatrix(a, b, method = method), value = TRUE) <= thres
}

StringDist :: StringDistMatrix计算所有之间的距离矩阵
a和b中的值。我从未听说过Rfast :: Colmins，但是有些谷歌搜索
告诉我，这是找到每行最小值的最快方法
矩阵（<代码>应用（x，2，min）将实现相同的操作）。仅此而已
我们想要：最低限度，因为它告诉我们单词之间的最小距离
在A和b中。我们可以将其与阈值值进行比较。看
？stringdist :: stringdist-metrics有关方法参数的更多信息。
我只是遵循@SHS的建议，这似乎是合理的。

现在，我要做的第二件事是在比较距离之前对文本进行标记化，因为在代币中找到拼写错误的情况更加有意义。 tidytext :: unnest_tokens是一个不错的功能，将文本分为单词（即，象征化）：

example_data %>%
  tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>% 
  filter(stringdist_detect(word, tolower(search_terms)))

## # A tibble: 2 × 4
##   project disease_phase               startdate   word       
##   <chr>   <chr>                       <chr>       <chr>      
## 1 A111    Diabetes, Preclinical       01DEC2018   preclinical
## 2 A123    Lipid lowering, Perlcinical 17-OKT-2017 perlcinical

Tokenisation具有额外的优势，您有一列告诉您
哪个词匹配了。这应该使测试不同
阈值容易得多。但是，正如@SHS建议的那样，如果确定了两个拼写错误，您将获得一些重复。您可以使用过滤器（！重复（项目））在下一部分中可以摆脱重复的拼写错误。

如果您不想定义自己的功能，也可以关注
@Maël的建议。在这里拼写：

search_terms <- data.frame(word = search_terms)
example_data %>%
  tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>% 
  fuzzyjoin::stringdist_inner_join(search_terms, by = "word", max_dist = 2) %>% 
  filter(!duplicated(project))

## # A tibble: 2 × 5
##   project disease_phase               startdate   word.x      word.y     
##   <chr>   <chr>                       <chr>       <chr>       <chr>      
## 1 A111    Diabetes, Preclinical       01DEC2018   preclinical preclinical
## 2 A123    Lipid lowering, Perlcinical 17-OKT-2017 perlcinical preclinical

您可以看到的基准

example_large <- example_data %>%
      # upsample for more realistic scenario
      sample_n(5000, replace = TRUE)

res <- bench::mark(
  stringdist_detect = {
     example_large %>% 
      tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>% 
      filter(stringdist_detect(word, tolower(search_terms), method = "lv"))
  },
  fuzzyjoin = {
    example_large %>% 
      tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>% 
      fuzzyjoin::stringdist_inner_join(data.frame(word = search_terms), by = "word", max_dist = 2) %>% 
      filter(!duplicated(project))
  },
  agrepl = {
    example_large %>% 
      filter(agrepl(paste(search_terms, collapse = "|"), disease_phase, 2, ignore.case=TRUE, fixed=FALSE))
  },
  agrepl_reduce = {
    example_large[Reduce(\(y, x) y | agrepl(x, example_large$disease_phase, 2,
                                           ignore.case=TRUE), search_terms, FALSE),]
  },
  check = FALSE
)
summary(res)

## # A tibble: 4 × 6
##   expression             min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 stringdist_detect   21.3ms   23.3ms     42.8         NA     13.4
## 2 fuzzyjoin           57.4ms   60.1ms     16.8         NA     13.4
## 3 agrepl             224.7ms  226.4ms      4.33        NA      0  
## 4 agrepl_reduce        229ms  229.1ms      4.36        NA      0

summary(res, relative =TRUE)

## # A tibble: 4 × 6
##   expression          min median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>        <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
## 1 stringdist_detect  1      1         9.88        NA      Inf
## 2 fuzzyjoin          2.70   2.59      3.88        NA      Inf
## 3 agrepl            10.6    9.73      1           NA      NaN
## 4 agrepl_reduce     10.8    9.85      1.01        NA      NaN

，stringdist_detect是最快的，其次是fuzzyjoin（它也使用string> Stringdist ）。我使用conseppl包括 @gki的方法。在较小的数据集上，calspl实际上更快，但是我认为您的真实数据集中的5行可能还要多。在数据中尝试这些功能并进行报告不会有任何伤害。

I think the most efficient/fastest way is this:

stringdist_detect <- function(a, b, method = "osa", thres = 2) {
  Rfast::rowMins(stringdist::stringdistmatrix(a, b, method = method), value = TRUE) <= thres
}

stringdist::stringdistmatrix calculates a distance matrix between all
values in a and b. I’ve never heard of Rfast::colMins but some googling
tells me it is the fastest way to find the minimum value in each row of a
matrix (apply(x, 2, min) would accomplish the same). And that is all
we want: the minimum, as it tells us the smallest distance between words
in a and b. We can compare this to a threshold value. Look at
?stringdist::stringdist-metrics for more infos on the method argument.
I simply followed @shs suggestion, which seems plausible.

Now the second thing I would do is to tokenize the text before comparing distances, as finding misspellings in tokens makes a lot more sense. tidytext::unnest_tokens is a nice function that splits text into words (i.e., tokenization):

example_data %>%
  tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>% 
  filter(stringdist_detect(word, tolower(search_terms)))

## # A tibble: 2 × 4
##   project disease_phase               startdate   word       
##   <chr>   <chr>                       <chr>       <chr>      
## 1 A111    Diabetes, Preclinical       01DEC2018   preclinical
## 2 A123    Lipid lowering, Perlcinical 17-OKT-2017 perlcinical

Tokenisation has the extra advantage that you have a column telling you
which word hast been matched. Which should make testing different
threshold much easier. However, as @shs suggested, you get some duplication if two misspellings are identified. You can use filter(!duplicated(project)) as in the next part to get rid of duplicated misspelling.

If you don’t want to define your own function, you can also follow
@Maël’s suggestion. Here it is spelled out:

search_terms <- data.frame(word = search_terms)
example_data %>%
  tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>% 
  fuzzyjoin::stringdist_inner_join(search_terms, by = "word", max_dist = 2) %>% 
  filter(!duplicated(project))

## # A tibble: 2 × 5
##   project disease_phase               startdate   word.x      word.y     
##   <chr>   <chr>                       <chr>       <chr>       <chr>      
## 1 A111    Diabetes, Preclinical       01DEC2018   preclinical preclinical
## 2 A123    Lipid lowering, Perlcinical 17-OKT-2017 perlcinical preclinical

benchmark

example_large <- example_data %>%
      # upsample for more realistic scenario
      sample_n(5000, replace = TRUE)

res <- bench::mark(
  stringdist_detect = {
     example_large %>% 
      tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>% 
      filter(stringdist_detect(word, tolower(search_terms), method = "lv"))
  },
  fuzzyjoin = {
    example_large %>% 
      tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>% 
      fuzzyjoin::stringdist_inner_join(data.frame(word = search_terms), by = "word", max_dist = 2) %>% 
      filter(!duplicated(project))
  },
  agrepl = {
    example_large %>% 
      filter(agrepl(paste(search_terms, collapse = "|"), disease_phase, 2, ignore.case=TRUE, fixed=FALSE))
  },
  agrepl_reduce = {
    example_large[Reduce(\(y, x) y | agrepl(x, example_large$disease_phase, 2,
                                           ignore.case=TRUE), search_terms, FALSE),]
  },
  check = FALSE
)
summary(res)

## # A tibble: 4 × 6
##   expression             min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 stringdist_detect   21.3ms   23.3ms     42.8         NA     13.4
## 2 fuzzyjoin           57.4ms   60.1ms     16.8         NA     13.4
## 3 agrepl             224.7ms  226.4ms      4.33        NA      0  
## 4 agrepl_reduce        229ms  229.1ms      4.36        NA      0

summary(res, relative =TRUE)

## # A tibble: 4 × 6
##   expression          min median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>        <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
## 1 stringdist_detect  1      1         9.88        NA      Inf
## 2 fuzzyjoin          2.70   2.59      3.88        NA      Inf
## 3 agrepl            10.6    9.73      1           NA      NaN
## 4 agrepl_reduce     10.8    9.85      1.01        NA      NaN

As you can see, stringdist_detect is the fastest, followed by fuzzyjoin (which uses stringdist under the hood as well). I aso included @GKi's approach using agrepl. On smaller datasets, agrepl was actually faster, but I think you probably have more than the 5 rows in your real dataset. It would not hurt to try these functions in your data and report back.

回复收藏 0 原文

千纸鹤带着心事 2025-01-30 02:09:36

damerau noreferrer“> damerau – damerau – levenshtein距离它是错别字的。在以下代码中，我将disevian_phase拆分，查看任何一个子字符串是否与“临床前”紧密匹配。

library(tidyverse)
library(stringdist)

example_data |> 
  filter(str_split(disease_phase, "\\W+") |> 
           map_lgl(\(x) x |> 
                 stringdist("preclinical", "dl") |> 
                 (`<=`)(4) |> # Threshold for distance
                 any()
               )
         )
#> # A tibble: 2 × 3
#>   project disease_phase               startdate  
#>   <chr>   <chr>                       <chr>      
#> 1 A111    Diabetes, Preclinical       01DEC2018  
#> 2 A123    Lipid lowering, Perlcinical 17-OKT-2017

我选择了＆lt; = 4的相当保守的阈值距离，因为如下所示，您的错别字示例都低于该示例。您可能想对良好的阈值进行一些测试。

stringdist(search_terms, "preclinical")
#>  [1] 0 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 3 3 2 2

^由

创建：

正如我在对JBGruber回答的评论中所说的那样，长时间而不是嵌套具有很大的性能好处。更好的做法：

example_large |>
  tidytext::unnest_tokens(word, disease_phase, drop = F) |>
  mutate(str_dist = stringdist(word, "preclinical", method = "dl")) |>
  filter(str_dist < 4) |>
  group_by(project, disease_phase) |>
  slice(which.min(str_dist))

最后两行是避免“临床前”在同一字符串中出现两次时，它在示例数据中没有，但在大型人类生成中不太可能在同一字符串中出现。数据集。

The Damerau–Levenshtein distance is a good choice for measuring string distance when it comes to typos. In the following piece of code I split the disease_phase and look if any of the substrings match closely with "preclinical".

library(tidyverse)
library(stringdist)

example_data |> 
  filter(str_split(disease_phase, "\\W+") |> 
           map_lgl(\(x) x |> 
                 stringdist("preclinical", "dl") |> 
                 (`<=`)(4) |> # Threshold for distance
                 any()
               )
         )
#> # A tibble: 2 × 3
#>   project disease_phase               startdate  
#>   <chr>   <chr>                       <chr>      
#> 1 A111    Diabetes, Preclinical       01DEC2018  
#> 2 A123    Lipid lowering, Perlcinical 17-OKT-2017

I chose a rather conservative threshold distance of <=4, because as you can see below, your typo examples all fell below that. You may want to do a bit of testing for a good threshold.

stringdist(search_terms, "preclinical")
#>  [1] 0 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 3 3 2 2

^{Created on 2022-04-23 by the reprex package (v2.0.1)}

edit:

As I stated in my comments of JBGruber's answer, going long instead of nested has a significant performance benefit. So better do:

example_large |>
  tidytext::unnest_tokens(word, disease_phase, drop = F) |>
  mutate(str_dist = stringdist(word, "preclinical", method = "dl")) |>
  filter(str_dist < 4) |>
  group_by(project, disease_phase) |>
  slice(which.min(str_dist))

The last two lines are to avoid potential duplicates when "preclinical" appears twice in the same string, which it doesn't in the sample data, but is not unlikely in a large human generated data set.

回复收藏 0 原文

~没有更多了~