Fuzzy version of stringr::str_detect for filtering a dataframe
I've got a database with free-text fields that I want to use to filter a data.frame or tibble. With a lot of work, I could create a list of all misspellings of my search terms that currently occur in the data (see the example below of all the spellings I had of one term) and then just use stringr::str_detect as in the example code below. However, this will not be safe if more misspellings appear in the future. If I'm willing to accept some limitations or make some assumptions (e.g. about how large the edit distance between misspellings can be, or some other measure of difference, and that people won't use completely different terms), is there a simple solution for doing a fuzzy version of str_detect?

As far as I can see, the obvious packages like stringdist do not seem to have a function that does this directly. I guess I could write my own function that applies something like stringdist::afind or stringdist::amatch to each element of a vector and post-processes the results to return a vector of TRUE or FALSE values, but I wonder whether such a function already exists somewhere (and is implemented more efficiently than I would manage).

Here's an example that illustrates how, with str_detect, I might miss one row I would want:
library(tidyverse)

# All spellings of one term currently found in the data
search_terms = c("preclinical", "Preclincal", "Preclincial", "Preclinial",
                 "Precllinical", "Preclilnical", "Preclinica", "Preclnical",
                 "Peclinical", "Prclinical", "Peeclinical", "Pre clinical",
                 "Precclinical", "Preclicnial", "Precliical", "Precliinical",
                 "Preclinal", "Preclincail", "Preclinicgal", "Priclinical")

example_data = tibble(project = c("A111", "A123", "B112", "A224", "C149"),
                      disease_phase = c("Diabetes, Preclinical", "Lipid lowering, Perlcinical",
                                        "Asthma, Phase I", "Phase II; Hypertension", "Phase 3"),
                      startdate = c("01DEC2018", "17-OKT-2017", "11/15/2019", "1. Dezember 2004", "2005-11-30"))

# Finds only project A111, but not A123
example_data %>%
  filter(str_detect(tolower(disease_phase), paste0(tolower(search_terms), collapse = "|")))
You can use agrepl, which is in base, for approximate string matching (fuzzy matching). Or use Reduce to combine the matches for each term instead of joining the terms with | in the regex.

An alternative might be adist, also in base, which calculates a distance matrix, so it might not be advisable for larger vectors, as the matrix can get large. Here I also chose to accept a mismatch of up to 2 characters.

If only single words are compared, it might be more efficient to split disease_phase into words using strsplit, also in base.

Some simpler examples use agrep. How much difference is allowed can be set with max.distance:
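A minimal sketch of how max.distance changes what agrepl accepts, using only base R. The example strings come from the question; the thresholds 2 and 3 and the shortened term list are illustrative assumptions, not this answer's original code.

```r
# Example strings taken from the question's disease_phase column
disease_phase <- c("Diabetes, Preclinical", "Lipid lowering, Perlcinical",
                   "Asthma, Phase I", "Phase II; Hypertension", "Phase 3")

# One correctly spelled term; an integer max.distance is the number of allowed edits
strict <- agrepl("preclinical", disease_phase, max.distance = 2, ignore.case = TRUE)

# A looser threshold accepts more deviation from the pattern
loose <- agrepl("preclinical", disease_phase, max.distance = 3, ignore.case = TRUE)

# Several terms: OR the per-term results together with Reduce
# instead of pasting the terms into one regex with "|"
search_terms <- c("preclinical", "Preclincal", "Pre clinical")
any_term <- Reduce(`|`, lapply(search_terms, agrepl, x = disease_phase,
                               max.distance = 2, ignore.case = TRUE))

print(data.frame(disease_phase, strict, loose, any_term))
```

Printing the three logical vectors side by side shows at which threshold a given misspelling starts to be caught, which is a quick way to pick a value for max.distance.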
I also ran a benchmark based on @JBGruber's. Much time can also be saved when search_terms contains only one, correctly written, variant of each word of interest.
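The adist and strsplit suggestions can be sketched as follows. This is a minimal base-R illustration rather than the answer's original code: the helper name fuzzy_detect, its arguments and the word-splitting regex are assumptions made for the example, and it uses a single correctly spelled term as suggested above.

```r
# Hypothetical helper: TRUE where any word of `text` is within `max_dist`
# edits (plain Levenshtein, via utils::adist) of `term`.
fuzzy_detect <- function(text, term, max_dist = 2) {
  # Split each string into lowercase words on anything that is not a letter or digit
  words <- strsplit(tolower(text), "[^[:alnum:]]+")
  vapply(words,
         function(w) any(adist(tolower(term), w) <= max_dist),
         logical(1))
}

disease_phase <- c("Diabetes, Preclinical", "Lipid lowering, Perlcinical",
                   "Asthma, Phase I", "Phase II; Hypertension", "Phase 3")

# Distances are computed per word, so each adist matrix stays small
fuzzy_detect(disease_phase, "preclinical", max_dist = 2)
fuzzy_detect(disease_phase, "preclinical", max_dist = 3)
```

Because adist counts a swap of two adjacent letters as more than one edit, a transposition-heavy typo may need a larger max_dist here than with a distance that treats transpositions as a single edit.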
I think the most efficient/fastest way is this:
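A minimal sketch of such a helper, assuming the stringdist package is installed. The argument name thres, the defaults, and the use of apply(x, 2, min) in place of Rfast::colMins are assumptions for illustration, not necessarily the original definition.

```r
library(stringdist)

# a: character vector to search in; b: one or more search terms.
# thres (hypothetical name): largest distance that still counts as a match.
stringdist_detect <- function(a, b, method = "osa", thres = 2) {
  # Rows correspond to the terms in b, columns to the elements of a
  d <- stringdistmatrix(b, a, method = method)
  # Column minimum = distance from each element of a to its closest term
  apply(d, 2, min) <= thres
}

stringdist_detect(c("preclinical", "perlcinical", "phase"), "preclinical")
```

The result is a logical vector of the same length as a, so it can be used inside dplyr::filter in the same way as str_detect.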
stringdist::stringdistmatrix calculates a distance matrix between all values in a and b. I'd never heard of Rfast::colMins, but some googling tells me it is the fastest way to find the minimum value in each column of a matrix (apply(x, 2, min) would accomplish the same). And that is all we want: the minimum, as it tells us the smallest distance between words in a and b. We can compare this to a threshold value. Look at ?stringdist::`stringdist-metrics` for more info on the method argument. I simply followed @shs's suggestion, which seems plausible.

Now the second thing I would do is to tokenize the text before comparing distances, as finding misspellings in tokens makes a lot more sense.
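A minimal sketch of that tokenize-then-compare idea, assuming the dplyr, tidytext and stringdist packages are installed. The column name word, the "osa" method and the threshold of 2 are illustrative assumptions; the data is a trimmed copy of the question's example.

```r
library(dplyr)
library(tidytext)
library(stringdist)

example_data <- tibble(
  project = c("A111", "A123", "B112", "A224", "C149"),
  disease_phase = c("Diabetes, Preclinical", "Lipid lowering, Perlcinical",
                    "Asthma, Phase I", "Phase II; Hypertension", "Phase 3")
)

# One row per lowercased word; the other columns are repeated for each word
tokens <- example_data %>%
  unnest_tokens(output = word, input = disease_phase)

# Keep rows whose word is close to the term, then drop repeated projects
matches <- tokens %>%
  filter(stringdist(word, "preclinical", method = "osa") <= 2) %>%
  filter(!duplicated(project))

matches
```

The word column in the result shows which token triggered the match, which helps when judging whether a threshold is too loose.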
tidytext::unnest_tokens is a nice function that splits text into words (i.e., tokenization). Tokenisation has the extra advantage that you have a column telling you which word has been matched, which should make testing different thresholds much easier. However, as @shs suggested, you get some duplication if two misspellings are identified. You can use filter(!duplicated(project)) to get rid of the duplicated rows.

If you don't want to define your own function, you can also follow @Maël's suggestion. Here it is spelled out:
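A minimal sketch of a fuzzyjoin-based variant, assuming the dplyr, tidytext and fuzzyjoin packages are installed. That stringdist_inner_join is the intended join, as well as the terms table, the column names and max_dist = 2, are assumptions made for illustration.

```r
library(dplyr)
library(tidytext)
library(fuzzyjoin)

example_data <- tibble(
  project = c("A111", "A123", "B112", "A224", "C149"),
  disease_phase = c("Diabetes, Preclinical", "Lipid lowering, Perlcinical",
                    "Asthma, Phase I", "Phase II; Hypertension", "Phase 3")
)

# Lookup table holding the correctly spelled search term(s)
terms <- tibble(term = "preclinical")

fuzzy_hits <- example_data %>%
  unnest_tokens(output = word, input = disease_phase) %>%   # one word per row
  stringdist_inner_join(terms, by = c("word" = "term"),     # keep words near a term
                        max_dist = 2, method = "osa") %>%
  filter(!duplicated(project))                              # one row per project

fuzzy_hits
```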
Benchmark

In the benchmark, stringdist_detect is the fastest, followed by fuzzyjoin (which uses stringdist under the hood as well). I also included @GKi's approach using agrepl. On smaller datasets, agrepl was actually faster, but I think you probably have more than the 5 rows in your real dataset. It would not hurt to try these functions on your data and report back.
The Damerau–Levenshtein distance is a good choice for measuring string distance when it comes to typos. The idea is to split disease_phase and check whether any of the substrings closely matches "preclinical". I chose a rather conservative threshold distance of <=4 because your typo examples all fell below that. You may want to do a bit of testing to find a good threshold.

Created on 2022-04-23 by the reprex package (v2.0.1)
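As a rough reconstruction of the approach described above, assuming the stringdist package is installed: the helper name fuzzy_detect_dl, the splitting regex and the handful of typos picked from the question are assumptions for illustration.

```r
library(stringdist)

# A few of the spellings listed in the question, plus the one it misses
typos <- c("Preclincal", "Preclinial", "Pre clinical", "Preclicnial", "Perlcinical")

# Damerau-Levenshtein distance of each lowercased spelling to the target term
dists <- stringdist(tolower(typos), "preclinical", method = "dl")
dists

# Hypothetical helper: TRUE if any word in a string is within max_dist of the term
fuzzy_detect_dl <- function(x, term = "preclinical", max_dist = 4) {
  words <- strsplit(tolower(x), "[^a-z0-9]+")
  vapply(words,
         function(w) any(stringdist(w, term, method = "dl") <= max_dist),
         logical(1))
}

detected <- fuzzy_detect_dl(c("Diabetes, Preclinical",
                              "Lipid lowering, Perlcinical",
                              "Asthma, Phase I"))
detected
```

Inspecting dists for the known misspellings is one way to do the threshold testing mentioned above before settling on a value.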
Edit: As I stated in my comments on JBGruber's answer, going long instead of nested has a significant performance benefit. So better do:
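A minimal sketch of the long-format version, assuming the dplyr, tidyr, stringr and stringdist packages are installed. The helper column word and the splitting regex are assumptions; the threshold of 4 follows the text above.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(stringdist)

example_data <- tibble(
  project = c("A111", "A123", "B112", "A224", "C149"),
  disease_phase = c("Diabetes, Preclinical", "Lipid lowering, Perlcinical",
                    "Asthma, Phase I", "Phase II; Hypertension", "Phase 3")
)

result <- example_data %>%
  # Split each string into lowercase words (a list column) ...
  mutate(word = str_split(tolower(disease_phase), "[^a-z0-9]+")) %>%
  # ... and go long: one row per word
  unnest(word) %>%
  # One vectorised distance computation over all words
  filter(stringdist(word, "preclinical", method = "dl") <= 4) %>%
  # Drop the helper column and collapse rows that matched more than once
  select(-word) %>%
  distinct()

result
```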
The last two lines are to avoid potential duplicates when "preclinical" appears twice in the same string, which it doesn't in the sample data but is not unlikely in a large human-generated data set.