I used agrepl for fuzzy matching between two sets of addresses. The documentation says that the default is:
If cost is not given, all defaults to 10%, and the other
transformation number bounds default to all. The component names can
be abbreviated.
However, judging by the example in this Q&A, that doesn't seem to match up. Here is that example:
agrepl("cold", "cool")
#> [1] FALSE
agrepl("cool", "cold")
#> [1] TRUE
From the description, I'd imagine that 10% would mean 1 change in a 10-letter word, but here it is 1 change in 4 letters. How exactly is this calculated?
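The arithmetic in question can be laid out in a few lines of base R. This is only a sketch: max_edits is a hypothetical helper (it is not part of R), and it assumes that the 10% fraction is multiplied by the pattern length and rounded up to a whole number of edits. Whether agrepl does exactly this internally is the assumption being tested.

```r
# Hypothetical helper, not part of base R.
# Assumption: the fractional bound is scaled by the pattern length
# and rounded up to a whole number of allowed edits.
max_edits <- function(pattern, fraction = 0.1) {
  ceiling(fraction * nchar(pattern))
}

max_edits("cold")         # 4 characters:  ceiling(0.4) is 1
max_edits("abcdefghij")   # 10 characters: ceiling(1.0) is 1
max_edits("abcdefghijk")  # 11 characters: ceiling(1.1) is 2

# agrepl() searches for the pattern approximately *within* x,
# so swapping the arguments can change the result.
agrepl("cold", "cool")
agrepl("cool", "cold")
```

Under that assumption a 4-letter pattern is already allowed one edit, which would be consistent with the 1-in-4 behaviour in the example.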
Comments (1)
This is admittedly very confusing (at least to me!), but here's my attempt to explain it. The linked answer says:
How do we get from 0.1 (cost) × 4 (pattern length) to 1? Well, ?agrepl notes that the max.dist is expressed as (emphasis added); I take the parenthetical clause to mean that the maximum number of transformations is ceiling(0.1*4) = 1. We would need a pattern with length ≥ 11 in order for ceiling(0.1*pattern_length) to increase from 1 to 2 ...
If you want to find out where this is actually implemented, you have to dig fairly deep into the C source code, i.e. lines 59-60 of agrep.c, in the amatch_regparams function, where we see