Damerau–Levenshtein distance for language-specific quirks

Posted 2024-10-10 03:36:35 · 316 words · 13 views · 0 comments

To Dutch speakers, the two characters "ij" are considered a single letter that is easily exchanged with "y".

For a project I'm working on, I would like a variant of the Damerau–Levenshtein distance that calculates the distance between "ij" and "y" as 1 instead of the current value of 2.

I've been trying this myself but failed. My problem is that I don't know how to handle the fact that the two texts have different lengths.
Does anyone have a suggestion or code fragment showing how to solve this?

Thanks.

Comments (3)

心意如水 2024-10-17 03:36:36

The Wikipedia article is rather loose with terminology. There are no such things as "strings" in "natural language". There are phonemes in natural language which can be represented by written characters and character-combinations.

Some character-combinations are vestiges of historical conventions which have survived into modern times, as in modern English "rough" where the "gh" can sound like -f- or make no sound at all. It seems to me that in focusing on raw "strings" the algorithm must be agnostic about the historical relationship of language and orthographic convention, which leads to some arbitrary metrics whenever character-combinations correlate to a single phoneme. How would it measure "rough" to "ruf"? Or "through" to "thru"?
Or German o-umlaut to "oe"?

In your case the -y- can be exchanged phonetically and orthographically with -ij-. So what is that according to the algorithm, two deletions followed by an insertion, or a single deletion of the -j- or of the -i- followed by a transposition of the remaining character to -y-? Or is -ij- being coalesced and the coalescence is followed by a transposition?

I would recommend that you substitute a single, otherwise unused character for -ij- before applying the algorithm, perhaps U+00EC, Latin small letter i with grave accent.
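A minimal sketch of that substitution idea, in Python. Everything here is an assumption for illustration: the function names are made up, the distance is the restricted (optimal string alignment) form of Damerau–Levenshtein, and the placeholder is U+0133 (Latin small ligature ij) rather than U+00EC, simply because it is less likely to occur in ordinary text. Only lowercase "ij" is handled.

```python
IJ = "\u0133"  # LATIN SMALL LIGATURE IJ, assumed not to occur in the input


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def collapse_ij(text: str) -> str:
    """Replace every lowercase 'ij' with the one-character placeholder."""
    return text.replace("ij", IJ)


def dutch_distance(a: str, b: str) -> int:
    """Distance after collapsing 'ij', so 'ij' versus 'y' is one substitution."""
    return osa_distance(collapse_ij(a), collapse_ij(b))
```

With this, `dutch_distance("ij", "y")` is a single substitution, while the plain `osa_distance("ij", "y")` still needs two edits.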

How does the algorithm handle multi-codepoint characters?

睫毛上残留的泪 2024-10-17 03:36:36

Well, the D-L distance itself isn't going to handle this for you, because of the way it measures distance.

As there is no code (or language) involved here, I can only suggest that you ensure all strings adhere to the same structure.

To clarify, since you're asking in general terms:

Bear in mind that the D-L distance compares character by character and doesn't actually read your strings as such. You'll have to parse before you compare, because cases where ij shouldn't be exchanged with y will otherwise cause other issues.

掩耳倾听 2024-10-17 03:36:36

One idea is to translate each string into some sort of constructed orthographic representation, in which digraphs such as "ij" and the English "gh", "th" and friends are only one character long. The distance metric does not have to be equal for all types of replacements when doing Damerau-Levenshtein, so you can use whatever penalties you want. The table needs to be filled locally, though, so you really want each sound to be one cell in the table.
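As a rough sketch of unequal penalties, assuming Python and a made-up cost table: the digraph is first collapsed to a single placeholder character (U+0133 here, assumed not to occur in the input), and a substitution between that placeholder and "y" is charged 0.5 instead of 1. The value 0.5 and all names are hypothetical; the recurrence is the restricted (optimal string alignment) form of Damerau–Levenshtein.

```python
IJ = "\u0133"  # LATIN SMALL LIGATURE IJ, assumed not to occur in the input

# Hypothetical penalties; every pair not listed here costs 1.0.
SUBSTITUTION_COST = {frozenset((IJ, "y")): 0.5}


def substitution_cost(x: str, y: str) -> float:
    """Cost of replacing x with y (symmetric)."""
    if x == y:
        return 0.0
    return SUBSTITUTION_COST.get(frozenset((x, y)), 1.0)


def weighted_distance(a: str, b: str) -> float:
    """Restricted Damerau-Levenshtein with per-pair substitution costs."""
    # One cell per "letter": collapse the digraph before filling the table.
    a, b = a.replace("ij", IJ), b.replace("ij", IJ)
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = float(i)
    for j in range(1, len(b) + 1):
        d[0][j] = float(j)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,  # deletion
                d[i][j - 1] + 1.0,  # insertion
                d[i - 1][j - 1] + substitution_cost(a[i - 1], b[j - 1]),
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)
    return d[-1][-1]
```

Other digraphs or confusable letters could be added to the table in the same way; which penalties make sense is a judgment call for the language in question.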

However, this breaks when the "ij" was not intended as "ij" but is a misspelling, when it spans a word-segmentation boundary (I don't know if that can happen in Dutch), or in any other situation where it is not actually (meant as) a digraph.

Otherwise you will need to do some lookaround. This will complicate things but should not change the growth order of the algorithm (I believe), provided you only look at a constant number of surrounding cells. The constant factors will still be much bigger, though.
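One way the lookaround might look, sketched in Python on the raw strings with no preprocessing. On top of the usual restricted Damerau–Levenshtein (optimal string alignment) transitions, each cell also checks whether the last two characters of one prefix are "ij" and the last character of the other is "y", and if so allows that swap for a cost of 1. The function name and the cost are assumptions; only a constant number of extra cells is inspected per cell, so the table still takes O(len(a) · len(b)) time.

```python
def ij_aware_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein where 'ij' <-> 'y' is one edit."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            best = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                best = min(best, d[i - 2][j - 2] + 1)
            # Lookaround: "ij" in a aligned with "y" in b.
            if i > 1 and a[i - 2:i] == "ij" and b[j - 1] == "y":
                best = min(best, d[i - 2][j - 1] + 1)
            # Lookaround: "y" in a aligned with "ij" in b.
            if j > 1 and a[i - 1] == "y" and b[j - 2:j] == "ij":
                best = min(best, d[i - 1][j - 2] + 1)
            d[i][j] = best
    return d[-1][-1]
```

Because the extra transitions are just additional candidates in the minimum, an "ij" that is better explained by ordinary edits is still handled by the ordinary transitions.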
