Levenshtein distance on non-English strings
Will the Levenshtein distance algorithm work well for non-English language strings too?
Update: Would this work automatically in a language like Java when comparing Asian characters?
Comments (3)
Only if the language is letter-based, for example Russian or German. For logographic scripts (Chinese, for example) or syllabic ones (like Lao), it does not.
Yes. But you have to treat each non-English character as one character, not as several units (for example, the multiple bytes of its UTF-8 encoding).
For example, in Python you would use the unicode class to represent the string and its characters (in Python 3, the built-in str type fills that role).
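To illustrate the difference, here is a minimal Python 3 sketch. The `levenshtein` function is a hypothetical helper written for this example (a plain dynamic-programming version, not taken from any particular library); it accepts any sequence, so the same code can be run on a `str` (code points) and on its UTF-8 `bytes`.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (str, bytes, list, ...)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x with y
        prev = curr
    return prev[-1]

s, t = "日本語", "日本人"  # the strings differ in one character

# Compared as str, each character is one unit: distance 1.
char_distance = levenshtein(s, t)

# Compared as UTF-8 bytes, each of these characters is three units,
# so the same one-character change costs more than 1.
byte_distance = levenshtein(s.encode("utf-8"), t.encode("utf-8"))

print(char_distance, byte_distance)
```

Running the algorithm on decoded strings rather than encoded bytes is what "treat it as 1 character" amounts to in practice.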
Levenshtein doesn't care about languages; it just tells you how many characters need to be changed (added, removed, or substituted) to get from one string to the other.
So: yes, but you'll have to check your charset, because some non-English "single" characters may otherwise be treated as two (or more) characters.
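One case where a "single" character ends up as two is Unicode combining sequences: "é" can be stored as one code point or as "e" plus a combining accent. The sketch below assumes Python 3 and uses the standard `unicodedata.normalize` function; `levenshtein` is again a hypothetical helper written for this example, and normalizing to NFC before comparing is one possible approach, not the only one.

```python
import unicodedata

def levenshtein(a, b):
    """Edit distance between two sequences, counted per element."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

composed = "caf\u00e9"     # "é" as a single code point
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent

# The two strings look identical on screen but differ in code points,
# so the raw distance is not zero.
raw = levenshtein(composed, decomposed)

# After normalizing both to the same form (NFC here), they compare equal.
def nfc(s):
    return unicodedata.normalize("NFC", s)

normalized = levenshtein(nfc(composed), nfc(decomposed))

print(raw, normalized)
```

In Java the analogous concern is that `char` is a UTF-16 code unit, so characters outside the Basic Multilingual Plane occupy two `char` values; comparing by code point avoids counting them twice.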