How to convert Unicode text that looks like English to ASCII?

Posted on 2024-12-27 12:44:21

I have a text like "Previously" which looks English but has non-ASCII characters in it (shown in a screenshot in the original post).

What would be the easiest way to convert it to English text (so that "P" would be a Latin capital letter P, for example)?

For simplicity, let's assume that the non-English characters are Russian.
However, a more general solution would be much appreciated!

Preferred languages: JavaScript, Ruby, Bash script.

Comments (4)

浅暮の光 2025-01-03 12:44:21

Although some Cyrillic (and Greek) letters are identical to some Latin letters in graphic appearance (i.e., there is probably no difference in any font that contains both), there is no formal mapping defined between them. Thus, you would need to define the mapping yourself. For Russian, there is a rather limited number of such letters, so a small mapping table would do. But if you wish to cover all of Unicode, there is a large number of lookalikes and near-lookalikes, so the hard part would be deciding which characters are similar enough.
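
Before building such a table it can help to see which characters in the input are actually non-ASCII. The following is only a minimal sketch; the helper name listNonAscii is made up for illustration and is not part of any library.

```javascript
// Hypothetical helper: report every non-ASCII character together with
// its Unicode code point, so you can decide what the mapping must cover.
function listNonAscii(text) {
  const found = [];
  for (const ch of text) {            // for...of iterates by code point
    const cp = ch.codePointAt(0);
    if (cp > 127) {
      const hex = cp.toString(16).toUpperCase().padStart(4, '0');
      found.push({ char: ch, codePoint: 'U+' + hex });
    }
  }
  return found;
}

// "\u0420" is CYRILLIC CAPITAL LETTER ER, which looks like a Latin "P".
console.log(listNonAscii('\u0420reviously'));
```

For the sample string this reports a single entry, the Cyrillic letter at U+0420; the remaining letters are plain ASCII.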

娇俏 2025-01-03 12:44:21

I would recommend the unidecode package; it maps Greek and Cyrillic letters to their closest ASCII symbols and removes any diacritics:

unidecode('Lillı Celiné Никита Ödipus');

'Lilli Celine Nikita Odipus'

情场扛把子 2025-01-03 12:44:21

If the number of characters to be converted is small (e.g. the Russian alphabet), then a simple dictionary mapping input characters to output characters would suffice.

Simply loop through the string and, for each character, check whether it is in the dictionary; if so, replace it with the replacement character stored there.
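
The loop described above could be sketched in JavaScript roughly as follows. The table is an assumption made for illustration (it covers only some Russian look-alikes), and the names LOOKALIKES and replaceLookalikes are hypothetical. The keys are written as escape sequences because the Cyrillic letters are visually indistinguishable from the Latin ones in source code.

```javascript
// Assumed look-alike table: Cyrillic letter -> similar-looking Latin letter.
// Incomplete on purpose; extend it to suit your data.
const LOOKALIKES = new Map([
  ['\u0410', 'A'], // Cyrillic A
  ['\u0412', 'B'], // Cyrillic Ve
  ['\u0415', 'E'], // Cyrillic Ie
  ['\u041A', 'K'], // Cyrillic Ka
  ['\u041C', 'M'], // Cyrillic Em
  ['\u041D', 'H'], // Cyrillic En
  ['\u041E', 'O'], // Cyrillic O
  ['\u0420', 'P'], // Cyrillic Er
  ['\u0421', 'C'], // Cyrillic Es
  ['\u0422', 'T'], // Cyrillic Te
  ['\u0425', 'X'], // Cyrillic Ha
  ['\u0430', 'a'], ['\u0435', 'e'], ['\u043E', 'o'],
  ['\u0440', 'p'], ['\u0441', 'c'], ['\u0443', 'y'], ['\u0445', 'x'],
]);

function replaceLookalikes(text) {
  let result = '';
  for (const ch of text) {
    // Use the table entry if there is one, otherwise keep the character.
    result += LOOKALIKES.has(ch) ? LOOKALIKES.get(ch) : ch;
  }
  return result;
}

// Cyrillic Er, Ie and O hidden inside an English-looking word:
console.log(replaceLookalikes('\u0420r\u0435vi\u043Eusly')); // Previously
```

Characters that are not in the table (including ordinary ASCII) pass through unchanged, so the function is safe to run on text that is already clean.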

梦回旧景 2025-01-03 12:44:21

The Russian government has provided an official list of Cyrillic characters that look like Latin characters, and you can use this list to construct a lookup table. To handle diacritics more generally, you can additionally normalize the string and filter out the remaining non-ASCII characters, if you need that:

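function Americanize(str){
   // Cyrillic letters mapped to the Latin letters they resemble
   const lookup = {
      А:'A',В:'B',Е:'E',К:'K',М:'M',Н:'H',О:'O',
      Р:'P',С:'C',Т:'T',У:'y',Х:'X',а:'a',в:'B',
      е:'e',к:'K',м:'M',н:'H',о:'o',р:'p',с:'c',
      т:'T',у:'y',х:'x'
   };
   // NFKD splits accented letters into a base letter plus combining marks
   return Array.from(str.normalize('NFKD'))
      .map(ch => lookup[ch] ?? ch)           // replace look-alikes
      .filter(ch => ch.charCodeAt(0) < 128)  // drop what is still non-ASCII
      .join('');
}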

This obviously leaves out many characters that could be read as similar-looking but do not have the same shape (И vs. N).
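
If only the diacritics matter, the normalization step can also be used on its own. This is a minimal sketch rather than a complete solution: the name stripDiacritics is made up here, and the approach only works for characters that Unicode decomposes into a base letter plus combining marks.

```javascript
// Hypothetical helper: decompose accented letters (NFD), then delete the
// combining marks (Unicode general category M) that the decomposition exposes.
function stripDiacritics(text) {
  return text.normalize('NFD').replace(/\p{M}/gu, '');
}

// "\u00E9" is e with acute, "\u00D6" is O with diaeresis.
console.log(stripDiacritics('C\u00E9line \u00D6dipus')); // Celine Odipus
```

Letters without a canonical decomposition, such as "ø" or the dotless "ı", pass through unchanged and would still need a table entry.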
