PHP: Replace umlauts with the closest 7-bit ASCII equivalent in a UTF-8 string

Posted on 2024-07-06 15:57:35

What I want to do is remove all accents and umlauts from a string, turning "lärm" into "larm" or "andré" into "andre". I tried to utf8_decode the string and then use strtr on it, but since my source file is saved as a UTF-8 file, I can't enter the ISO-8859-15 characters for all the umlauts: the editor inserts the UTF-8 characters instead.

Obviously a solution for this would be to have an include that's an ISO-8859-15 file, but there must be a better way than to have another required include?

echo strtr(utf8_decode($input), 
           'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ',
           'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');

UPDATE: Maybe I was a bit inaccurate about what I am trying to do: I do not actually want to remove the umlauts, but to replace them with their closest "one character ASCII" equivalent.

Comments (8)

冷情 2024-07-13 15:57:35
iconv("utf-8","ascii//TRANSLIT",$input);

Extended example
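
The linked example is not reproduced on this page. As a rough sketch of what a fuller call might look like: the result of //TRANSLIT depends on the iconv implementation and, with glibc, on the current LC_CTYPE locale, so the locale names below are assumptions about what is installed on your system. The output can differ between platforms, so none is shown.

```php
<?php
// Assumption: at least one of these UTF-8 locales exists; names vary by system.
setlocale(LC_CTYPE, 'en_US.UTF-8', 'C.UTF-8');

$input = 'lärm andré';

// iconv() raises a notice and returns false when it cannot convert the string,
// so suppress the notice and check the return value explicitly.
$ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $input);

if ($ascii === false) {
    // Fall back to the untouched input rather than losing the text.
    $ascii = $input;
}

echo $ascii, "\n";
```

If characters come back as "?" or with stray quote marks, the locale or the iconv implementation is the first thing to check.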

烟酉 2024-07-13 15:57:35

A little trick that doesn't require setting locales or having huge translation tables:

function Unaccent($string)
{
    if (strpos($string = htmlentities($string, ENT_QUOTES, 'UTF-8'), '&') !== false)
    {
        $string = html_entity_decode(preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|tilde|uml);~i', '$1', $string), ENT_QUOTES, 'UTF-8');
    }

    return $string;
}

The only requirement for it to work properly is to save your files in UTF-8 (as you should already).
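
To see what this trick does and does not cover, here is a small usage sketch. The function body is the same as above, copied under the hypothetical name unaccent_entities so the snippet runs on its own. It only handles characters whose HTML entity name ends in one of the listed suffixes, so a character such as "Š" (entity name Scaron) passes through unchanged.

```php
<?php
// Same technique as the answer above; the function name is hypothetical.
function unaccent_entities(string $string): string
{
    // Turn accented characters into named entities, e.g. "ä" becomes "&auml;"
    $string = htmlentities($string, ENT_QUOTES, 'UTF-8');

    if (strpos($string, '&') !== false) {
        // Keep only the base letter(s) of the entity name, then decode whatever is left
        $string = html_entity_decode(
            preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|tilde|uml);~i', '$1', $string),
            ENT_QUOTES,
            'UTF-8'
        );
    }

    return $string;
}

echo unaccent_entities('lärm'), "\n";   // larm
echo unaccent_entities('andré'), "\n";  // andre
echo unaccent_entities('Æther'), "\n";  // AEther (ligatures expand to two letters)
echo unaccent_entities('Škoda'), "\n";  // Škoda ("caron" is not in the suffix list)
```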

为你拒绝所有暧昧 2024-07-13 15:57:35

You can also try this:

$string = "Fóø Bår";
$transliterator = Transliterator::createFromRules(':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;', Transliterator::FORWARD);
echo $normalized = $transliterator->transliterate($string);

However, you need the intl extension (http://php.net/manual/en/book.intl.php) to be available.

极致的悲 2024-07-13 15:57:35

Okay, found an obvious solution myself, but it's not the best concerning performance...

echo strtr(utf8_decode($input), 
           utf8_decode('ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
           'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
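
A possible variation that avoids utf8_decode altogether (it is deprecated as of PHP 8.2): when strtr is given an array of pairs, it replaces whole substrings, so multi-byte UTF-8 characters can be used directly as keys. The helper name and the deliberately short map below are illustrative assumptions, not a complete table. The source file must be saved as UTF-8, and the input is assumed to use precomposed characters.

```php
<?php
// Hypothetical helper; the map is a small illustrative subset.
function replace_accents_with_map(string $input): string
{
    $map = [
        'ä' => 'a', 'ö' => 'o', 'ü' => 'u', 'ß' => 's',
        'Ä' => 'A', 'Ö' => 'O', 'Ü' => 'U',
        'é' => 'e', 'è' => 'e', 'ê' => 'e',
        'à' => 'a', 'ç' => 'c', 'ñ' => 'n',
        'Š' => 'S', 'š' => 's', 'Ž' => 'Z', 'ž' => 'z',
    ];

    // With an array argument, strtr() matches whole keys (byte sequences),
    // so each UTF-8 character is replaced as a unit.
    return strtr($input, $map);
}

echo replace_accents_with_map('lärm'), "\n";   // larm
echo replace_accents_with_map('andré'), "\n";  // andre
```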

魔法唧唧 2024-07-13 15:57:35

If you are using WordPress, you can use the built-in function remove_accents( $string )

https://codex.wordpress.org/Function_Reference/remove_accents

However, I noticed a bug: it doesn't work on a string with a single character.

娇柔作态 2024-07-13 15:57:35

For Arabic and Persian users, I recommend removing diacritics this way:

    $diacritics = array('َ','ِ','ً','ٌ','ٍ','ّ','ْ','ـ');
    // $search_txt holds the text to clean; it is the subject (third argument)
    $search_txt = str_replace($diacritics, '', $search_txt);
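
A self-contained sketch of the same idea, written with Unicode escapes (PHP 7+) so the combining marks stay visible in the source. The function name is hypothetical. It passes the input text as the subject of str_replace, and it also includes damma (U+064F), which appears in the code list further down but not in the array above.

```php
<?php
// Hypothetical helper; code points are written as escapes for readability.
function strip_arabic_diacritics(string $text): string
{
    $diacritics = [
        "\u{064B}", // fathatan
        "\u{064C}", // dammatan
        "\u{064D}", // kasratan
        "\u{064E}", // fatha
        "\u{064F}", // damma
        "\u{0650}", // kasra
        "\u{0651}", // shadda
        "\u{0652}", // sukun
        "\u{0640}", // tatweel (kashida), an elongation character rather than a diacritic
    ];

    // The text to clean is the third argument (the subject)
    return str_replace($diacritics, '', $text);
}

$with    = "\u{0643}\u{064E}\u{062A}\u{064E}\u{0628}\u{064E}"; // kaf, teh, beh, each followed by fatha
$without = "\u{0643}\u{062A}\u{0628}";                          // the same letters without marks

var_dump(strip_arabic_diacritics($with) === $without); // bool(true)
```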

To type diacritics on an Arabic keyboard in Windows editors, you can either type them directly or hold Alt and type the code of the diacritic character. These are Windows Alt codes, not Unicode code points.
These are the codes:

ـَ(0243) ـِ(0246) ـُ(0245) ـً(0240) ـٍ(0242) ـٌ(0241) ـْ(0250) ـّ(0248) ـ
ـ(0220)

临走之时 2024-07-13 15:57:35

I found that this one gives the most consistent results in French and German.
With the meta tag set to utf-8, I placed it in a function that returns a line from an array of words, and it works perfectly.

htmlentities($line, ENT_SUBSTITUTE, 'utf-8');

野却迷人 2024-07-13 15:57:35

The canonical way to do this:

  1. Obtain the Normalization Form Canonical Decomposition of the text. See https://unicode.org/reports/tr15/ for Unicode Normalization Forms.
  2. Remove nonspacing marks.
  3. Obtain the Normalization Form Canonical Composition of the remaining text.

https://unicode-org.github.io/icu/userguide/transforms/general/

For example, to remove accents from characters, use the following transform:

NFD; [:Nonspacing Mark:] Remove; NFC.

I am a bit unsure why they have given this example as such when the page also notes

each transform rule consists of two colons followed by a transform name.

So we will add those. You need the intl extension which wraps the ICU library.

$t = \Transliterator::createFromRules(':: NFD; ::[:Nonspacing Mark:] Remove; :: NFC;');

Example

print $t->transliterate('أ');

This transforms U+0623 (Arabic Letter Alef with Hamza Above) to U+0627 (Arabic Letter Alef), i.e. it works with non-Latin letters and their accents as well.

You can replace [:Nonspacing Mark:] with [:Mn:].
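
The three steps listed above can also be sketched without a transliterator rule string, using the Normalizer class from the same intl extension together with a PCRE Unicode property. This is an alternative spelling of the same procedure rather than the method the ICU guide shows, and the function name is hypothetical. Letters with no canonical decomposition, such as "ø" or "ß", are left as they are.

```php
<?php
// Assumption: the intl extension is loaded (it provides the Normalizer class).
function strip_nonspacing_marks(string $text): string
{
    // Step 1: canonical decomposition (NFD) splits "ä" into "a" + U+0308
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $text; // input was not valid UTF-8
    }

    // Step 2: remove nonspacing marks (Unicode general category Mn)
    $stripped = preg_replace('/\p{Mn}+/u', '', $decomposed);
    if ($stripped === null) {
        return $text;
    }

    // Step 3: canonical composition (NFC) of what remains
    $composed = Normalizer::normalize($stripped, Normalizer::FORM_C);

    return $composed === false ? $stripped : $composed;
}

if (class_exists('Normalizer')) {
    echo strip_nonspacing_marks('lärm andré'), "\n"; // larm andre
}
```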
