PHP: Replace umlauts with the closest 7-bit ASCII equivalent in a UTF-8 string

Posted 2024-07-06 15:57:35

What I want to do is remove all accents and umlauts from a string, turning "lärm" into "larm" or "andré" into "andre". I tried to utf8_decode the string and then use strtr on it, but since my source file is saved as a UTF-8 file, I can't enter the ISO-8859-15 characters for all the umlauts: the editor inserts the UTF-8 characters.

Obviously, one solution would be to include a file saved as ISO-8859-15, but surely there is a better way than adding another required include?

echo strtr(utf8_decode($input), 
           'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ',
           'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');

UPDATE: Maybe I was a bit imprecise about what I am trying to do: I do not actually want to remove the umlauts, but to replace them with their closest "one character ASCII" equivalent.

冷情 2024-07-13 15:57:35

iconv("utf-8","ascii//TRANSLIT",$input);
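
A note on the one-liner above: as far as I know, what //TRANSLIT produces depends on the iconv implementation and on the current LC_CTYPE locale, so the same call can give, for example, "andre", "andr'e" or "andr?" on different systems. A minimal sketch, where the function name to_ascii_iconv and the locale en_US.UTF-8 are assumptions made for illustration and are not part of the original answer:

```php
<?php
/**
 * Hypothetical helper: transliterate a UTF-8 string to ASCII with iconv.
 * Returns false when iconv cannot convert the input.
 */
function to_ascii_iconv($input)
{
    // Assumed locale; it must be installed on the system. With the default
    // "C" locale, glibc tends to replace accented letters with "?".
    setlocale(LC_CTYPE, 'en_US.UTF-8');

    return iconv('UTF-8', 'ASCII//TRANSLIT', $input);
}

// The exact result varies by platform, so no expected output is shown.
var_dump(to_ascii_iconv('lärm'));
```

If the output has to be identical on every machine, one of the table-based or intl-based answers is probably the safer choice.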

烟酉 2024-07-13 15:57:35

A little trick that doesn't require setting locales or having huge translation tables:

function Unaccent($string)
{
    if (strpos($string = htmlentities($string, ENT_QUOTES, 'UTF-8'), '&') !== false)
    {
        $string = html_entity_decode(preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|tilde|uml);~i', '$1', $string), ENT_QUOTES, 'UTF-8');
    }

    return $string;
}

The only requirement for it to work properly is to save your files in UTF-8 (as you should already).

为你拒绝所有暧昧 2024-07-13 15:57:35

You can also try this:

$string = "Fóø Bår";
$transliterator = Transliterator::createFromRules(':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;', Transliterator::FORWARD);
echo $normalized = $transliterator->transliterate($string);

Note that you need the intl extension (http://php.net/manual/en/book.intl.php) to be available.

极致的悲 2024-07-13 15:57:35

Okay, found an obvious solution myself, but it's not the best concerning performance...

echo strtr(utf8_decode($input), 
           utf8_decode('ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
           'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
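
The utf8_decode round trip can probably be avoided: when strtr is given an array instead of two strings, it replaces whole substrings, so multi-byte UTF-8 keys work directly (utf8_decode is also deprecated as of PHP 8.2). A minimal sketch, assuming both the source file and the input are UTF-8 in composed (NFC) form; the map is deliberately incomplete and the name replace_accents is made up for illustration:

```php
<?php
// Illustrative, incomplete map from accented letters to ASCII letters.
// Extend it with whatever characters your data actually contains.
$map = [
    'ä' => 'a', 'ö' => 'o', 'ü' => 'u', 'ß' => 's',
    'é' => 'e', 'è' => 'e', 'à' => 'a', 'ç' => 'c',
    'Ä' => 'A', 'Ö' => 'O', 'Ü' => 'U', 'É' => 'E',
];

/**
 * Hypothetical helper: replace every key of $map found in $input.
 */
function replace_accents(string $input, array $map): string
{
    // With an array argument, strtr() matches whole keys byte for byte,
    // so no conversion to a single-byte encoding is needed.
    return strtr($input, $map);
}

echo replace_accents('lärm', $map), "\n";  // larm
echo replace_accents('andré', $map), "\n"; // andre
```

Characters that are not in the map pass through unchanged, so the table still has to cover everything you expect to see.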

魔法唧唧 2024-07-13 15:57:35

If you are using WordPress, you can use the built-in function remove_accents( $string ):

https://codex.wordpress.org/Function_Reference/remove_accents

However, I noticed a bug: it does not work on strings consisting of a single character.

娇柔作态 2024-07-13 15:57:35

For Arabic and Persian users, I recommend this way of removing diacritics:

    $diacritics = array('َ','ِ','ً','ٌ','ٍ','ّ','ْ','ـ');
    $search_txt = str_replace($diacritics, '', $search_txt); // $search_txt holds the text to clean

To type diacritics with an Arabic keyboard in Windows editors, you can either type them directly or hold Alt and type the character's code on the numeric keypad. Note that these are Windows Alt codes from the Arabic code page, not Unicode code points.
These are the codes:

ـَ(0243) ـِ(0246) ـُ(0245) ـً(0240) ـٍ(0242) ـٌ(0241) ـْ(0250) ـّ(0248) ـ
ـ(0220)

临走之时 2024-07-13 15:57:35

I found that this one gives the most consistent results in French and German.
With the meta tag set to utf-8, I placed it in a function that returns a line from an array of words, and it works perfectly.

htmlentities($line, ENT_SUBSTITUTE, 'utf-8')

野却迷人 2024-07-13 15:57:35

The canonical way to do this:

Obtain the Normalization Form Canonical Decomposition of the text. See https://unicode.org/reports/tr15/ for Unicode Normalization Forms.
Remove nonspacing marks.
Obtain the Normalization Form Canonical Composition of the remaining text.

https://unicode-org.github.io/icu/userguide/transforms/general/

For example, to remove accents from characters, use the following transform:

NFD; [:Nonspacing Mark:] Remove; NFC.

I am a bit unsure why they give the example in this form, since the page also notes:

each transform rule consists of two colons followed by a transform name.

So we will add those. You need the intl extension, which wraps the ICU library.

$t = \Transliterator::createFromRules(':: NFD; ::[:Nonspacing Mark:] Remove; :: NFC;');

Example

print $t->transliterate('أ');

This transforms U+0623 (Arabic Letter Alef with Hamza Above) to U+0627 (Arabic Letter Alef), i.e. it works with non-Latin letters and their accents as well.

You can replace [:Nonspacing Mark:] with [:Mn:].
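
The three steps above can also be sketched without Transliterator, using the intl extension's Normalizer class plus a PCRE Unicode property. This is a swapped-in technique, not what the ICU page shows, and the function name strip_marks is hypothetical:

```php
<?php
/**
 * Hypothetical helper: NFD, remove nonspacing marks, NFC.
 * Requires the intl extension (for Normalizer) and valid UTF-8 input.
 */
function strip_marks(string $text): string
{
    // Step 1: canonical decomposition, e.g. "é" becomes "e" + U+0301.
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $text; // input was not valid UTF-8; leave it untouched
    }

    // Step 2: drop every nonspacing mark (Unicode general category Mn).
    $stripped = preg_replace('/\p{Mn}+/u', '', $decomposed) ?? $decomposed;

    // Step 3: canonical composition of whatever is left.
    $composed = Normalizer::normalize($stripped, Normalizer::FORM_C);

    return $composed === false ? $stripped : $composed;
}

echo strip_marks('andré'), "\n"; // andre
echo strip_marks('lärm'), "\n";  // larm
```

Letters such as ø, ß or æ have no canonical decomposition, so this approach leaves them unchanged; they still need an explicit mapping such as the Latin-ASCII transform used in another answer.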
