PHP: Replace umlauts with the closest 7-bit ASCII equivalent in a UTF-8 string
What I want to do is remove all accents and umlauts from a string, turning "lärm" into "larm" or "andré" into "andre". I tried to utf8_decode the string and then use strtr on it, but since my source file is saved as a UTF-8 file, I can't enter the ISO-8859-15 characters for all the umlauts: the editor inserts the UTF-8 characters instead.
An obvious solution would be an include that is saved as an ISO-8859-15 file, but surely there is a better way than adding another required include?
echo strtr(utf8_decode($input),
'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ',
'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
UPDATE: Maybe I was a bit inaccurate about what I am trying to do: I do not actually want to remove the umlauts, but to replace them with their closest "one character ASCII" equivalent.
Extended example
A little trick that doesn't require setting locales or having huge translation tables:
The only requirement for it to work properly is to save your files in UTF-8 (as you should already).
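A minimal sketch of one trick that fits this description, assuming the idea is to round-trip the string through HTML entity names and keep only the base letter. The function name `strip_accents_via_entities` and the list of entity suffixes are assumptions made for illustration, not code taken from the answer.

```php
<?php
// Sketch only: the function name and the suffix list are assumptions.
function strip_accents_via_entities(string $text): string
{
    // With the HTML 4.01 entity table, "ä" becomes "&auml;", "é" becomes "&eacute;", etc.
    $encoded = htmlentities($text, ENT_COMPAT, 'UTF-8');

    // Keep only the base letter of accent-style entities ("&auml;" -> "a").
    $stripped = preg_replace(
        '/&([a-zA-Z])(?:uml|acute|grave|circ|tilde|ring|cedil|slash);/',
        '$1',
        $encoded
    );

    // Decode whatever entities are left (e.g. "&amp;", "&szlig;") back to UTF-8.
    return html_entity_decode($stripped, ENT_COMPAT, 'UTF-8');
}

echo strip_accents_via_entities('lärm'), "\n"; // larm
```

Characters whose entity name does not follow the "letter plus accent" pattern, such as ß or æ, pass through this sketch unchanged.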
You can also try this, but you need to have the intl extension (http://php.net/manual/en/book.intl.php) available.
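As a rough illustration of an intl-based approach, the sketch below uses `transliterator_transliterate()` with the ICU transform ID `Any-Latin; Latin-ASCII`. The choice of that ID and the function name `to_ascii_with_intl` are assumptions; the answer's own code may have differed.

```php
<?php
// Sketch only: assumes the intl extension is loaded; the function name is hypothetical.
function to_ascii_with_intl(string $text): string
{
    // "Any-Latin" romanises other scripts, "Latin-ASCII" folds accented Latin letters.
    $result = transliterator_transliterate('Any-Latin; Latin-ASCII', $text);
    if ($result === false) {
        throw new RuntimeException('Transliteration failed: ' . intl_get_error_message());
    }
    return $result;
}

if (extension_loaded('intl')) {
    echo to_ascii_with_intl('andré'), "\n";
}
```

Note that `Latin-ASCII` may expand a character into several ASCII letters (ß becomes "ss"), so it does not strictly satisfy the "one character" requirement from the question.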
Okay, found an obvious solution myself, but it's not the best concerning performance...
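One approach that sidesteps the encoding problem from the question is the array form of `strtr()`, which matches whole byte sequences and therefore works directly on UTF-8 input without `utf8_decode()`. This is a sketch under that assumption, not necessarily the solution the author had in mind; the function name `replace_with_table` is hypothetical and the table is abbreviated.

```php
<?php
// Sketch only: hypothetical helper with an abbreviated replacement table.
// This file must itself be saved as UTF-8 so the array keys are UTF-8 strings.
function replace_with_table(string $text): string
{
    $table = [
        'Š' => 'S', 'š' => 's', 'Ž' => 'Z', 'ž' => 'z',
        'Ä' => 'A', 'ä' => 'a', 'Ö' => 'O', 'ö' => 'o',
        'Ü' => 'U', 'ü' => 'u', 'ß' => 's',
        'É' => 'E', 'é' => 'e', 'È' => 'E', 'è' => 'e',
        'Ç' => 'C', 'ç' => 'c', 'Ñ' => 'N', 'ñ' => 'n',
    ];

    // With an array argument, strtr() replaces each key with its value.
    return strtr($text, $table);
}

echo replace_with_table('lärm'), "\n"; // larm
```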
If you are using WordPress, you can use the built-in function
remove_accents( $string )
https://codex.wordpress.org/Function_Reference/remove_accents
However, I noticed a bug: it doesn't work on a string consisting of a single character.
For Arabic and Persian users, I recommend this way to remove diacritics:
To type diacritics with an Arabic keyboard in Windows editors, you can either type them directly or hold Alt and type the character's code. Note that these are Windows Alt codes (code-page values), not Unicode code points.
These are the codes:
ـَ(0243) ـِ(0246) ـُ(0245) ـً(0240) ـٍ(0242) ـٌ(0241) ـْ(0250) ـّ(0248) ـ(0220)
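A minimal sketch of how the marks listed above could be stripped from a UTF-8 string, assuming a Unicode-aware regular expression is acceptable. The function name `strip_arabic_diacritics` and the chosen code point range are assumptions for illustration.

```php
<?php
// Sketch only: hypothetical helper; the character range is an assumption.
function strip_arabic_diacritics(string $text): string
{
    // U+064B..U+0652 covers fathatan through sukun (including shadda);
    // U+0640 is the tatweel (kashida) elongation character.
    $result = preg_replace('/[\x{064B}-\x{0652}\x{0640}]/u', '', $text);

    // preg_replace() returns null on error, e.g. for invalid UTF-8 input.
    return $result ?? $text;
}

echo strip_arabic_diacritics('مُحَمَّد'), "\n";
```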
I found that this one gives the most consistent results in French and German.
With the meta tag set to utf-8, I have placed it in a function that returns a line from an array of words, and it works perfectly.
The canonical way to do this:
https://unicode-org.github.io/icu/userguide/transforms/general/
NFD; [:Nonspacing Mark:] Remove; NFC.
I am a bit unsure why they have given this example as such when the page also notes
So we will add those. You need the intl extension, which wraps the ICU library.
Example
This transforms U+0623 (Arabic Letter Alef with Hamza Above) to U+0627 (Arabic Letter Alef), i.e. it works with non-Latin letters and their accents as well.
You can replace [:Nonspacing Mark:] with [:Mn:].
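A minimal sketch of how the ICU transform quoted above could be applied from PHP, assuming the intl extension is loaded and that the compound ID is passed to `Transliterator::create()` as-is. The function name `strip_nonspacing_marks` is hypothetical.

```php
<?php
// Sketch only: assumes the intl extension; the function name is hypothetical.
function strip_nonspacing_marks(string $text): string
{
    // Decompose (NFD), drop every nonspacing mark, then recompose (NFC).
    $transliterator = Transliterator::create('NFD; [:Nonspacing Mark:] Remove; NFC');
    if ($transliterator === null) {
        throw new RuntimeException('Could not create transliterator: ' . intl_get_error_message());
    }

    $result = $transliterator->transliterate($text);
    if ($result === false) {
        throw new RuntimeException('Transliteration failed: ' . $transliterator->getErrorMessage());
    }
    return $result;
}

if (extension_loaded('intl')) {
    echo strip_nonspacing_marks('lärm'), "\n";
}
```

Because this only removes combining marks, letters that have no canonical decomposition, such as ø or ß, are left untouched.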