JavaScript: dynamically removing diacritics from Arabic text
How can I dynamically remove Arabic diacritics?
I'm designing a CHM ebook made up of several HTML pages that contain Arabic text. Sometimes the search engine fails to highlight some Arabic words because of their diacritics. Is it possible to use a JavaScript function on page load that strips the diacritics from the Arabic text?
There must be an option to enable them again, so I don't want to remove them from the HTML physically, only temporarily.
The thing is, I don't know where to start or what the right function to use is.
Thank you :)
For example:
Text: الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ
Converted to: الحمد لله رب العالمين
Comments (8)
I wrote this function which handles strings with mixed Arabic and English characters, removing special characters (including diacritics) and normalizing some Arabic characters like converting all ة's into ه's.
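A minimal sketch of what such a function might look like. Only the diacritic removal and the ة to ه conversion come from the description above; the function name `normalizeArabic`, the alef and alef maksura folding, the tatweel removal, and the exact set of characters kept are assumptions made for illustration.

```javascript
// Hypothetical helper: strips diacritics and other special characters
// from mixed Arabic/English text and folds a few Arabic letter variants.
function normalizeArabic(text) {
  return text
    .replace(/[\u064B-\u0652\u0670]/g, "")        // tashkeel and superscript alef
    .replace(/\u0640/g, "")                       // tatweel (kashida), an assumption
    .replace(/[\u0622\u0623\u0625]/g, "\u0627")   // alef variants -> bare alef, an assumption
    .replace(/\u0629/g, "\u0647")                 // teh marbuta -> heh, as described above
    .replace(/\u0649/g, "\u064A")                 // alef maksura -> yeh, an assumption
    // Keep Arabic letters, Arabic-Indic digits, ASCII letters/digits and whitespace.
    .replace(/[^\u0621-\u064A\u0660-\u06690-9A-Za-z\s]/g, "")
    .replace(/\s+/g, " ")                         // collapse runs of whitespace
    .trim();
}

// مَدْرَسَة followed by English text and punctuation
console.log(normalizeArabic("\u0645\u064E\u062F\u0652\u0631\u064E\u0633\u064E\u0629, school!"));
```

Folding letters this way is useful for search keys but changes spelling, so it is better kept away from the text that is actually displayed.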
Try this
http://www.suhailkaleem.com/2009/08/26/remove-diacritics-from-arabic-text-quran/
The code is C# not javascript though.
Still trying to figure out how to achieve this in javascript
EDIT: Apparently it's very easy in JavaScript. The diacritics are stored as separate "letters" (their own code points), so they can be removed quite easily.
Edit: Here is another way to do it using BuckData http://qurandev.github.com/
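To illustrate the point that diacritics are separate code points, here is a small sketch that walks a string character by character and drops the marks. The helper names `isTashkeel` and `stripTashkeel` are made up for this example, and the range U+064B to U+0652 (fathatan through sukun) is an assumption about which marks matter.

```javascript
// True for the eight common tashkeel marks, U+064B (fathatan) through U+0652 (sukun).
function isTashkeel(ch) {
  const cp = ch.codePointAt(0);
  return cp >= 0x064b && cp <= 0x0652;
}

// Rebuilds the string from every code point that is not a tashkeel mark.
function stripTashkeel(text) {
  return Array.from(text).filter((ch) => !isTashkeel(ch)).join("");
}

// الْحَمْدُ written with escapes: 5 letters and 4 marks
const word = "\u0627\u0644\u0652\u062D\u064E\u0645\u0652\u062F\u064F";
console.log(Array.from(word).length);                // 9
console.log(Array.from(stripTashkeel(word)).length); // 5
```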
Here's some JavaScript code that handles removing Arabic diacritics nearly all the time.
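A minimal sketch of one way to do this while keeping the original text recoverable, as the question asks. It is not necessarily the code this answer had in mind. The names `hideDiacritics`, `showDiacritics` and `textNodesUnder` are hypothetical, the range U+064B to U+0652 is an assumption, and the syntax assumes a reasonably modern JavaScript engine (an older embedded help viewer may need ES5 equivalents).

```javascript
// Marks stripped in this sketch (fathatan through sukun); widen the range as needed.
const TASHKEEL = /[\u064B-\u0652]/g;

// Remembers the original text of every node we changed, so nothing is lost.
const originals = new WeakMap();

// Works on anything with a string `textContent`; DOM text nodes qualify.
function hideDiacritics(node) {
  if (!originals.has(node)) {
    originals.set(node, node.textContent);
  }
  node.textContent = originals.get(node).replace(TASHKEEL, "");
}

// Puts the saved text back and forgets the saved copy.
function showDiacritics(node) {
  if (originals.has(node)) {
    node.textContent = originals.get(node);
    originals.delete(node);
  }
}

// Browser-only helper: collects the text nodes under a root element.
function textNodesUnder(root) {
  const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
  const nodes = [];
  while (walker.nextNode()) nodes.push(walker.currentNode);
  return nodes;
}
```

In a page, something like `textNodesUnder(document.body).forEach(hideDiacritics)` on load would strip the marks, and running the same loop with `showDiacritics` would restore them. The HTML source is never modified.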
Use this regex to catch all tashkeel:
[\u0610-\u061A\u064B-\u065F]
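Since the original problem is that search fails to match vocalized words, a character class like this can also be used to compare text with the marks ignored on both sides. A short sketch, assuming the class is meant to cover U+0610 to U+061A and U+064B to U+065F; the function name `containsIgnoringTashkeel` is made up for the example.

```javascript
// Character class from the answer above, written with Unicode escapes.
const TASHKEEL = /[\u0610-\u061A\u064B-\u065F]/g;

// Returns true when `query` occurs in `text` once tashkeel is ignored on both sides.
function containsIgnoringTashkeel(text, query) {
  const plainText = text.replace(TASHKEEL, "");
  const plainQuery = query.replace(TASHKEEL, "");
  return plainText.includes(plainQuery);
}

// الْحَمْدُ لِلَّهِ searched for الحمد
const verse = "\u0627\u0644\u0652\u062D\u064E\u0645\u0652\u062F\u064F \u0644\u0650\u0644\u0651\u064E\u0647\u0650";
console.log(containsIgnoringTashkeel(verse, "\u0627\u0644\u062D\u0645\u062F"));
```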
I tried the following solution and it works fine:
Reference: https://www.overdoe.com/javascript/2020/06/18/arabic-diacritics.html
This site has some routines for JavaScript Unicode normalization which could be used to do what you're attempting. If nothing else, it could provide a good starting point.
If you can preprocess the data, Python has good Unicode routines that make easy work of these sorts of transformations. This might be a good option if you can preprocess your CHM file to produce a separate index file which could then be merged into your CHM.
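For the normalization route in JavaScript, current engines ship `String.prototype.normalize` and Unicode property escapes, so a separate library may not be needed. A rough sketch, assuming an engine with ES2018 support; the function name `stripCombiningMarks` is hypothetical, and this is not the code from the site mentioned above.

```javascript
// Decomposes the text (NFD), removes every nonspacing mark (\p{Mn}),
// then recomposes whatever is left (NFC).
function stripCombiningMarks(text) {
  return text.normalize("NFD").replace(/\p{Mn}/gu, "").normalize("NFC");
}

// الْحَمْدُ becomes الحمد
console.log(stripCombiningMarks("\u0627\u0644\u0652\u062D\u064E\u0645\u0652\u062F\u064F"));
```

Note that this is broader than a tashkeel-only regex: it also turns أ (U+0623) into a bare ا, because the hamza decomposes into a combining mark, and it strips accents from Latin text as well.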
A shorter approach to remove the Arabic diacritics (either the 8 Basic diacritics or the full 52 diacritics) could be as follows:
Remove Basic Diacritics
Remove All Arabic Diacritics
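A sketch of what the two variants could look like. The basic set is taken to be the eight marks U+064B to U+0652. For the full set, the ranges below add up to 52 code points, but which marks the answer actually counted is an assumption, as are the function names.

```javascript
// The 8 basic diacritics: fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun.
const BASIC_DIACRITICS = /[\u064B-\u0652]/g;

// A wider set of 52 combining marks in the Arabic block (assumed selection).
const ALL_DIACRITICS =
  /[\u0610-\u061A\u064B-\u065F\u0670\u06D6-\u06DC\u06DF-\u06E4\u06E7\u06E8\u06EA-\u06ED]/g;

function removeBasicDiacritics(text) {
  return text.replace(BASIC_DIACRITICS, "");
}

function removeAllDiacritics(text) {
  return text.replace(ALL_DIACRITICS, "");
}

// رَحْمٰن contains a superscript alef (U+0670), which only the wider set removes.
const sample = "\u0631\u064E\u062D\u0652\u0645\u0670\u0646";
console.log(removeBasicDiacritics(sample));
console.log(removeAllDiacritics(sample));
```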
Here is another approach based on the Arabic Unicode block:
Some letters could still be considered to have diacritics, such as ژ "jeh", which looks like ر "reh". But since it is given a different fundamental name in Arabic, I did not strip its "extra markings" to turn it into "reh". That happened in a few cases, such as with ڡ "feh" and ڢ "dot below feh", but ڤ and ڦ were given fundamental names, while ڥ, for example, was not. I'm not sure of the best way to approach those. I don't know the exact definition of what is and is not a diacritic to a 100% degree, but this should be a good start. Also, the "hamza + letter" ligatures were converted into hamza and the letter separately.
If you know how to improve this, please comment and add a fix if you'd like.
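One way to check how Unicode itself treats such letters is to look at their canonical decomposition. The sketch below is only an inspection aid with a made-up helper name, `decompose`, and is not the mapping used in this answer. It shows that ژ has no decomposition (Unicode treats it as a letter in its own right), while the hamza forms split into a base letter plus a combining hamza.

```javascript
// Lists the code points (as 4-digit hex) that a character decomposes into under NFD.
function decompose(ch) {
  return Array.from(ch.normalize("NFD"), (c) =>
    c.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

console.log(decompose("\u0698")); // jeh: stays a single code point, 0698
console.log(decompose("\u0623")); // alef with hamza above: 0627 followed by 0654
```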