JavaScript: dynamically removing diacritics from Arabic text

Posted on 2024-10-20 20:47:10, 342 characters, 7 views, 0 comments

How to dynamically remove Arabic diacritics

I'm designing an ebook in CHM format, made of multiple HTML pages that contain Arabic text.
Sometimes the search engine won't highlight some Arabic words because of their diacritics. Is it possible, when the page loads, to use a JavaScript function that strips the diacritics from the Arabic text?
There must be an option to enable them again, so I don't want to remove them from the HTML permanently, only temporarily.

The thing is, I don't know where to start or which function to use.

Thank you :)

For Example

Text : الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ
converted to : الحمد لله رب العالمين

灰色世界里的红玫瑰 2024-10-27 20:47:10

I wrote this function which handles strings with mixed Arabic and English characters, removing special characters (including diacritics) and normalizing some Arabic characters like converting all ة's into ه's.

var normalize_text = function(text) {

  // remove special characters: everything except Arabic letters, Arabic-Indic digits,
  // Latin letters, spaces and ASCII digits (this is what drops the diacritics)
  text = text.replace(/([^\u0621-\u063A\u0641-\u064A\u0660-\u0669a-zA-Z 0-9])/g, '');

  // normalize Arabic
  text = text.replace(/(آ|إ|أ)/g, 'ا');
  text = text.replace(/(ة)/g, 'ه');
  text = text.replace(/(ئ|ؤ)/g, 'ء');
  text = text.replace(/(ى)/g, 'ي');

  // convert Arabic-Indic digits (U+0660 to U+0669) to their ASCII counterparts;
  // replace() returns a new string, so the result has to be assigned back
  text = text.replace(/[\u0660-\u0669]/g, function(digit) {
    return String.fromCharCode(digit.charCodeAt(0) - 0x660 + 48);
  });

  return text;
};

<input value="الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ" type="text" id="input">
<button onclick="document.getElementById('input').value = normalize_text(document.getElementById('input').value)">Normalize</button>
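
The button above overwrites the field's value, while the question asks for a way to bring the diacritics back. A minimal sketch of one way to do that, assuming the original string of each node is cached before stripping. `createDiacriticsToggle`, `strip` and `restore` are hypothetical names, and the range U+064B to U+0652 covers only the eight basic marks:

```javascript
// Hypothetical helper: strips diacritics from a list of text-bearing nodes
// and remembers the original strings so they can be restored later.
const BASIC_TASHKEEL = /[\u064B-\u0652]/g;

function createDiacriticsToggle(nodes) {
  const originals = new Map(); // node -> original nodeValue

  return {
    strip() {
      for (const node of nodes) {
        // Cache the original only once, so repeated strip() calls stay safe
        if (!originals.has(node)) originals.set(node, node.nodeValue);
        node.nodeValue = originals.get(node).replace(BASIC_TASHKEEL, '');
      }
    },
    restore() {
      for (const [node, value] of originals) node.nodeValue = value;
      originals.clear();
    },
  };
}

// Works with any object that exposes a writable `nodeValue`, such as DOM text nodes.
const fakeNode = { nodeValue: '\u0631\u064E\u0628\u0651\u0650' }; // رَبِّ
const toggle = createDiacriticsToggle([fakeNode]);
toggle.strip();
console.log(fakeNode.nodeValue); // رب
toggle.restore();
```

In a browser, the text nodes could be collected with `document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT)` and passed in as the `nodes` array.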

好久不见√ 2024-10-27 20:47:10

Try this

Text : الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ
converted to : الحمد لله رب العالمين

http://www.suhailkaleem.com/2009/08/26/remove-diacritics-from-arabic-text-quran/

The code is C#, not JavaScript, though.
I'm still trying to figure out how to achieve this in JavaScript.

EDIT: Apparently it's very easy in JavaScript. The diacritics are stored as separate "letters", so they can be removed quite easily.

var CHARCODE_SHADDA = 1617;
var CHARCODE_SUKOON = 1618;
var CHARCODE_SUPERSCRIPT_ALIF = 1648;
var CHARCODE_TATWEEL = 1600;
var CHARCODE_ALIF = 1575;

function isCharTashkeel(letter)
{
    if (typeof(letter) == "undefined" || letter == null)
        return false;

    var code = letter.charCodeAt(0);
    //1648 - superscript alif
    //1619 - madd: ~
    //1611-1631 - tashkeel range, U+064B (fathatan) through U+065F
    return (code == CHARCODE_TATWEEL || code == CHARCODE_SUPERSCRIPT_ALIF || (code >= 1611 && code <= 1631));
}

function stripTashkeel(input)
{
    var output = "";
    //todo consider using a stringbuilder to improve performance
    for (var i = 0; i < input.length; i++)
    {
        var letter = input.charAt(i);
        if (!isCharTashkeel(letter)) //keep everything that is not tashkeel
            output += letter;
    }

    return output;
}
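
Because the diacritics are stored as separate characters, another option for the highlighting problem is to leave the page text untouched and make the search pattern tolerant of marks instead. A minimal sketch, assuming the search term is typed without diacritics. `buildTolerantPattern` is a hypothetical name, and the tolerated set (U+064B to U+065F plus U+0670) is an assumption:

```javascript
// Optional run of combining marks allowed after every letter of the term
const MARKS = '[\\u064B-\\u065F\\u0670]*';

function buildTolerantPattern(term) {
  const body = [...term]
    // Escape regex metacharacters, then allow marks after each character
    .map((ch) => ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&') + MARKS)
    .join('');
  return new RegExp(body, 'g');
}

// رَبِّ الْعَالَمِينَ written with escapes so the marks stay visible
const text = '\u0631\u064E\u0628\u0651\u0650 ' +
  '\u0627\u0644\u0652\u0639\u064E\u0627\u0644\u064E\u0645\u0650\u064A\u0646\u064E';

const match = buildTolerantPattern('\u0631\u0628').exec(text); // search for رب
console.log(match.index, match[0].length); // 0 5
```

The match covers the letters together with their marks, so `text.slice(match.index, match.index + match[0].length)` is the span a highlighter would wrap.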

Edit: Here is another way to do it, using BuckData: http://qurandev.github.com/

Advantages:
- Buck uses less bandwidth.
- In JavaScript, you can search through the entire Buck Quran text in one shot, which is intuitive compared to Arabic search.
- Buck to Arabic and Arabic to Buck is a simple JS call. Play with a live sample here: http://jsfiddle.net/BrxJP/
- You can strip out all vowels from Buck text in a few milliseconds. Why do this? You can search in JavaScript while ignoring the tashkeel differences (Fathah, Dammah, Kasrah), which leads to more hits.
- Regex plus Buck text can lead to great optimizations. All the searches can be run locally: http://qurandev.appspot.com

How is the data generated? It is just a one-to-one mapping using: http://corpus.quran.com/java/buckwalter.jsp
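
To illustrate why stripping vowels from Buckwalter text is cheap: in that transliteration the vowel marks become plain ASCII characters that no base letter uses. A minimal sketch with a deliberately partial table that covers only the sample word. `toBuckwalter` and `stripBuckVowels` are hypothetical names, not part of BuckData:

```javascript
// Partial Arabic -> Buckwalter table (assumption: only what the sample needs)
const BUCK = {
  '\u0627': 'A', // alif
  '\u0644': 'l', // lam
  '\u062D': 'H', // hah
  '\u0645': 'm', // meem
  '\u062F': 'd', // dal
  '\u064E': 'a', // fatha
  '\u064F': 'u', // damma
  '\u0650': 'i', // kasra
  '\u0651': '~', // shadda
  '\u0652': 'o', // sukun
};

function toBuckwalter(arabic) {
  // Characters missing from the table pass through unchanged
  return [...arabic].map((ch) => BUCK[ch] ?? ch).join('');
}

function stripBuckVowels(buck) {
  // Tanween (F, N, K), short vowels (a, u, i), shadda (~) and sukun (o)
  return buck.replace(/[FNKaui~o]/g, '');
}

const word = '\u0627\u0644\u0652\u062D\u064E\u0645\u0652\u062F\u064F'; // الْحَمْدُ
console.log(toBuckwalter(word));                  // AloHamodu
console.log(stripBuckVowels(toBuckwalter(word))); // AlHmd
```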

情痴 2024-10-27 20:47:10

Here's some JavaScript code that handles removing Arabic diacritics in nearly all cases.

var arabicNormChar = {
    'ك': 'ک', 'ﻷ': 'لا', 'ؤ': 'و', 'ى': 'ی', 'ي': 'ی', 'ئ': 'ی', 'أ': 'ا', 'إ': 'ا', 'آ': 'ا', 'ٱ': 'ا', 'ٳ': 'ا', 'ة': 'ه', 'ء': '', 'ِ': '', 'ْ': '', 'ُ': '', 'َ': '', 'ّ': '', 'ٍ': '', 'ً': '', 'ٌ': '', 'ٓ': '', 'ٰ': '', 'ٔ': '', '�': ''
}

var simplifyArabic  = function (str) {
    return str.replace(/[^\u0000-\u007E]/g, function(a){ 
        var retval = arabicNormChar[a]
        if (retval == undefined) {retval = a}
        return retval; 
    }).normalize('NFKD').toLowerCase();
}

//now you can use simplifyArabic(str) on Arabic strings to remove the diacritics

Note: you may override the arabicNormChar to your own preferences.

空城旧梦 2024-10-27 20:47:10

Use this regular expression to match all tashkeel:

[ًًٟٟ]
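
Character classes built from literal combining marks are easy to garble when copied, because the marks attach to neighbouring characters. A sketch of the same idea written with escapes, assuming the intended range is U+064B to U+065F. `TASHKEEL` and `stripTashkeelRange` are hypothetical names:

```javascript
// Assumed range: U+064B (fathatan) through U+065F
const TASHKEEL = /[\u064B-\u065F]/g;

function stripTashkeelRange(s) {
  return s.replace(TASHKEEL, '');
}

// الْحَمْدُ written with escapes
const sample = '\u0627\u0644\u0652\u062D\u064E\u0645\u0652\u062F\u064F';
console.log(stripTashkeelRange(sample)); // الحمد
```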

孤独难免 2024-10-27 20:47:10

I tried the following solution and it works fine:

const str = 'الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ';
const withoutDiacs = str.replace(/([^\u0621-\u063A\u0641-\u064A\u0660-\u0669a-zA-Z 0-9])/g, '');
console.log(withoutDiacs); //الحمد لله رب العالمين

Reference: https://www.overdoe.com/javascript/2020/06/18/arabic-diacritics.html

扶醉桌前 2024-10-27 20:47:10

This site has some routines for Javascript Unicode normalization which could be used to do what you're attempting. If nothing else it could provide a good starting point.

If you can preprocess the data, Python has good Unicode routines that make easy work of these sorts of transformations. This might be a good option if you can preprocess your CHM file to produce a separate index file, which could then be merged into your CHM:

import unicodedata

def _strip(text):
    # Decompose to NFD, then drop every nonspacing mark (category Mn)
    return ''.join(c for c in unicodedata.normalize('NFD', text)
                   if unicodedata.category(c) != 'Mn')

composed = ('\xcd\xf1\u0163\u0115\u0155\u0148\u0101\u0163\u0129\u014d'
            '\u0146\u0105\u013c\u012d\u017e\u0119')

print(_strip(composed))  # Internationalize
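
The same idea carries over to JavaScript, which is what the question asks about: `String.prototype.normalize` and the `\p{Mn}` Unicode property escape (with the `u` flag) are built in. A minimal sketch, where `stripMarks` is a hypothetical name. Note that NFD also splits letters such as أ into a base letter plus a combining hamza, so this removes more than tashkeel alone:

```javascript
function stripMarks(text) {
  // Decompose to NFD, then drop every nonspacing mark (general category Mn)
  return text.normalize('NFD').replace(/\p{Mn}/gu, '');
}

// First five characters of the Python sample above
console.log(stripMarks('\u00CD\u00F1\u0163\u0115\u0155')); // Inter

// U+0623 (alif with hamza above) followed by a fatha collapses to a bare alif
console.log(stripMarks('\u0623\u064E') === '\u0627'); // true
```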

灵芸 2024-10-27 20:47:10

A shorter approach to remove the Arabic diacritics (either the 8 Basic diacritics or the full 52 diacritics) could be as follows:

Remove Basic Diacritics

function removeTashkeelBasic(s) {return s.replace(/[\u064B-\u0652]/g,'');}

//===================
//     Test Cases
//===================
console.log(removeTashkeelBasic('حِسَابٌ وَحِسَابًا مِنْ ثَلَاثُمِئَةِ رِيَالٍ قَطَرِيٍّ'));
console.log(removeTashkeelBasic('بِسْمِ ٱللَّٰهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ'));

Remove All Arabic Diacritics

function removeTashkeelAll(s) {return s.replace(/[\u0610-\u061A\u064B-\u065F\u0670\u06D6-\u06DC\u06DF-\u06E4\u06E7\u06E8\u06EA-\u06ED]/g,'');}

//===================
//     Test Cases
//===================
console.log(removeTashkeelAll('حِسَابٌ وَحِسَابًا مِنْ ثَلَاثُمِئَةِ رِيَالٍ قَطَرِيٍّ'));
console.log(removeTashkeelAll('بِسْمِ ٱللَّٰهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ'));
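
The counts quoted in this answer (8 and 52) can be checked by testing every code point in the Arabic block against each character class. A sketch, assuming the two classes cover the code-point ranges written out below. `BASIC`, `ALL` and `countMatches` are hypothetical names:

```javascript
// Assumed ranges for the basic and the full set of diacritics
const BASIC = /[\u064B-\u0652]/;
const ALL = /[\u0610-\u061A\u064B-\u065F\u0670\u06D6-\u06DC\u06DF-\u06E4\u06E7\u06E8\u06EA-\u06ED]/;

function countMatches(re) {
  let count = 0;
  // Walk the Arabic block, U+0600 through U+06FF
  for (let cp = 0x0600; cp <= 0x06FF; cp++) {
    if (re.test(String.fromCodePoint(cp))) count++;
  }
  return count;
}

console.log(countMatches(BASIC), countMatches(ALL)); // 8 52
```

The regexes here are deliberately not global, so `test` carries no `lastIndex` state between calls.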

心安伴我暖 2024-10-27 20:47:10

Here is another approach based on the Arabic Unicode block:

const map = {
  'آ': 'ا',
  'أ': 'ا',
  'إ': 'ا',
  'ا': 'ا',
  'ٱ': 'ا',
  'ٲ': 'ا',
  'ٳ': 'ا',
  'ؤ': 'و',
  'ئ': 'ى',
  'ؽ': 'ؽ',
  'ؾ': 'ؾ',
  'ؿ': 'ؿ',
  'ي': 'ى',
  'ب': 'ب',
  'ت': 'ت',
  'ؠ': 'ؠ',
  'ة': 'ه',
  'ث': 'ث',
  'ج': 'ج',
  'ح': 'ح',
  'خ': 'خ',
  'د': 'د',
  'ذ': 'ذ',
  'ر': 'ر',
  'ز': 'ز',
  'س': 'س',
  'ش': 'ش',
  'ص': 'ص',
  'ض': 'ض',
  'ط': 'ط',
  'ظ': 'ظ',
  'ع': 'ع',
  'غ': 'غ',
  'ػ': 'ک',
  'ؼ': 'ک',
  'ف': 'ف',
  'ق': 'ق',
  'ك': 'ك',
  'ګ': 'ك',
  'ڬ': 'ك',
  'ڭ': 'ڭ',
  'ڮ': 'ك',
  'ل': 'ل',
  'م': 'م',
  'ن': 'ن',
  'ه': 'ه',
  'و': 'و',
  'ى': 'ى',
  'ٸ': 'ى',
  'ٵ': 'ءا', // hamza alef?
  'ٶ': 'ءو', // hamza waw?
  'ٹ': 'ٹ',
  'ٺ': 'ٺ',
  'ٻ': 'ٻ',
  'ټ': 'ت',
  'ٽ': 'ت',
  'پ': 'پ',
  'ٿ': 'ٿ',
  'ڀ': 'ڀ',
  'ځ': 'ءح',
  'ڂ': 'ح',
  'ڃ': 'ڃ',
  'ڄ': 'ڄ',
  'څ': 'ح',
  'چ': 'چ',
  'ڇ': 'ڇ',
  'ڈ': 'ڈ',
  'ډ': 'د',
  'ڊ': 'د',
  'ڋ': 'د',
  'ڌ': 'ڌ',
  'ڍ': 'ڍ',
  'ڎ': 'ڎ',
  'ڏ': 'د',
  'ڐ': 'د',
  'ڑ': 'ڑ',
  'ڒ': 'ر',
  'ړ': 'ر',
  'ڔ': 'ر',
  'ڕ': 'ر',
  'ږ': 'ر',
  'ڗ': 'ر',
  'ژ': 'ژ',
  'ڙ': 'ڙ',
  'ښ': 'س',
  'ڛ': 'س',
  'ڜ': 'س',
  'ڝ': 'ص',
  'ڞ': 'ص',
  'ڟ': 'ط',
  'ڠ': 'ع',
  'ڡ': 'ڡ',
  'ڢ': 'ڡ',
  'ڣ': 'ڡ',
  'ڤ': 'ڤ',
  'ڥ': 'ڡ',
  'ڦ': 'ڦ',
  'ڧ': 'ق',
  'ڨ': 'ق',
  'ک': 'ک',
  'ڪ': 'ڪ',
  'گ': 'گ',
  'ڰ': 'گ',
  'ڱ': 'ڱ',
  'ڲ': 'گ',
  'ڳ': 'ڳ',
  'ڴ': 'گ',
  'ڵ': 'ل',
  'ڶ': 'ل',
  'ڷ': 'ل',
  'ڸ': 'ل',
  'ڹ': 'ن',
  'ں': 'ں',
  'ڻ': 'ڻ',
  'ڼ': 'ن',
  'ڽ': 'ن',
  'ھ': 'ه',
  'ڿ': 'چ',
  'ۀ': 'ه',
  'ہ': 'ہ',
  'ۂ': 'ءہ',
  'ۃ': 'ہ',
  'ۄ': 'و',
  'ۅ': 'ۅ',
  'ۆ': 'ۆ',
  'ۇ': 'ۇ',
  'ۈ': 'ۈ',
  'ۉ': 'ۉ',
  'ۊ': 'و',
  'ۋ': 'ۋ',
  'ی': 'ی',
  'ۍ': 'ي',
  'ێ': 'ي',
  'ۏ': 'و',
  'ې': 'ې',
  'ۑ': 'ي',
  'ے': 'ے',
  'ۓ': 'ے',
  'ە': 'ە',
  'ۺ': 'ش',
  'ۻ': 'ض',
  'ۼ': 'ۼ',
  'ۿ': 'ه'
}

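function removeDiacritics(text) {
  const result = []
  // Spreading iterates by code point, so each combining mark is its own symbol
  for (const symbol of [...text]) {
    if (map[symbol]) {
      // Known letter: emit its normalized form from the map
      result.push(map[symbol])
    } else if (!/[\u0600-\u06FF]/.test(symbol)) {
      // Outside the Arabic block (spaces, Latin letters, digits): keep as-is
      result.push(symbol)
    }
    // Anything else in the Arabic block (diacritics, unmapped signs) is dropped
  }
  return result.join('')
}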

Some letters could still be considered to carry diacritics, such as ژ "jeh", which looks like ر "reh". But since it has a different fundamental name in Arabic, I did not strip its "extra markings" to turn it into "reh". The same happened in a few other cases, such as ڡ "feh" and ڢ "feh with dot below": ڤ and ڦ were given fundamental names, but ڥ, for example, was not. I'm not sure of the best way to approach those. I don't know the exact definition of what is and what is not a diacritic, but this should be a good start.

Also, the "hamza + letter" ligatures were converted into hamza and the letter separately.

If you know how to improve this, please comment and add a fix if you'd like.
