Is there a way to reduce text from Unicode to ASCII?

Posted 2024-11-01 18:49:56

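What I need is, for each ASCII character, a list of the Unicode characters that are equivalent to it.

The problem is that programs such as Microsoft Excel and Word insert non-ASCII double quotes, single quotes, dashes, and so on as people type in documents. I want to store this text in a database field of type "varchar", which requires single-byte characters.

For the purpose of storing ASCII (single-byte) text, some of those Unicode characters can be considered equivalent to, or similar enough to, a particular ASCII character that replacing them with the ASCII equivalent would be fine.

I'd like a simple function such as MapToASCII that converts Unicode text to its ASCII equivalent and lets me specify a replacement character for any Unicode character that is not similar to any ASCII character.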

才能让你更想念 2024-11-08 18:49:57

The Win32 API WideCharToMultiByte can be used for this conversion (Unicode to ANSI). Use CP_ACP as the first parameter. Something like that would likely be better than trying to build your own mapping function.

Edit: At the risk of sounding like I am trying to promote this as a solution against the OP's wishes, it seems worth pointing out that this API does much (all?) of what is being asked for. The goal is to map (I think) a Unicode string as far as possible to "ANSI" (where ANSI may be something of a moving target in this case). An additional requirement is to be able to specify an alternative character for those that cannot be mapped. The following example does this. It "converts" a Unicode string to char and uses an underscore (the second-to-last parameter) for the characters that cannot be converted.

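/* Needs <windows.h>, <stdio.h> and <string.h>. */
char ac[64];
int ret = WideCharToMultiByte( CP_ACP, 0, L"abc個חあЖdef", -1,
                               ac, sizeof( ac ), "_", NULL );
if ( ret > 0 )  /* 0 means the conversion failed */
  for ( size_t i = 0; i < strlen( ac ); i++ )
    printf( "%c %02x\n", ac[i], (unsigned char) ac[i] );  /* cast avoids sign extension */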

绿萝 2024-11-08 18:49:57

A highly relevant question is here: Replacing unicode punctuation with ASCII approximations

Although the answer there is insufficient, it gave me an idea. I could map each of the Unicode code points in the Basic Multilingual Plane (0) to an equivalent ASCII character, if one exists. The following C# code will help by creating an HTML form in which you can type a replacement character for each value.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Globalization;
using System.IO;

namespace UnicodeCharacterCategorizer
{
    class Program
    {
        static void Main(string[] args)
        {
            string output_filename = args.Length > 0 ? args[0] : "output.htm"; //use the first command-line argument as the filename, or fall back to a default
            Dictionary<UnicodeCategory,List<char>> category_character_sets = new Dictionary<UnicodeCategory,List<char>>();
            foreach (UnicodeCategory c in Enum.GetValues(typeof(UnicodeCategory)))
                category_character_sets.Add( c, new List<char>() );
            for (int i = 0; i <= 0xFFFF; i++)
            {
                if (i >= 0xD800 && i <= 0xDFFF) continue; //Skip ranges reserved for high/low surrogate pairs.
                char c = (char)i;
                UnicodeCategory category = char.GetUnicodeCategory( c );
                category_character_sets[category].Add( c );
            }
            StringBuilder file_data = new StringBuilder( @"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitional//EN"" ""http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd""><html xmlns=""http://www.w3.org/1999/xhtml""><head><title>Unicode Category Character Sets</title><style>.categoryblock{border:3px solid black;margin-bottom:10px;padding:5px;} .characterblock{display:inline-block;border:1px solid grey;padding:5px;margin-right:5px;} .character{display:inline-block;font-weight:bold;background-color:#ffeeee} .numericvalue{color:blue;}</style></head><body><form id=""charactermap"">" );
            foreach (KeyValuePair<UnicodeCategory,List<char>> entry in category_character_sets)
            {
                file_data.Append( @"<div class=""categoryblock""><h1>" + entry.Key.ToString() + ":</h1><br />" );
                foreach (char c in entry.Value)
                {
                    string hex_value = ((int)c).ToString( "x" );
                    file_data.Append( @"<div class=""characterblock""><span class=""character"">&#x" + hex_value + @";</span><br /><span class=""numericvalue"">" + hex_value + @"</span><br /><input type=""text"" name=""r_" + hex_value + @""" /></div>" );
                }
                file_data.Append( "</div>" );
            }
            file_data.Append("</form></body></html>" );
            File.WriteAllText( output_filename, file_data.ToString(), Encoding.Unicode );
        }
    }
}

Specifically, that code will generate an HTML form containing all characters in the BMP, along with input text boxes named after the hex values prefixed with "r_" (r is for "replacement value"). If this were ported to an ASP.NET page, additional code could be written to pre-populate replacement values as far as possible:

with their own value if already ASCII, or
with Unicode normalized FormD or FormKD decomposed equivalents, or
with a single ASCII value for an entire category (e.g. an ASCII double quote for every "initial quote punctuation" character)
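
The three pre-population rules above could be tried in order, roughly as in the following C sketch. It is only an illustration: standard C has no Unicode normalization or category API, so the two small tables are hand-written stand-ins for the data that NormalizationForm.FormKD and UnicodeCategory would supply in .NET, and the name suggest_replacement is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hand-written stand-in for NFD/NFKD data: code point -> ASCII base letter. */
struct decomposition { uint32_t code_point; char base; };
static const struct decomposition DECOMPOSITIONS[] = {
    { 0x00C9, 'E' },  /* E with acute: E + combining acute accent */
    { 0x00E9, 'e' },  /* e with acute: e + combining acute accent */
    { 0x00F1, 'n' },  /* n with tilde: n + combining tilde */
    { 0xFF21, 'A' },  /* fullwidth A: compatibility form of A */
};

/* A few members of the InitialQuotePunctuation category (not the full set). */
static const uint32_t INITIAL_QUOTES[] = { 0x00AB, 0x2018, 0x201C, 0x2039 };

/* Returns a suggested ASCII replacement for one code point, or 0 when
   none of the rules applies and the field should be filled in by hand. */
char suggest_replacement(uint32_t cp)
{
    size_t i;

    /* Rule 1: ASCII characters map to themselves. */
    if (cp < 0x80)
        return (char)cp;

    /* Rule 2: use the ASCII base letter of the decomposition, if listed. */
    for (i = 0; i < sizeof DECOMPOSITIONS / sizeof DECOMPOSITIONS[0]; i++)
        if (DECOMPOSITIONS[i].code_point == cp)
            return DECOMPOSITIONS[i].base;

    /* Rule 3: one default for a whole category. */
    for (i = 0; i < sizeof INITIAL_QUOTES / sizeof INITIAL_QUOTES[0]; i++)
        if (INITIAL_QUOTES[i] == cp)
            return '"';

    return 0;
}
```

A return value of 0 marks the code points that would still need a manual decision in the form.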

You could then go through manually and make adjustments, and it probably wouldn't take as long as you'd think. There are only 63,488 code points to cover (the 65,536 in the BMP minus the 2,048 surrogates that the loop skips), and large chunks of entire categories can probably be dismissed as "not even close to anything ASCII". So, I'm going to build this map and function.
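
Once a table exists, the function itself can be small. Below is a rough C sketch of a MapToASCII-style routine in the spirit of the question. The names map_code_point and map_to_ascii, the zero-terminated UTF-32 input, and the seven-entry table are all assumptions made for illustration; a real table would come from the form described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative table: a few punctuation marks that word processors insert. */
struct ascii_mapping { uint32_t code_point; char ascii; };
static const struct ascii_mapping PUNCTUATION_MAP[] = {
    { 0x00A0, ' '  },  /* no-break space */
    { 0x2013, '-'  },  /* en dash */
    { 0x2014, '-'  },  /* em dash */
    { 0x2018, '\'' },  /* left single quotation mark */
    { 0x2019, '\'' },  /* right single quotation mark */
    { 0x201C, '"'  },  /* left double quotation mark */
    { 0x201D, '"'  },  /* right double quotation mark */
};

/* Returns the ASCII equivalent of one code point, or `fallback` if the
   code point is neither ASCII nor listed in the table. */
static char map_code_point(uint32_t cp, char fallback)
{
    size_t i;

    if (cp < 0x80)
        return (char)cp;  /* already ASCII */
    for (i = 0; i < sizeof PUNCTUATION_MAP / sizeof PUNCTUATION_MAP[0]; i++)
        if (PUNCTUATION_MAP[i].code_point == cp)
            return PUNCTUATION_MAP[i].ascii;
    return fallback;
}

/* Converts a zero-terminated UTF-32 string into `out`, which holds
   `out_size` bytes including the terminator. Input that does not fit is
   truncated. Returns the number of characters written, excluding the
   terminator. */
size_t map_to_ascii(const uint32_t *in, char *out, size_t out_size, char fallback)
{
    size_t n = 0;

    if (out_size == 0)
        return 0;
    while (in[n] != 0 && n + 1 < out_size) {
        out[n] = map_code_point(in[n], fallback);
        n++;
    }
    out[n] = '\0';
    return n;
}
```

Called with '_' as the fallback, a string containing curly quotes, a long dash, and a hiragana letter comes back with straight quotes, a hyphen, and an underscore.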
