How do I convert posted "English" characters from international PCs into plain "English" characters in ASP.NET? (e.g. ２２０５)

Posted on 2024-09-08 21:26:57

I have a WebForm search page that gets occasional hits from international visitors. When they enter in text, it appears to be plain ASCII a-z, 0-9 but they are printed in bold and my "is this text" logic can't handle the input. Is there any easy way in ASP.NET to convert Unicode characters that equate to A-Z, 0-9 into plain old text?

Comments (4)

喵星人汪星人 2024-09-15 21:26:57

You are getting so-called "Fullwidth Forms" of the characters. In Unicode, these are encoded at codepoints U+FF01 to U+FF5E. To get the ASCII codepoint (U+0021 to U+007E) from them, you have to get their codepoint and subtract (0xFF01 - 0x0021) from it.

ASCII: http://unicode.org/charts/PDF/U0000.pdf
Fullwidth Forms: http://unicode.org/charts/PDF/UFF00.pdf

I don't speak ASP.NET, but in Java the code would look like this:

String decodeFullwidth(String s) {
  StringBuilder sb = new StringBuilder();
  for (int i = 0; i < s.length(); i++) {
    char c = s.charAt(i);
    if (0xFF01 <= c && c <= 0xFF5E) {
      sb.append((char) (c - (0xFF01 - 0x0021)));
    } else {
      sb.append(c);
    }
  }
  return sb.toString();
}

雨后咖啡店 2024-09-15 21:26:57

it appears to be plain ASCII a-z, 0-9
but they are printed in bold

This could be the Unicode "mathematical bold" characters 𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗. But more likely it's the "fullwidth" characters ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ０１２３４５６７８９. (These are common in East Asian character encodings: "fullwidth" refers to being the same width as a Hanzi/Kanji character.)

To convert either set to ASCII, use the Unicode normalization form KC or KD.
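As a rough illustration of the normalization approach, here is a minimal Java sketch (Java only because another answer in this thread already uses it; in .NET the counterpart would be `String.Normalize` with `NormalizationForm.FormKC`). The class and method names are made up for this example, and the inputs are written as Unicode escapes so the source file's encoding does not matter.

```java
import java.text.Normalizer;

// Hypothetical class name, used only for this illustration.
class FullwidthToAscii {

    // Applies compatibility composition (NFKC), which maps fullwidth and
    // mathematical-bold letters and digits to their plain counterparts.
    static String toCompatibilityForm(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Fullwidth "abc123": U+FF41..U+FF43 followed by U+FF11..U+FF13
        String fullwidth = "\uFF41\uFF42\uFF43\uFF11\uFF12\uFF13";
        // Mathematical bold "ab": U+1D41A and U+1D41B as surrogate pairs
        String mathBold = "\uD835\uDC1A\uD835\uDC1B";

        System.out.println(toCompatibilityForm(fullwidth)); // prints abc123
        System.out.println(toCompatibilityForm(mathBold));  // prints ab
    }
}
```

NFKC also folds other compatibility characters (ligatures, superscripts, and so on), so it may change more than just the letters and digits you care about.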

原来是傀儡 2024-09-15 21:26:57

You should look at the answer from this question.

It includes the following method (from Michael Kaplan's blog entry "Stripping is an interesting job"):

static string RemoveDiacritics(string stIn) {
  string stFormD = stIn.Normalize(NormalizationForm.FormD);
  StringBuilder sb = new StringBuilder();

  for(int ich = 0; ich < stFormD.Length; ich++) {
    UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
    if(uc != UnicodeCategory.NonSpacingMark) {
      sb.Append(stFormD[ich]);
    }
  }

  return(sb.ToString().Normalize(NormalizationForm.FormC));
}

This will strip all the NonSpacingMark characters from a string. This means it will convert é to e, because after decomposition (FormD) é is actually built from an e and a combining ´ character.
The ´ is a "NonSpacingMark", meaning that it is combined with the previous character. The method detects these special characters and rebuilds the string without them. (This is how I understand it; it might not be true.)

This will not work for all Unicode characters, but input from users with a Latin-based character set (English, Spanish, French, German, etc.) will be "cleaned". I have no experience with Asian character sets.


After feedback

I adjusted the routine to the info I got from comments and answers to this question. My current version is:

    public static string RemoveDiacritics(string stIn) {
        string stFormD = stIn.Normalize(NormalizationForm.FormD);
        StringBuilder sb = new StringBuilder();

        for (int ich = 0; ich < stFormD.Length; ich++) {
            UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
            switch (uc) {
                case UnicodeCategory.NonSpacingMark:
                    break;
                case UnicodeCategory.DecimalDigitNumber:
                    sb.Append(CharUnicodeInfo.GetDigitValue(stFormD[ich]).ToString());
                    break;
                default:
                    sb.Append(stFormD[ich]);
                    break;
            }
        }

        return (sb
            .ToString()
            .Normalize(NormalizationForm.FormKC));
    }

This routine will remove diacritics (as far as possible) and will convert the other "strange" characters into their "normal" form.
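For readers following the Java answer elsewhere in this thread, the same idea can be sketched in Java. This is only an approximate port under my own assumptions, not the original author's code: the class and method names are invented, and it walks the string by code point rather than by `char`, so characters outside the Basic Multilingual Plane are classified as whole characters.

```java
import java.text.Normalizer;

// Hypothetical class name, used only for this illustration.
class DiacriticStripper {

    static String removeDiacritics(String input) {
        // Decompose first so that accents become separate combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(decomposed.length());

        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            i += Character.charCount(cp);

            switch (Character.getType(cp)) {
                case Character.NON_SPACING_MARK:
                    // Drop combining marks such as U+0301 (acute accent).
                    break;
                case Character.DECIMAL_DIGIT_NUMBER:
                    // Append the numeric value, which yields an ASCII digit.
                    sb.append(Character.digit(cp, 10));
                    break;
                default:
                    sb.appendCodePoint(cp);
                    break;
            }
        }

        // Compatibility composition folds fullwidth letters and similar forms.
        return Normalizer.normalize(sb, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // "Café" followed by the fullwidth digits 2 and 0
        System.out.println(removeDiacritics("Caf\u00E9 \uFF12\uFF10")); // prints Cafe 20
    }
}
```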

不知所踪 2024-09-15 21:26:57

You might try something like this:

Encoding.ASCII.GetString(Encoding.Convert(Encoding.Unicode, Encoding.ASCII, Encoding.Unicode.GetBytes(myString)));

Although, I'm not quite sure what the problem is with the input. What exactly are you doing with the text? Does it matter if it contains more than just ASCII characters? And, I especially don't know what you mean by "they are printed in bold".
