How do I convert posted "English" characters from international PCs in ASP.NET? (e.g. 2205)
I have a WebForm search page that gets occasional hits from international visitors. When they enter text, it appears to be plain ASCII a-z, 0-9, but the characters are printed in bold and my "is this text" logic can't handle the input. Is there an easy way in ASP.NET to convert Unicode characters that equate to A-Z, 0-9 into plain old text?
You are getting the so-called "Fullwidth Forms" of the characters. In Unicode, these are encoded at code points U+FF01 to U+FF5E. To get the corresponding ASCII code point (U+0021 to U+007E), take the character's code point and subtract (0xFF01 - 0x0021), that is 0xFEE0, from it.
ASCII: http://unicode.org/charts/PDF/U0000.pdf
Fullwidth Forms: http://unicode.org/charts/PDF/UFF00.pdf
I don't speak ASP.NET, but in Java the code would look like this:
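A minimal Java sketch of the subtraction described above. The class and method names are hypothetical and chosen only for illustration; the range U+FF01 to U+FF5E and the offset come from the explanation itself. Characters outside that range (including the ideographic space U+3000) are left untouched.

```java
final class FullwidthToAscii {
    // Range of the fullwidth counterparts of printable ASCII, as given above.
    private static final int FULLWIDTH_START = 0xFF01;
    private static final int FULLWIDTH_END = 0xFF5E;
    // Distance between a fullwidth character and its ASCII counterpart (0xFEE0).
    private static final int OFFSET = 0xFF01 - 0x0021;

    // Hypothetical helper: maps fullwidth forms to ASCII, copies everything else.
    static String toAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= FULLWIDTH_START && c <= FULLWIDTH_END) {
                out.append((char) (c - OFFSET));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Fullwidth "ABc123" written with escapes so the source stays ASCII.
        System.out.println(toAscii("\uFF21\uFF22\uFF43\uFF11\uFF12\uFF13")); // prints ABc123
    }
}
```

The same loop should carry over to C# almost unchanged, since both languages treat `char` as a UTF-16 code unit and the whole fullwidth range lies in the Basic Multilingual Plane.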
This could be the Unicode "mathematical bold" characters 𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗. But more likely it's the "fullwidth" characters ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ０１２３４５６７８９. (These are common in East Asian character encodings: "fullwidth" refers to being the same width as a Hanzi/Kanji character.)
To convert either set to ASCII, use the Unicode normalization form KC or KD.
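As a rough illustration in Java, `java.text.Normalizer` exposes these normalization forms directly. The class and method names below are hypothetical. Note that NFKC folds every compatibility character, not only fullwidth and mathematical bold ones: ligatures such as U+FB01 become "fi" and superscript digits become plain digits, which may or may not be what a search box wants.

```java
import java.text.Normalizer;

final class CompatibilityNormalizer {
    // Hypothetical helper: applies Unicode normalization form KC.
    static String toPlain(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Fullwidth "abc123", written with escapes so the source stays ASCII.
        String fullwidth = "\uFF41\uFF42\uFF43\uFF11\uFF12\uFF13";

        // Mathematical bold a, b and digit one live outside the BMP,
        // so they are built from code points rather than char literals.
        String mathBold = new StringBuilder()
                .appendCodePoint(0x1D41A)
                .appendCodePoint(0x1D41B)
                .appendCodePoint(0x1D7CF)
                .toString();

        System.out.println(toPlain(fullwidth)); // prints abc123
        System.out.println(toPlain(mathBold));  // prints ab1
    }
}
```

In .NET the corresponding call is `String.Normalize` with `NormalizationForm.FormKC` or `NormalizationForm.FormKD`.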
You should look at the answer from this question.
It includes the following method (from Michael Kaplan's blog entry "Stripping is an interesting job"):
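A minimal Java sketch of that idea (this is not Michael Kaplan's original C# code, and the class and method names are hypothetical): decompose the string with normalization form D, drop every code point whose general category is "nonspacing mark", and recompose the rest with form C.

```java
import java.text.Normalizer;

final class DiacriticStripper {
    // Hypothetical helper: removes combining (nonspacing) marks from a string.
    static String removeDiacritics(String input) {
        // NFD splits precomposed letters into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);

        StringBuilder out = new StringBuilder(decomposed.length());
        decomposed.codePoints()
                // Keep everything except category Mn (nonspacing mark).
                .filter(cp -> Character.getType(cp) != Character.NON_SPACING_MARK)
                .forEach(out::appendCodePoint);

        // Recompose whatever is left into canonical composed form.
        return Normalizer.normalize(out, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "Zürich café" written with escapes so the source stays ASCII.
        System.out.println(removeDiacritics("Z\u00FCrich caf\u00E9")); // prints Zurich cafe
    }
}
```

Because this sketch uses canonical decomposition only, it leaves fullwidth and mathematical bold characters as they are.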
This will strip all the NonSpacingMark characters from a string. This means it will convert é to e, because é (in its decomposed form) is actually built from an e and a ´ character. The ´ is a "NonSpacingMark", meaning that it is added to the previous character. The method tries to detect these special characters and rebuilds the string without the NonSpacingMark characters. (This is how I understand it; it might not be true.) This will not work for all Unicode characters, but input from users using a Latin-based character set (English, Spanish, French, German, etc.) will be "cleaned". I have no experience with Asian character sets.
After feedback
I adjusted the routine to the info I got from comments and answers to this question. My current version is:
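One way such an adjusted routine might look in Java, assuming it combines the compatibility normalization suggested in the other answers with the mark stripping described earlier. That combination is an assumption made for illustration, and the class and method names are hypothetical; this is not the author's code.

```java
import java.text.Normalizer;

final class InputCleaner {
    // Hypothetical helper: folds compatibility characters and strips diacritics.
    static String clean(String input) {
        // NFKD maps fullwidth, bold, ligature and similar forms to their plain
        // equivalents and splits accented letters into base letter plus marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);

        StringBuilder out = new StringBuilder(decomposed.length());
        decomposed.codePoints()
                // Drop the combining marks (category Mn) exposed by the decomposition.
                .filter(cp -> Character.getType(cp) != Character.NON_SPACING_MARK)
                .forEach(out::appendCodePoint);

        // Recompose anything that can still be composed.
        return Normalizer.normalize(out, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Fullwidth "Caf" followed by e-acute, written with escapes.
        System.out.println(clean("\uFF23\uFF41\uFF46\u00E9")); // prints Cafe
    }
}
```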
This routine removes diacritics (as far as possible) and converts the other "strange" characters into their "normal" form.
You might try something like this:
Although, I'm not quite sure what the problem is with the input. What exactly are you doing with the text? Does it matter if it contains more than just ASCII characters? And I especially don't know what you mean by "they are printed in bold".