HTML-encoding characters that are not in the character set
We have a web app which uses the ISO-8859-1 character set. Occasionally users have 'strange' names which contain characters like Š (&#352; when HTML-encoded). We store this in our database, but we can't display it correctly.
What is the best way of dealing with this? I'm thinking I should automatically convert characters outside the character set to their HTML entity number encoding (Š to &#352;), but I'm having problems finding out how to do this automatically (without using a table of all values).
This code works for extended ASCII characters like 'å' (that are present in ISO-8859-1). I would like to do the same with other characters. Is there a pattern in these HTML entity encoding values I can use?
unsigned int c;
for( int i=0; i < html.GetLength(); i++)
{
    c = html[i];
    if( c > 255 || c < 0 )
    {
        CString orig = CString(html[i]);
        CString encoded = "&#";
        encoded += CTool::String((byte)c);
        encoded += ";";
        html.Replace(orig, encoded);
    }
}
Comments (2)
The webpage should instruct the browser to display the response in UTF-8. This usually happens by supplying the charset in the Content-Type response header, like text/html;charset=UTF-8. The HTML/XML entities are solely there so that you will be able to save the webpage source in an encoding other than UTF-8.
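
If the application does switch to UTF-8 as suggested here, the text it holds in memory still has to be written out as UTF-8 bytes so that the body matches the declared charset. The following is a minimal sketch of that encoding step for a single code point. The function name `appendUtf8` is hypothetical and does not come from the original post; on Windows, the `WideCharToMultiByte` API with `CP_UTF8` would be the more usual route.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from the original post): appends the UTF-8
// encoding of one Unicode code point to `out`.
void appendUtf8(std::string& out, char32_t cp)
{
    assert(cp <= 0x10FFFF);  // highest valid Unicode code point

    if (cp < 0x80) {
        // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

For Š (U+0160) this yields the two bytes 0xC5 0xA0, so no entity is needed once the page is served as UTF-8.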
html appears to be a "Unicode" CString. That means it's UTF-16 encoded. The "&#ddd;" syntax uses the Unicode code point number. Usually, this is quite simple. Š is U+0160, which means it's 0x0160 in UTF-16. That's of course 352 decimal, so you get &#352;.
You only have a problem when you encounter a character outside the Basic Multilingual Plane (BMP), which is past U+FFFF. This no longer fits in 16 bits, and will therefore take TWO characters in your html string. Yet, it should produce only one &#ddddd; value. This is so rare that you often can ignore it.
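
The approach described in this answer can be sketched as follows. This is an illustration under assumptions, not the asker's final code: it uses `std::u16string` in place of the MFC `CString` so that it compiles anywhere, and the function name `encodeNonLatin1` is hypothetical. It also shows why the `(byte)c` cast in the question fails: the cast keeps only the low 8 bits, so 352 becomes 96. The sketch writes out the full code point and combines a surrogate pair into a single value.

```cpp
#include <string>

// Hypothetical helper (not from the original post): converts UTF-16 text
// into bytes that are safe to serve as ISO-8859-1. Every character above
// U+00FF is replaced by a decimal numeric character reference.
std::string encodeNonLatin1(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];

        // A high surrogate followed by a low surrogate encodes one code
        // point outside the BMP; merge the two units into a single value.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            char16_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }

        if (cp > 0xFF) {
            // Not representable in ISO-8859-1: emit &#NNN; using the full
            // code point. Unpaired surrogates are not treated specially in
            // this sketch and would be emitted as-is.
            out += "&#";
            out += std::to_string(static_cast<unsigned long>(cp));
            out += ';';
        } else {
            // U+0000..U+00FF map one-to-one onto ISO-8859-1 bytes.
            out += static_cast<char>(cp);
        }
    }
    return out;
}
```

Building a new output string, rather than calling `Replace` on the string being iterated, also avoids changing the string's length in the middle of the loop.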