Encoding of ASCII strings in a byte array within a UTF-8 XML document
I have the following requirements:
...The document must be encoded in UTF-8... The Lastname field only allows (Extended) ASCII ... City only allows ISOLatin1
...The message must be put on the (IBM Websphere) MessageQueue as an IBytesMessage
The XML document, for simplicity's sake, looks like this:
<?xml version="1.0" encoding="utf-8"?>
<foo>
<lastname>John ÐØë</lastname>
<city>John ÐØë</city>
<other>UTF-8 string</other>
</foo>
The "ÐØë" characters are (or should be) ASCII values 208, 216 and 235 respectively.
I also have an object:
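public class foo {
    public string lastname { get; set; }
    // city is assigned in the initializer below, so it needs a property too
    public string city { get; set; }
}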
So I instantiate an object and set the lastname:
var x = new foo() { lastname = "John ÐØë", city = "John ÐØë" };
Now this is where my headache sets in (or the inception if you will...):
- Visual Studio / source code is in Unicode
- Hence: the object has a Unicode lastname
- The XML serializer uses UTF-8 to encode the document
- Lastname should contain only (Extended) ASCII characters; the characters are valid ASCII chars, but of course in UTF-8 encoded form
I normally don't experience any trouble with my encodings; I am familiar with The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) but this one's got me stumped...
I understand that the UTF-8 document will be perfectly able to "contain" both encodings because the codepoints 'overlap'. But where I get lost is when I need to convert the serialized message to a byte array. When doing a dump I see C3 XX C3 XX C3 XX (I don't have the actual dump at hand). It's clear (or I've been staring at this for too long) that the lastname / city strings are put in the serialized document in their Unicode form; the byte array suggests so.
Now what will I have to do, and where, to ensure the Lastname string goes into the XML document and finally the byte-array as an ASCII string (and the actual 208, 216, 235 byte sequence), and that City makes it in there as ISOLatin1?
I know the requirements are backwards, but I can't change them (3rd party). I always use UTF-8 for our internal projects, so I have to support the Unicode/UTF-8 => ASCII/ISOLatin1 conversion (of course, only for chars that are in those sets).
My head hurts...
Comments (6)
Never mind how the XML document is encoded for transmission. The right way to do what you want to do (encode certain non-ASCII characters so they survive the trip unscathed) is to use XML character references to represent the characters that need to be so preserved. For instance, your
<lastname>John ÐØë</lastname>
is represented using XML character references as
<lastname>John &#208;&#216;&#235;</lastname>
The receiving [conformant] XML processor will/should/must convert those numeric character references back to the characters they represent. Here's some code that will do the trick:
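A minimal sketch of such a routine follows. It is not the answerer's original code; the class name XmlCharRefs and the method name EscapeNonAscii are hypothetical, and the sketch assumes that every character above U+007F should become a decimal character reference.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper: the names are illustrative, not taken from the original answer.
public static class XmlCharRefs
{
    // Returns a copy of the input in which every character above U+007F is
    // replaced by a decimal XML character reference such as &#208;.
    public static string EscapeNonAscii(string value)
    {
        var sb = new StringBuilder(value.Length);
        for (int i = 0; i < value.Length; i++)
        {
            if (value[i] <= 0x7F)
            {
                sb.Append(value[i]); // plain ASCII passes through unchanged
                continue;
            }

            // ConvertToUtf32 combines a surrogate pair into one code point
            // (it throws ArgumentException on a lone surrogate).
            int codePoint = char.ConvertToUtf32(value, i);
            if (char.IsHighSurrogate(value[i]))
            {
                i++; // the low surrogate has already been consumed
            }

            sb.Append("&#")
              .Append(codePoint.ToString(CultureInfo.InvariantCulture))
              .Append(';');
        }
        return sb.ToString();
    }
}
```

Calling XmlCharRefs.EscapeNonAscii("John ÐØë") returns John &#208;&#216;&#235;. The helper is meant to run over the already serialized XML text: if the references are placed in the property values before serialization, the serializer escapes the ampersand and the references are lost.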
You could also use a regular expression to make the replacement or write an XSLT that would do the same thing. But the task is so trivial, it doesn't really warrant that sort of approach. The above code is probably faster and less memory intensive and...it's easier to understand.
You should note, though, that since you want to preserve two different encodings in the same document, your conversion routine will need to differentiate between the conversion from "extended ASCII" to an XML character reference and the conversion from "ISO Latin 1" to an XML character reference.
In both cases, the character reference specifies a codepoint in the ISO/IEC 10646 character set, which is essentially Unicode. You'll want to map the characters to the appropriate code point. Since strings in the CLR world are UTF-16 encoded, that shouldn't be much of an issue. The above code should work fine, I believe, unless you've got something really oddball that doesn't play very nicely with UTF-16.
So..
System.Text.Encoding.ASCII.GetBytes(string)
will probably do what you want: convert a string into an ASCII-encoded byte array.
You simply can't have the byte sequence 208, 216, 235 in a UTF-8 encoded string/byte array.
I hope you can save the XML as ISO 8859-1, with or without mentioning it in the XML
<?xml version="1.0" encoding="XXXXXXXXXX"?>
declaration (maybe even specifying an invalid UTF-8 encoding in the XML header).
Otherwise, if your requirements are as you stated, just ask for the exact expected byte array for a given input and craft your own custom serialization (or maybe a custom encoding; I'm not sure that is possible).
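To make the byte-level difference concrete, here is a small sketch that dumps the same three characters under different encodings. The class name EncodingDump and the method Hex are hypothetical names chosen for illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper for inspecting how an encoding turns text into bytes.
public static class EncodingDump
{
    // Returns the encoded bytes as a dash-separated hex string, e.g. "D0-D8-EB".
    public static string Hex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }
}
```

For the string "ÐØë", Hex(Encoding.UTF8, ...) gives C3-90-C3-98-C3-AB, which matches the C3 XX pattern seen in the question's dump. Hex(Encoding.GetEncoding("ISO-8859-1"), ...) gives D0-D8-EB, that is 208, 216, 235. Hex(Encoding.ASCII, ...) gives 3F-3F-3F, because the ASCII encoder replaces anything above 127 with a question mark.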
If that is the precise specification, then I think you might be misunderstanding it. Your task is not one of encoding, but one of validation/fallback. The entire document, including the Lastname and City fields, must be encoded as UTF-8. Quite simply, the XML document would be invalid if it declares its encoding as UTF-8 and then contains byte values that are not valid under that encoding.
Conveniently, ASCII overlaps with the first 128 codepoints of Unicode; Latin1 overlaps with the first 256.
To check whether Lastname can be represented as ASCII, you could check that all its characters have codepoints within the 0–127 range.
To conform with your specification, you would have to force invalid characters to fall back to the replacement character (typically ?) by encoding the string as ASCII, and then decoding it back. Similarly for City, using Latin1 in place of ASCII.
Subsequently, you should just save everything as UTF-8.
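The range check and the encode/decode round trip described above might look roughly like this. It is a sketch rather than the answer's original code; the class FieldSanitizer and its method names are hypothetical, and the Latin1 encoding is created with an explicit "?" replacement fallback so that the behaviour does not depend on the runtime's default fallback.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper; names are illustrative only.
public static class FieldSanitizer
{
    // Latin1 with an explicit "?" fallback for characters outside U+0000..U+00FF.
    private static readonly Encoding Latin1 = Encoding.GetEncoding(
        "ISO-8859-1",
        new EncoderReplacementFallback("?"),
        new DecoderReplacementFallback("?"));

    // True when every UTF-16 code unit is in the ASCII range 0..127.
    public static bool IsAscii(string value)
    {
        return value.All(c => c <= 0x7F);
    }

    // True when every UTF-16 code unit is in the Latin1 range 0..255.
    public static bool IsLatin1(string value)
    {
        return value.All(c => c <= 0xFF);
    }

    // Round trip through ASCII: characters above 127 come back as '?'.
    public static string ForceAscii(string value)
    {
        return Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(value));
    }

    // Round trip through Latin1: characters above 255 come back as '?'.
    public static string ForceLatin1(string value)
    {
        return Latin1.GetString(Latin1.GetBytes(value));
    }
}
```

With these helpers, ForceAscii("John ÐØë") returns "John ???", while ForceLatin1("John ÐØë") returns the string unchanged. The results are ordinary .NET strings and can be assigned to the object before it is serialized as UTF-8.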
My assumption is that your third-party software can correctly decode the XML document using UTF-8; however, it must then extract the Lastname and City fields, and use them somewhere where only ASCII and Latin1 are allowed. It imposes the restrictions on you in order to ensure that it would not be forced to incur data loss (because of the presence of disallowed characters).
Edit: This is the workaround that you're proposing. I'm using Latin1 in place of "Extended ASCII" because the latter term is ambiguous.
SecurityElement.Escape replaces invalid XML characters in a string with their valid XML equivalent (e.g. < to &lt; and & to &amp;).
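The code for that workaround is not preserved here. The following is only a guess at its general shape, under the assumption that the markup is written as UTF-8 bytes while the Lastname and City values are written as raw Latin1 bytes after passing through SecurityElement.Escape. The class MixedEncodingWriter, the method Build, and the element layout are all hypothetical.

```csharp
using System;
using System.IO;
using System.Security;
using System.Text;

// Hypothetical reconstruction; not the original code from the answer.
public static class MixedEncodingWriter
{
    // Builds the message bytes by hand: markup as UTF-8, two field values as Latin1.
    public static byte[] Build(string lastname, string city, string other)
    {
        Encoding utf8 = Encoding.UTF8;
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

        using (var stream = new MemoryStream())
        {
            Write(stream, utf8, "<?xml version=\"1.0\" encoding=\"utf-8\"?><foo><lastname>");
            Write(stream, latin1, SecurityElement.Escape(lastname)); // one byte per character
            Write(stream, utf8, "</lastname><city>");
            Write(stream, latin1, SecurityElement.Escape(city));     // one byte per character
            Write(stream, utf8, "</city><other>");
            Write(stream, utf8, SecurityElement.Escape(other));      // regular UTF-8
            Write(stream, utf8, "</other></foo>");
            return stream.ToArray();
        }
    }

    // Encodes the text with the given encoding and appends the bytes to the stream.
    private static void Write(Stream stream, Encoding encoding, string text)
    {
        byte[] bytes = encoding.GetBytes(text);
        stream.Write(bytes, 0, bytes.Length);
    }
}
```

The resulting array does contain the bytes 208, 216, 235 for "ÐØë", but it is no longer a valid UTF-8 document whenever those fields hold characters above 127, so a conformant XML parser on the receiving side may reject it.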
I understand this as 2 separate requirements:
1) The XML must be UTF-8 encoded;
2) The City name is limited to ISOLatin1.
This means that when you decode the UTF-8 to Unicode, the City characters come only from the ISOLatin1 set. In other words, the XML could have been ISOLatin1 encoded (all text is from the ISOLatin1 code table), but it is UTF-8. ISOLatin1 is a small part of the Unicode table, and UTF-8 is an 8-bit encoding of Unicode.
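Under this reading, a receiver could verify the second requirement only after decoding the document. A minimal sketch, assuming the element is named city as in the question's sample document; the class MessageValidator and the method CityIsLatin1 are hypothetical.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml.Linq;

// Hypothetical validator; names are illustrative only.
public static class MessageValidator
{
    // Parses the UTF-8 encoded XML bytes and checks that every character of
    // the <city> element falls inside the Latin1 range U+0000..U+00FF.
    public static bool CityIsLatin1(byte[] utf8Xml)
    {
        using (var stream = new MemoryStream(utf8Xml))
        {
            XDocument doc = XDocument.Load(stream);
            // A missing <city> element is treated as an empty string.
            string city = (string)doc.Root.Element("city") ?? string.Empty;
            return city.All(c => c <= 0xFF);
        }
    }
}
```

The bytes on the wire stay plain UTF-8 (C3 90 and so on for "ÐØë"); the restriction applies to the decoded characters, not to the byte values.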
The accepted answer from Nicholas Carey is OK, but it has errors and the code doesn't work. I don't have enough reputation to comment, so I will write working code here: