I have UTF-8 - but still get "Invalid byte 1 of 1-byte UTF-8 sequence"

Posted on 2024-12-20 12:33:47

I create an XML string on the fly (NOT reading it from a file). Then I use Cocoon 3 to transform it via FOP to a PDF. Somewhere in the middle Xerces runs. When I use hardcoded data, everything works. As soon as I put a German umlaut into the database and enrich my XML with that data, I get:

Caused by: org.apache.cocoon.pipeline.ProcessingException: Can't parse the XML string.
at org.apache.cocoon.sax.component.XMLGenerator$StringGenerator.execute(XMLGenerator.java:326)
at org.apache.cocoon.sax.component.XMLGenerator.execute(XMLGenerator.java:104)
at org.apache.cocoon.pipeline.AbstractPipeline.invokeStarter(AbstractPipeline.java:146)
at org.apache.cocoon.pipeline.AbstractPipeline.execute(AbstractPipeline.java:76)
at de.grobmeier.tab.webapp.modules.documents.InvoicePipeline.generateInvoice(InvoicePipeline.java:74)
... 87 more

Caused by: com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:684)
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:554)

I then debugged my app and found that my "Ä" (which comes from the database) has the byte value 196, which is C4 in hex. This is what I expected according to this table: http://www.utf8-zeichentabelle.de/

I do not know why my code fails.

I then tried to add a BOM manually, like this:

byte[] bom = new byte[3];
bom[0] = (byte) 0xEF;
bom[1] = (byte) 0xBB;
bom[2] = (byte) 0xBF;
String myString = new String(bom) + inputString;

I know this is not exactly good, but I tried it, and of course it failed. I also tried to add an XML declaration in front:

<?xml version="1.0" encoding="UTF-8"?>

That failed too. Then I combined both. Failed.

Finally I tried something like this:

xmlInput = new String(xmlInput.getBytes("UTF8"), "UTF8");

This in fact does nothing, because the string is already UTF-8. It still fails.

So... any ideas what I am doing wrong and what Xerces expects from me?

Thanks
Christian

半夏半凉 2024-12-27 12:33:47

If your database contains only a single byte (with value 0xC4) then you aren't using UTF-8 encoding.

The character "LATIN CAPITAL LETTER A WITH DIAERESIS" has the code point U+00C4, but UTF-8 can't encode that in a single byte. If you check the third column, "UTF-8 (hex.)", on UTF8-zeichentabelle.de, you'll see that UTF-8 encodes it as 0xC3 0x84 (two bytes).
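
To make the difference between the code point and its encoded bytes concrete, here is a small sketch using only the standard Java library. The class name UmlautBytes and the helper hex are made up for illustration; ISO-8859-1 stands in for a typical single-byte Western code page.

```java
import java.nio.charset.StandardCharsets;

public class UmlautBytes {
    /** Formats bytes as space-separated uppercase hex, e.g. "C3 84". */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative byte values print as two hex digits.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00C4"; // LATIN CAPITAL LETTER A WITH DIAERESIS
        System.out.println("code point: U+" + String.format("%04X", s.codePointAt(0)));
        System.out.println("UTF-8:      " + hex(s.getBytes(StandardCharsets.UTF_8)));
        System.out.println("ISO-8859-1: " + hex(s.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

A single byte 0xC4 is what a Latin-1 style encoding produces; UTF-8 needs the two bytes 0xC3 0x84 for the same character.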

Please read Joel's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" for more info.

EDIT: Christian found the answer himself. It turned out to be a problem in the Cocoon 3 SAX component (I guess it's the alpha 3 version): if you pass XML as a String into the XMLGenerator class, SAX parsing goes wrong and causes this mess.

I looked up the code to find the actual problem in Cocoon-stax:

if (XMLGenerator.this.logger.isDebugEnabled()) {
    XMLGenerator.this.logger.debug("Using a string to produce SAX events.");
}
XMLUtils.toSax(new ByteArrayInputStream(this.xmlString.getBytes()), XMLGenerator.this.getSAXConsumer());

As you can see, getBytes() is called without a charset, so it creates a byte array in the JRE's default encoding, which then fails to parse. The XML declares itself to be UTF-8, but the bytes handed to the parser are most likely in your Windows code page.
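
The mismatch can be reproduced outside Cocoon with the JDK's own parser. This is only a sketch: the class name DeclaredVsActualEncoding and the method parses are invented for illustration, and ISO-8859-1 is an assumption standing in for whatever the platform default happens to be.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

public class DeclaredVsActualEncoding {
    /** Returns true if the bytes parse as XML, false if the parser rejects them. */
    public static boolean parses(byte[] xmlBytes) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return true;
        } catch (Exception e) { // IOException or SAXException from the parser
            System.out.println("parse failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>\u00C4rger</name>";

        // Bytes really are UTF-8, matching the declaration.
        System.out.println(parses(xml.getBytes(StandardCharsets.UTF_8)));

        // Bytes are Latin-1 while the declaration still claims UTF-8.
        System.out.println(parses(xml.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

The first call should succeed and the second should fail with a malformed byte sequence error. The exact wording of the message depends on which byte the parser trips over.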

As a workaround, one can use the following:

new org.apache.cocoon.sax.component.XMLGenerator(xmlInput.getBytes("UTF-8"),
       "UTF-8");

This will trigger the right internal actions (as Christian found out by experimenting with the API).
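
If you are feeding a String to a plain JAXP parser yourself, another option is to skip the byte conversion entirely and hand the parser a character stream. The sketch below uses standard JAXP only and is not the Cocoon API; the class name ParseFromString and the method rootText are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ParseFromString {
    /** Parses the XML string and returns the text content of its root element. */
    public static String rootText(String xml) {
        try {
            // A character stream is already decoded, so no charset is involved
            // and the encoding declaration in the prolog is disregarded.
            InputSource source = new InputSource(new StringReader(xml));
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(source);
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML string", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>\u00C4rger</name>";
        System.out.println(rootText(xml));
    }
}
```

Because the parser never sees bytes here, the platform default encoding cannot interfere.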

I've opened an issue in Apache's bug tracker.

EDIT 2: The issue is fixed and will be included in an upcoming release.


暖树树初阳… 2024-12-27 12:33:47

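The C4 you see on that page refers to the Unicode code point U+00C4. The byte sequence used to represent that code point in UTF-8 is not "\xC4". What you want is the value in the "UTF-8 (hex.)" column, which is "\xC3\x84".

So your data is not in UTF-8.

See here for how to encode your data as UTF-8.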


情绪操控生活 2024-12-27 12:33:47

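I'm running Windows 7 and use TextPad as my text editor to build XML data files by hand. I was getting a MalformedByteSequenceException. The encoding specified in my XML file was UTF-8. After some poking around, I found that my editor has a tool, "Tools... Convert to DOS". I ran it, re-saved the file, the exception went away, and my code ran fine.

I then looked at the editor's default encoding for that file type. It was ASCII, but when I changed the XML encoding parameter to ASCII, I got a different MalformedByteSequenceException.

So on Windows systems, you could try leaving the XML encoding as UTF-8 but saving the file with DOS encoding. I have not dug further into why this works.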
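
One way to take the guesswork out of what an editor actually wrote is to test the bytes with a strict UTF-8 decoder. This is a rough sketch using only java.nio; the class name Utf8Check and the method isValidUtf8 are made up, and for a real file the bytes would come from something like Files.readAllBytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    /** Returns true only if the bytes form a well-formed UTF-8 sequence. */
    public static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // Fail on bad input instead of silently substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xC3, (byte) 0x84}; // "Ä" encoded as UTF-8
        byte[] latin1 = {(byte) 0xC4};            // "Ä" encoded as ISO-8859-1
        System.out.println(isValidUtf8(utf8));
        System.out.println(isValidUtf8(latin1));
    }
}
```

If the check fails for a file that declares encoding="UTF-8", the file was saved in some other encoding and the parser error is to be expected.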
