How can I let the SAX parser determine the encoding from the xml declaration?

Posted on 2024-09-14 11:29:56

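I'm trying to parse xml files from different sources (over which I have little control). Most of them are encoded in UTF-8 and don't cause any problems with the following snippet: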

SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
FeedHandler handler = new FeedHandler();
InputSource is = new InputSource(getInputStream());
parser.parse(is, handler);

Since SAX defaults to UTF-8, this is fine. However, some of the documents declare:

<?xml version="1.0" encoding="ISO-8859-1"?>

Even though ISO-8859-1 is declared, SAX still defaults to UTF-8. Only if I add:

is.setEncoding("ISO-8859-1");

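will SAX use the correct encoding.

How can I let SAX automatically detect the correct encoding from the xml declaration without me setting it explicitly? I need this because I don't know beforehand what the encoding of a file will be.

Thanks in advance,
Allan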



七色彩虹 2024-09-21 11:29:56

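Use an InputStream as the argument to InputSource when you want SAX to autodetect the encoding.

If you want to set a specific encoding, use a Reader with a specified encoding or the setEncoding method.

Why? Because encoding autodetection algorithms need the raw bytes, not data that has already been converted to characters.

The question in the subject is: how can I let the SAX parser determine the encoding from the xml declaration? I found Allan's answer to this question misleading, so I provided this alternative, based on Jörn Horstmann's comment and my own later experience.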
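A minimal sketch of the byte-stream approach described in this answer, assuming the SAX parser that ships with the JDK. The class name EncodingAutoDetectDemo, the TextCollector handler, and the sample document are made up for illustration; they are not from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class EncodingAutoDetectDemo {

    // Hypothetical handler that simply concatenates all character data.
    static final class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // Parses raw bytes through a byte-stream InputSource with no encoding set,
    // so the parser has to work out the encoding from the document itself.
    static String parseBytes(byte[] xml) {
        try {
            TextCollector handler = new TextCollector();
            InputSource source = new InputSource(new ByteArrayInputStream(xml));
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
            return handler.text.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        // Sample document (an assumption for this sketch), encoded as ISO-8859-1.
        String doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00e9</name>";
        byte[] latin1 = doc.getBytes(StandardCharsets.ISO_8859_1);

        // Reports whether the accented character survived the round trip.
        System.out.println(parseBytes(latin1).equals("caf\u00e9"));
    }
}
```

Nothing here calls setEncoding or wraps the stream in a Reader, so the only encoding information the parser has is the declaration inside the bytes.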


稀香 2024-09-21 11:29:56

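I found the answer myself.

The SAX parser uses InputSource internally, and the InputSource docs say:

"The SAX parser will use the InputSource object to determine how to read XML input. If there is a character stream available, the parser will read that stream directly, disregarding any text encoding declaration found in that stream. If there is no character stream, but there is a byte stream, the parser will use that byte stream, using the encoding specified in the InputSource or else (if no encoding is specified) autodetecting the character encoding using an algorithm such as the one in the XML specification. If neither a character stream nor a byte stream is available, the parser will attempt to open a URI connection to the resource identified by the system identifier."

So basically you need to pass a character stream to the parser for it to pick up the correct encoding. See the solution below: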

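SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
FeedHandler handler = new FeedHandler();
// Note: with no charset argument, InputStreamReader decodes using the platform default charset.
Reader isr = new InputStreamReader(getInputStream());
InputSource is = new InputSource();
// Hand the parser a character stream instead of a byte stream.
is.setCharacterStream(isr);
parser.parse(is, handler);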
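The documentation quoted in this answer says that a character stream makes the parser disregard the encoding declaration. The sketch below illustrates what that means in practice, again assuming the JDK's built-in parser: with a Reader, the text that reaches the handler depends only on the charset the Reader was built with. ReaderVersusDeclarationDemo, TextCollector, and the sample document are hypothetical names and data chosen for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ReaderVersusDeclarationDemo {

    // Hypothetical handler that simply concatenates all character data.
    static final class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // Decodes the bytes with the given charset before the parser sees them,
    // so the parser receives characters and never inspects the raw bytes.
    static String parseWithReader(byte[] xml, Charset charset) {
        try {
            TextCollector handler = new TextCollector();
            Reader reader = new InputStreamReader(new ByteArrayInputStream(xml), charset);
            InputSource source = new InputSource(reader);
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
            return handler.text.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        // Sample document (an assumption for this sketch), encoded as ISO-8859-1.
        String doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00e9</name>";
        byte[] latin1 = doc.getBytes(StandardCharsets.ISO_8859_1);

        // Reader charset matches the actual bytes: the text should come through intact.
        System.out.println(parseWithReader(latin1, StandardCharsets.ISO_8859_1).equals("caf\u00e9"));

        // Reader charset does not match: the declaration is not consulted,
        // so the accented character should not come through intact.
        System.out.println(parseWithReader(latin1, StandardCharsets.UTF_8).equals("caf\u00e9"));
    }
}
```

A Reader is therefore a good fit when the encoding is known from somewhere outside the document, and a poor fit when the declaration is the only source of that information.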

