How can I let the SAX parser determine the encoding from the XML declaration?
I'm trying to parse XML files from different sources (over which I have little control). Most of them are encoded in UTF-8 and don't cause any problems with the following snippet:
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
FeedHandler handler = new FeedHandler();
InputSource is = new InputSource(getInputStream());
parser.parse(is, handler);
Since SAX defaults to UTF-8, this is fine. However, some of the documents declare:
<?xml version="1.0" encoding="ISO-8859-1"?>
Even though ISO-8859-1 is declared, SAX still defaults to UTF-8. Only if I add:
is.setEncoding("ISO-8859-1");
will SAX use the correct encoding.
How can I let SAX automatically detect the correct encoding from the XML declaration, without setting it explicitly myself? I need this because I don't know beforehand what the encoding of a file will be.
Thanks in advance,
Allan
Comments (2)
Use an InputStream as the argument to InputSource when you want SAX to autodetect the encoding.
If you want to set a specific encoding, use a Reader with a specified encoding or the setEncoding method.
Why? Because encoding autodetection algorithms need the raw bytes, not data that has already been converted to characters.
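As a minimal sketch of the byte-stream approach (assuming the JDK's default SAX implementation; the class name AutoDetectEncodingDemo, the helper parseText and the sample document are made up for illustration), the parser is handed raw ISO-8859-1 bytes and setEncoding is never called:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class AutoDetectEncodingDemo {

    // Hypothetical helper: parses the byte stream and returns all character data.
    static String parseText(InputStream in) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        // Byte stream and no setEncoding(): the parser inspects the raw bytes
        // and the XML declaration to choose the decoder itself.
        InputSource source = new InputSource(in);
        parser.parse(source, handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample document (an assumption for this sketch), encoded as ISO-8859-1.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><t>caf\u00e9</t>";
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);
        String text = parseText(new ByteArrayInputStream(latin1));
        System.out.println(text.equals("caf\u00e9"));
    }
}
```

If the stream were wrapped in a Reader first, the bytes would already be decoded before the parser saw them, so the declaration could no longer influence the result.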
The question in the subject is: how can the SAX parser determine the encoding from the XML declaration? I found Allan's answer to that question misleading, so I provided an alternative one, based on Jörn Horstmann's comment and my own later experience.
I found the answer myself.
The SAX parser uses InputSource internally, and from the InputSource docs:
So basically you need to pass a character stream to the parser for it to pick up the correct encoding. See solution below:
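A minimal sketch of what passing a character stream can look like, not necessarily the code this answer originally posted. The class name CharacterStreamDemo and the helper parseText are hypothetical, and the charset is assumed to be known by the caller. Note that when an InputSource carries a character stream, the parser reads it directly and disregards the encoding declaration, so the Reader itself has to decode the bytes correctly:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class CharacterStreamDemo {

    // Hypothetical helper: decodes the bytes with the given charset and
    // feeds the resulting character stream to the SAX parser.
    static String parseText(InputStream in, Charset charset) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        // The Reader does the decoding; the parser only ever sees characters.
        Reader reader = new InputStreamReader(in, charset);
        InputSource source = new InputSource(reader);
        parser.parse(source, handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample document (an assumption for this sketch), encoded as ISO-8859-1.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><t>caf\u00e9</t>";
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);
        String text = parseText(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1);
        System.out.println(text.equals("caf\u00e9"));
    }
}
```

This only works when the charset handed to the Reader matches the actual bytes, which is why it does not help when the encoding is unknown in advance.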