Charset problems with Rome (RSS/Atom feeds)
I'm trying to create a feed aggregator using Rome (1.0). Everything is working, but I'm having problems with the feeds' charsets. I'm developing it with Java 1.6 on Mac OS X (NetBeans 6.9.1).
I'm using the following code to retrieve feeds:
InputStream is = new URL(_source).openConnection().getInputStream();
SyndFeed feed = (SyndFeed) input.build(new InputStreamReader(is, Charset.forName(_charset)));
Where _source is an RSS source (like http://rss.cnn.com/rss/edition.rss) and _charset is UTF-8 or ISO-8859-1.
It works, but for some sites with Latin characters (Portuguese ones, for example) it doesn't, whichever of the two encodings I use.
For instance, feeds read from http://oglobo.globo.com/rss/plantaopais.xml always come back with garbage characters, as follows:
Secret�rio de S�o Paulo (UTF-8)
Secretário de São Paulo (ISO-8859-1)
Why? Am I missing something?
If I try something like UTF-16, Rome throws an error: com.sun.syndication.io.ParsingFeedException: Invalid XML: Error on line 1: Content is not allowed in prolog.
I've tried other encodings, like US-ASCII, with no luck...
Another question: is Rome the best solution for dealing with feeds in Java? The most recent version of Rome is 1.0, which dates from 2009. It seems to be dead...
TIA,
Bob
Comments (1)
I don't know Rome (you could have put a link in your question). ISO-8859-1 should be the right encoding to use for the feed you linked. But doesn't your library support an InputStream as a source (so that it would itself look up the right encoding from the XML preamble)?
Could it be that the text only gets garbled afterwards, when your program outputs it? Could you print the string "São Paulo" directly from a literal in your program and report its output? (It should come out as "São Paulo" if your Java + console combination is configured right.)
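A console check along those lines could look like the sketch below. The class name ConsoleCheck and the SAMPLE constant are made up for illustration; the \u00e3 escape is used so that the encoding of the source file itself cannot interfere.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of Rome: prints a known string so that
// console/encoding problems can be told apart from parsing problems.
final class ConsoleCheck {
    // \u00e3 is the letter "a with tilde"; written as an escape on purpose.
    static final String SAMPLE = "S\u00e3o Paulo";

    public static void main(String[] args) {
        // The JVM's default charset is often what System.out uses for encoding.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println(SAMPLE);
    }
}
```

If the second line of output is already garbled, the problem is in the console setup and not in the feed parsing.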
So, I have now downloaded and compiled Rome (which took half an hour of Maven downloading other stuff), and I can reproduce the problem. It looks like the build method taking a Reader has problems. A variant that works (if rome, jdom and xerces are in the class path) passes build an InputSource wrapping the InputStream instead of a Reader; the parser itself then finds out the right charset and gets it right.
Digging a bit around in the source, it seems our SyndFeed passes the Reader or InputSource to JDOM, which in turn passes it to the SAX XMLReader, which seems to get confused if confronted with a Reader which presents itself with <?xml ... encoding="ISO-8859-1" ?>. I then dug around in the source of Xerces (which seems to be the parser used here), but didn't find anything suspicious that would cause this.
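To illustrate the difference between handing the parser bytes and handing it a Reader, here is a minimal sketch that uses only the JDK's built-in XML parser instead of Rome, so it runs without extra jars. The class EncodingDemo, its method names, and the tiny one-element "feed" are all invented for this example; it shows the general parser behaviour, not what Rome does internally.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class: same bytes, parsed once as a byte stream and
// once through a Reader with a charset chosen by the caller.
final class EncodingDemo {
    // Stand-in for a feed: ISO-8859-1 bytes with a matching XML declaration.
    static byte[] sampleFeed() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<title>S\u00e3o Paulo</title>";
        return xml.getBytes(Charset.forName("ISO-8859-1"));
    }

    // Parses the source and returns the text of the root element.
    static String titleOf(InputSource source) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(source).getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
    }

    // Byte stream: the parser reads the XML declaration and picks the charset.
    static String viaStream(byte[] bytes) {
        return titleOf(new InputSource(new ByteArrayInputStream(bytes)));
    }

    // Character stream: the bytes are decoded before the parser sees them,
    // so the caller's charset wins over the declared one.
    static String viaReader(byte[] bytes, String charset) {
        return titleOf(new InputSource(new InputStreamReader(
                new ByteArrayInputStream(bytes), Charset.forName(charset))));
    }

    public static void main(String[] args) {
        byte[] feed = sampleFeed();
        System.out.println("InputStream:         " + viaStream(feed));
        System.out.println("Reader (ISO-8859-1): " + viaReader(feed, "ISO-8859-1"));
        System.out.println("Reader (UTF-8):      " + viaReader(feed, "UTF-8"));
    }
}
```

With the byte stream the title comes back intact, while the UTF-8 Reader replaces the ã with the U+FFFD replacement character, much like the garbled output in the question. With Rome the corresponding change would presumably be to call build with new InputSource(is) in place of the InputStreamReader.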