Problems with charsets and ROME (RSS/Atom feeds)

Posted on 2024-10-24 09:34:30

I'm trying to create a feed aggregator using ROME (1.0). Everything is working, but I'm having problems with the feeds' charsets. I'm developing it with Java 1.6 on Mac OS X (NetBeans 6.9.1).

I'm using the following code to retrieve feeds:

InputStream is = new URL(_source).openConnection().getInputStream();
SyndFeed feed = (SyndFeed) input.build(new InputStreamReader(is, Charset.forName(_charset)));

Where _source is an RSS source (like http://rss.cnn.com/rss/edition.rss) and _charset is UTF-8 or ISO-8859-1.

It works for most feeds, but for some sites with accented Latin characters (Portuguese, for example) it fails with either encoding.

For instance, feeds read from http://oglobo.globo.com/rss/plantaopais.xml always come back with garbled characters, as follows:

Secret�rio de S�o Paulo (UTF-8)
SecretÃ¡rio de SÃ£o Paulo (ISO-8859-1)

Why? Am I missing something?

If I try to use something like UTF-16, ROME throws an error: com.sun.syndication.io.ParsingFeedException: Invalid XML: Error on line 1: Content is not allowed in prolog.

I've tried other encodings, like US-ASCII, with no luck...

Another question: is ROME the best solution for dealing with feeds in Java? The most recent ROME release is 1.0, which dates from 2009. It seems to be dead...

TIA,

Bob

灵芸 2024-10-31 09:34:30

I don't know ROME (you could have put a link in your question). ISO-8859-1 should be the right encoding to use for the feed you linked. But doesn't your library support an InputStream as a source (so that it would look up the right encoding from the XML declaration itself)?

Could it be that the text is read correctly and only gets garbled when your program prints it? Could you write

System.out.println("S\u00e3o Paulo");

in your program and report its output? (It should be "São Paulo" if your Java + console combination is configured right.)
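As a side note, the two garbled forms in the question look like the usual signatures of a charset mismatch. Here is a minimal sketch using only the JDK (the class and method names are made up for illustration, and it says nothing about what ROME does internally): it encodes a string with one charset and decodes the bytes with another.

```java
import java.nio.charset.Charset;

public class MojibakeDemo {

    // Encode text with one charset, then decode the resulting bytes with another.
    static String recode(String text, String encodeAs, String decodeAs) {
        byte[] bytes = text.getBytes(Charset.forName(encodeAs));
        return new String(bytes, Charset.forName(decodeAs));
    }

    public static void main(String[] args) {
        String original = "S\u00e3o Paulo";

        // UTF-8 bytes read as ISO-8859-1: the two bytes of U+00E3 (0xC3 0xA3)
        // become two separate characters, U+00C3 and U+00A3.
        String utf8AsLatin1 = recode(original, "UTF-8", "ISO-8859-1");

        // ISO-8859-1 bytes read as UTF-8: the lone byte 0xE3 is not valid
        // UTF-8, so the decoder substitutes the replacement character U+FFFD.
        String latin1AsUtf8 = recode(original, "ISO-8859-1", "UTF-8");

        System.out.println(utf8AsLatin1);
        System.out.println(latin1AsUtf8);
    }
}
```

What you actually see printed still depends on your console encoding, which is why the println check above is worth doing first.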

So, I have now downloaded and compiled ROME (which took half an hour while Maven downloaded other stuff), and I can reproduce the problem. It looks like the build method that takes a Reader has problems.

Here is a variant that works (if rome, jdom and xerces are in the class path):

package de.fencing_game.paul.examples.rome;

import org.xml.sax.InputSource;

import java.io.*;
import java.net.*;

import com.sun.syndication.io.*;
import com.sun.syndication.feed.synd.*;

public class RomeTest {

    public static void main(String[] ignored)
        throws IOException, FeedException
    {
        String url = "http://oglobo.globo.com/rss/plantaopais.xml";

        // Open the feed as raw bytes; no Reader, so no charset is chosen here.
        InputStream is = new URL(url).openConnection().getInputStream();
        try {
            // Wrapping the byte stream lets the XML parser detect the encoding.
            InputSource source = new InputSource(is);

            SyndFeedInput input = new SyndFeedInput();
            SyndFeed feed = input.build(source);

            System.out.println("description: " + feed.getDescription());
        } finally {
            is.close();
        }
    }
}

By using an InputSource with an InputStream instead of a Reader, the parser itself finds out the right charset, and gets it right.
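To illustrate the general difference between the two kinds of input, here is a small sketch that needs only the JDK's built-in XML parser, not ROME. The sample document and all names in it are invented for the demo, and it does not reproduce the ROME behaviour itself. It only shows that a parser given bytes reads the encoding from the XML declaration, while a parser given a Reader has to trust whatever charset the Reader was created with.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class EncodingDemo {

    // A tiny ISO-8859-1 document standing in for a real feed (made up for this demo).
    static byte[] sampleBytes() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                   + "<title>S\u00e3o Paulo</title>";
        return xml.getBytes(Charset.forName("ISO-8859-1"));
    }

    // Parse the source and return the text of the root element.
    static String rootText(InputSource source) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(source)
                .getDocumentElement()
                .getTextContent();
    }

    // Byte stream: the parser picks the encoding from the XML declaration.
    static String parseFromStream() throws Exception {
        InputStream in = new ByteArrayInputStream(sampleBytes());
        return rootText(new InputSource(in));
    }

    // Character stream: the bytes are decoded up front with the given charset,
    // so a wrong guess corrupts the text before the parser ever sees it.
    static String parseFromReader(String charset) throws Exception {
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(sampleBytes()), Charset.forName(charset));
        return rootText(new InputSource(reader));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseFromStream());
        System.out.println(parseFromReader("ISO-8859-1"));
        System.out.println(parseFromReader("UTF-8"));
    }
}
```

In this sketch the first two calls return the intact text, and the third one comes back damaged because the Reader was told UTF-8 while the bytes are ISO-8859-1.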

Digging around a bit in the source, it seems SyndFeedInput passes the Reader or InputSource to JDOM, which in turn passes it to the SAX XMLReader, which seems to get confused when confronted with a Reader whose content starts with <?xml ... encoding="ISO-8859-1" ?>. I then dug around in the source of Xerces (which seems to be the parser used here), but didn't find anything suspicious that would cause this.
