Determining whether a feed is Atom or RSS

Posted 2024-12-07 06:11:05

I'm trying to determine whether a given feed is Atom-based or RSS-based.

Here's my code:

public boolean isRSS(String URL) throws ParserConfigurationException, SAXException, IOException {
    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document doc = builder.parse(URL);
    return doc.getDocumentElement().getNodeName().equalsIgnoreCase() == "rss";
}

Is there a better way to do it? Would it be better if I used a SAX parser instead?

水中月 2024-12-14 06:11:05

The root element is the easiest way to determine the type of a feed.

RSS feeds have the root element rss (see specification)
Atom feeds have the root element feed (see specification)

Different parsers offer different ways to get the root element, and none is inferior to the others. Enough has been written about StAX vs. SAX vs. DOM etc. to serve as a basis for a specific decision.
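As one illustration of the root-element rule, here is a minimal DOM-based sketch. The class name `RootElementCheck`, the `FeedType` enum, and the `detect` methods are hypothetical names chosen for this example, not part of the question. It uses a case-sensitive `equals()` because XML names are case-sensitive, and since the parser is not namespace-aware, a prefixed root such as `atom:feed` would come back as `UNKNOWN`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper: classifies a feed by the name of its root element.
class RootElementCheck {
    enum FeedType { RSS, ATOM, UNKNOWN }

    // Parses the whole document with DOM and inspects the root element name.
    static FeedType detect(InputStream in)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(in);
        String root = doc.getDocumentElement().getNodeName();
        if ("rss".equals(root)) {
            return FeedType.RSS;
        }
        if ("feed".equals(root)) {
            return FeedType.ATOM;
        }
        return FeedType.UNKNOWN;
    }

    // Convenience overload for in-memory XML; wraps checked exceptions.
    static FeedType detect(String xml) {
        try {
            return detect(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalArgumentException("Could not parse feed", ex);
        }
    }
}
```

Returning a three-valued type instead of a boolean keeps "neither RSS nor Atom" distinguishable from "not RSS".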

There is nothing wrong with your first two lines of code:

DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(URL);

The mistake is in the string comparison in your return statement.

When you use the comparison operator == with Strings, it compares references, not values (i.e. it checks whether both are exactly the same object). You should use the equals() method here. To be on the safe side, I would recommend using equalsIgnoreCase():

return doc.getDocumentElement().getNodeName().equalsIgnoreCase("rss");
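To make the difference concrete, here is a small sketch of the two kinds of comparison. The class and method names are hypothetical and exist only for this illustration.

```java
// Hypothetical demo class contrasting reference and value comparison of Strings.
class StringComparisonDemo {
    // True only when both variables point to the very same String object.
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    // True when both Strings contain the same sequence of characters.
    static boolean sameValue(String a, String b) {
        return a.equals(b);
    }
}
```

With `String name = new String("rss");`, `sameReference(name, "rss")` is false while `sameValue(name, "rss")` is true. A node name produced by a parser at runtime is typically in the same situation as `name` here: equal in content to the literal, but a different object.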

Hint: If you check for "rss" instead of "feed" (as you would for Atom) in your isRSS() method, you don't need the ternary operator.


葮薆情 2024-12-14 06:11:05

Sniffing the content is one method. But note that Atom uses namespaces, and you are creating a parser that is not namespace-aware.

public boolean isAtom(String URL) throws ParserConfigurationException, SAXException, IOException {
    DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
    f.setNamespaceAware(true);  // required for getLocalName() and getNamespaceURI()
    DocumentBuilder builder = f.newDocumentBuilder();  // build from the configured factory
    Document doc = builder.parse(URL);
    Element e = doc.getDocumentElement();
    // Constant-first equals() avoids a NullPointerException when the root has no namespace
    return "feed".equals(e.getLocalName()) &&
            "http://www.w3.org/2005/Atom".equals(e.getNamespaceURI());
}

Note also that you cannot compare using equalsIgnoreCase(), since XML element names are case-sensitive.

Another method is to check the Content-Type header, if it is available in the response to an HTTP GET request. The Content-Type for Atom would be application/atom+xml and for RSS application/rss+xml. I suspect, though, that not all RSS feeds can be trusted to set this header correctly.

A third option is to look at the URL suffix, e.g. .atom and .rss.

The last two methods are easily configurable if you are using Spring or JAX-RS.
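Independent of any framework, the header and suffix checks could be sketched in plain Java as below. The class `FeedHints`, its `FeedType` enum, and both method names are hypothetical. The header value is assumed to have been obtained elsewhere, for example from `URLConnection.getContentType()`.

```java
import java.util.Locale;

// Hypothetical helper: guesses the feed type from metadata rather than content.
class FeedHints {
    enum FeedType { RSS, ATOM, UNKNOWN }

    // Classifies a Content-Type header value such as "application/atom+xml; charset=UTF-8".
    static FeedType fromContentType(String contentType) {
        if (contentType == null) {
            return FeedType.UNKNOWN;
        }
        // Drop parameters like "; charset=UTF-8" and normalise case before matching.
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        switch (mediaType) {
            case "application/atom+xml":
                return FeedType.ATOM;
            case "application/rss+xml":
                return FeedType.RSS;
            default:
                return FeedType.UNKNOWN;
        }
    }

    // Classifies a URL by its path suffix, ignoring any query string or fragment.
    static FeedType fromUrlSuffix(String url) {
        if (url == null) {
            return FeedType.UNKNOWN;
        }
        String path = url.split("[?#]", 2)[0].toLowerCase(Locale.ROOT);
        if (path.endsWith(".atom")) {
            return FeedType.ATOM;
        }
        if (path.endsWith(".rss")) {
            return FeedType.RSS;
        }
        return FeedType.UNKNOWN;
    }
}
```

Both methods return `UNKNOWN` rather than guessing, so a caller can fall back to sniffing the root element when a server sends a generic type such as text/xml or the URL has no telling suffix.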


甜妞爱困 2024-12-14 06:11:05

You could use a StAX parser to avoid parsing the entire XML document into memory:

public boolean isAtom(String url) throws XMLStreamException, IOException {
    XMLInputFactory xif = XMLInputFactory.newFactory();
    // createXMLStreamReader() needs an InputStream, so open the stream rather than the connection
    try (InputStream in = new URL(url).openStream()) {
        XMLStreamReader xsr = xif.createXMLStreamReader(in);
        xsr.nextTag();  // Advance to the root element
        return "feed".equals(xsr.getLocalName()) &&
                "http://www.w3.org/2005/Atom".equals(xsr.getNamespaceURI());
    }
}
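A slightly more defensive variant is sketched below, under a few assumptions. `nextTag()` is documented to throw if it meets anything other than whitespace, comments, or processing instructions before a tag, which as far as I can tell includes a DOCTYPE declaration, so this version loops over `next()` until the first start element instead. The class name `StaxFeedSniffer` and the String overload are hypothetical additions for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: reads only as far as the root element using StAX.
class StaxFeedSniffer {
    static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    // Returns true if the first start element is <feed> in the Atom namespace.
    static boolean isAtom(InputStream in) throws XMLStreamException {
        XMLStreamReader xsr = XMLInputFactory.newFactory().createXMLStreamReader(in);
        try {
            while (xsr.hasNext()) {
                // Skip everything that precedes the root element.
                if (xsr.next() == XMLStreamConstants.START_ELEMENT) {
                    return "feed".equals(xsr.getLocalName())
                            && ATOM_NS.equals(xsr.getNamespaceURI());
                }
            }
            return false;  // no element found at all
        } finally {
            xsr.close();  // releases the reader; the caller still owns the stream
        }
    }

    // Convenience overload for in-memory XML; wraps the checked exception.
    static boolean isAtom(String xml) {
        try {
            return isAtom(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (XMLStreamException ex) {
            throw new IllegalArgumentException("Could not parse feed", ex);
        }
    }
}
```

Because the method returns at the first start element, the rest of the feed is never parsed, which is the main advantage of streaming over DOM for this check.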
