How do I extract feed URLs from an OPML file exported from Google Reader?

Posted on 2024-11-03 03:11:31

I have a piece of software called Rss-Aware that I'm trying to use. It is basically a desktop feed checker that checks whether RSS feeds have been updated and sends a notification through Ubuntu's Notify-OSD system.

However, to tell it which feeds to check, you have to list the feed URLs in a text file at ~/.rss-aware/rssfeeds.txt, one per line. Something like:

http://example.com/feed.xml
http://othersite.org/feed.xml
http://othergreatsite.net/rss.xml

...Seems pretty simple, right? Well, the list of feeds I'd like to use was exported from Google Reader as an OPML file (a type of XML), and I have no clue how to parse it to output just the feed URLs. It seems like it should be pretty straightforward, yet I'm stumped.

I'd love it if anyone could give an implementation in Python or Ruby, or something I could run quickly from a prompt. A bash script would be awesome.

Thank you so much for the help. I'm a really weak programmer and would love to learn how to do this basic parsing.

EDIT: Also, here is the OPML file I'm trying to extract the feed URLs from.

Comments (5)

汐鸠 2024-11-10 03:11:31

I wrote a subscription list parser for this very purpose. It's called listparser, and it's written in Python. I just tested your OPML file, and it appears to parse the file perfectly. It will also make your feeds' labels available.

If you've ever used feedparser, the interface should be familiar:

>>> import listparser as lp
>>> d = lp.parse('https://dl.dropbox.com/u/670189/google-reader-subscriptions.xml')
>>> len(d.feeds)
112
>>> d.feeds[100].url
u'http://longreads.com/rss'
>>> d.feeds[100].tags
[u'reading']

It's possible to create the file with feed URLs using a script similar to:

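import listparser as lp

# Parse the exported subscription list.
d = lp.parse('https://dl.dropbox.com/u/670189/google-reader-subscriptions.xml')
# Write one feed URL per line; the with-block closes the file automatically.
with open('/home/USERNAME/.rss-aware/rssfeeds.txt', 'w') as f:
    for feed in d.feeds:
        f.write(feed.url + '\n')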

Just replace USERNAME with your actual username. Done!

素食主义者 2024-11-10 03:11:31

XML parsing was so easy to implement and worked great for me.

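from xml.etree import ElementTree

def extract_rss_urls_from_opml(filename):
    """Return the xmlUrl of every <outline> element in the OPML file."""
    urls = []
    # Pass the path directly so the parser honors the file's declared encoding.
    tree = ElementTree.parse(filename)
    for node in tree.findall('.//outline'):
        url = node.attrib.get('xmlUrl')
        if url:  # skip outlines without a feed URL, such as folders
            urls.append(url)
    return urls

urls = extract_rss_urls_from_opml('your_file')

红衣飘飘貌似仙 2024-11-10 03:11:31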

Since it's an XML file, you can use an XPath query to extract the URLs.
In the XML file, it looks like the RSS feed URLs are stored in xmlUrl attributes. The XPath expression //@xmlUrl selects every xmlUrl attribute in the document.

If you want to test this out in your web browser, you can use an online XPath tester. If you want to perform this XPath query in Python, this question explains how to use XPath in Python. Additionally, the lxml docs have a page on using XPath in lxml that might be helpful.
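
As a rough illustration of the attribute-based approach, here is a minimal sketch that uses only Python's standard library. It rests on a few assumptions: xml.etree.ElementTree supports just a subset of XPath and cannot return attribute nodes, so instead of //@xmlUrl the sketch selects every element that carries an xmlUrl attribute and then reads its value (lxml's xpath() should accept //@xmlUrl directly). The SAMPLE_OPML document and the feed_urls helper are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like a Google Reader export: one feed inside a
# folder outline and one feed at the top level.
SAMPLE_OPML = """<opml version="1.0">
  <head><title>subscriptions</title></head>
  <body>
    <outline title="reading" text="reading">
      <outline text="Example" title="Example" type="rss"
               xmlUrl="http://example.com/feed.xml" htmlUrl="http://example.com/"/>
    </outline>
    <outline text="Other" title="Other" type="rss"
             xmlUrl="http://othersite.org/feed.xml" htmlUrl="http://othersite.org/"/>
  </body>
</opml>
"""


def feed_urls(opml_text):
    """Return the xmlUrl value of every element that has one, in document order."""
    root = ET.fromstring(opml_text)
    # ".//*[@xmlUrl]" matches any descendant element with an xmlUrl attribute.
    return [element.get("xmlUrl") for element in root.iterfind(".//*[@xmlUrl]")]


if __name__ == "__main__":
    # One URL per line, the layout rssfeeds.txt expects.
    print("\n".join(feed_urls(SAMPLE_OPML)))
```

Redirecting that output to ~/.rss-aware/rssfeeds.txt would give the one-URL-per-line file described in the question.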

我纯我任性 2024-11-10 03:11:31

You could also use a regex. I used the following search-and-replace regex to convert my Google Reader OPML export to a Firefox HTML live-bookmark import:

^\s+<outline.*?title="(.*?)".*?xmlUrl="(.*?)".*?htmlUrl="(.*?)".*?/>
<DT><A FEEDURL="$2" HREF="$3">$1</A>
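
For anyone who would rather run that search-and-replace from a script than from an editor, the following is a hedged sketch of the same pattern in Python. It assumes, as the regex above does, that each feed sits on its own indented line with the attributes in the order title, xmlUrl, htmlUrl; a real XML parser is more robust if that does not hold. Python's re module writes backreferences as \1 rather than $1, and the SAMPLE text and both helper names are invented for the example.

```python
import re

# Same pattern as above; MULTILINE lets ^ match at the start of every line.
OUTLINE_RE = re.compile(
    r'^\s+<outline.*?title="(.*?)".*?xmlUrl="(.*?)".*?htmlUrl="(.*?)".*?/>',
    re.MULTILINE,
)

# Hypothetical excerpt: a folder line followed by one indented feed line.
SAMPLE = (
    '<outline title="reading" text="reading">\n'
    '    <outline text="Example" title="Example" type="rss" '
    'xmlUrl="http://example.com/feed.xml" htmlUrl="http://example.com/"/>\n'
    '</outline>\n'
)


def to_bookmarks(opml_text):
    """Rewrite each matching feed line as a live-bookmark <DT> entry."""
    return OUTLINE_RE.sub(r'<DT><A FEEDURL="\2" HREF="\3">\1</A>', opml_text)


def feed_urls(opml_text):
    """Return only the captured xmlUrl values, one per matching line."""
    return [match.group(2) for match in OUTLINE_RE.finditer(opml_text)]


if __name__ == "__main__":
    print(to_bookmarks(SAMPLE))
    print("\n".join(feed_urls(SAMPLE)))
```

The second helper reuses the same capture groups to pull out just the feed URLs, which is closer to what the question asks for.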
终难遇 2024-11-10 03:11:31

There are a number of Python packages that could help. This one is really old (as is this question) and likely no longer maintained (I can't even find the source code), but it is quite simple to use. As a Python one-liner (putting all the Python code on the command line):

$ pip install opml
$ python3 -c 'import opml; o=opml.parse("stitcher.opml"); print(*[x.xmlUrl for x in o], sep="\n")'

This prints one URL per line from the OPML file. Alternatively, just change the print call as desired. Since the package is not particularly useful beyond this, I'd uninstall it once you're done: pip uninstall opml (see: https://pypi.org/project/opml/ )
