如何使用 scrapy 的 XmlFeedSpider 解析 sitemap.xml 文件？

发布于 2024-10-31 09:25:32 字数 3137 浏览 4 评论 0原文

我正在尝试使用 scrapy 解析 sitemap.xml 文件，站点地图文件类似于以下文件，只有更多的 url 节点。

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.sitemaps.org/schemas/sitemap-video/1.1">
    <url>
        <loc>
            http://www.site.com/page.html
        </loc>
        <video:video>
            <video:thumbnail_loc>
                http://www.site.com/thumb.jpg
            </video:thumbnail_loc>
            <video:content_loc>http://www.example.com/video123.flv</video:content_loc>
            <video:player_loc allow_embed="yes" autoplay="ap=1">
                http://www.example.com/videoplayer.swf?video=123
            </video:player_loc>
            <video:title>here is the page title</video:title>
            <video:description>and an awesome description</video:description>
            <video:duration>302</video:duration>
            <video:publication_date>2011-02-24T02:03:43+02:00</video:publication_date>
            <video:tag>w00t</video:tag>
            <video:tag>awesome</video:tag>
            <video:tag>omgwtfbbq</video:tag>
            <video:tag>kthxby</video:tag>
        </video:video>
    </url>
</urlset>

我查看了相关的 scrapy 的文档，我写了以下代码片段来看看是否我正在做正确的方式（似乎我没有^^）：

class SitemapSpider(XMLFeedSpider):
    name = "sitemap"
    namespaces = [
        ('', 'http://www.sitemaps.org/schemas/sitemap/0.9'),
        ('video', 'http://www.sitemaps.org/schemas/sitemap-video/1.1'),
    ]
    start_urls = ["http://example.com/sitemap.xml"]
    itertag = 'url'

    def parse_node(self, response, node):
        print "Parsing: %s" % str(node)

但是当我运行蜘蛛时，我收到此错误：

File "/.../python2.7/site-packages/scrapy/utils/iterators.py", line 32, in xmliter
    yield XmlXPathSelector(text=nodetext).select('//' + nodename)[0]
    exceptions.IndexError: list index out of range

我认为我没有定义“默认”命名空间（http://www.sitemaps.org/schemas/sitemap/0.9) 正确，但我找不到如何做这个。

迭代 url 节点然后能够从其子节点中提取所需信息的正确方法是什么？

答案：

不幸的是，我无法使用XMLFeedSpider（这应该是使用scrapy解析XML的方法），但是感谢 simplebias 的回答，我已经能够找到一种方法来实现这种“老式方式”。我想出了以下代码（这次有效！）：

class SitemapSpider(BaseSpider):
    name = 'sitemap'
    namespaces = {
        'sitemap': 'http://www.sitemaps.org/schemas/sitemap/0.9',
        'video': 'http://www.sitemaps.org/schemas/sitemap-video/1.1',
    }

    def parse(self, response):
        xxs = XmlXPathSelector(response)
        for namespace, schema in self.namespaces.iteritems():
            xxs.register_namespace(namespace, schema)
        for urlnode in xxs.select('//sitemap:url'):
            extract_datas_here()

原文

I am trying to parse sitemap.xml files using scrapy, the sitemap files are like the following one with just much more url nodes.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.sitemaps.org/schemas/sitemap-video/1.1">
    <url>
        <loc>
            http://www.site.com/page.html
        </loc>
        <video:video>
            <video:thumbnail_loc>
                http://www.site.com/thumb.jpg
            </video:thumbnail_loc>
            <video:content_loc>http://www.example.com/video123.flv</video:content_loc>
            <video:player_loc allow_embed="yes" autoplay="ap=1">
                http://www.example.com/videoplayer.swf?video=123
            </video:player_loc>
            <video:title>here is the page title</video:title>
            <video:description>and an awesome description</video:description>
            <video:duration>302</video:duration>
            <video:publication_date>2011-02-24T02:03:43+02:00</video:publication_date>
            <video:tag>w00t</video:tag>
            <video:tag>awesome</video:tag>
            <video:tag>omgwtfbbq</video:tag>
            <video:tag>kthxby</video:tag>
        </video:video>
    </url>
</urlset>

I looked at the related scrapy's documentation, and i wrote the following snippet to see if i was doing the right way (and it seems i don't ^^):

class SitemapSpider(XMLFeedSpider):
    name = "sitemap"
    namespaces = [
        ('', 'http://www.sitemaps.org/schemas/sitemap/0.9'),
        ('video', 'http://www.sitemaps.org/schemas/sitemap-video/1.1'),
    ]
    start_urls = ["http://example.com/sitemap.xml"]
    itertag = 'url'

    def parse_node(self, response, node):
        print "Parsing: %s" % str(node)

But when i run the spider, i get this error:

File "/.../python2.7/site-packages/scrapy/utils/iterators.py", line 32, in xmliter
    yield XmlXPathSelector(text=nodetext).select('//' + nodename)[0]
    exceptions.IndexError: list index out of range

I think i'm not defining the "default" namespace (http://www.sitemaps.org/schemas/sitemap/0.9) properly, but i can't find how to do this.

What's the correct way to iterate over the url nodes and then be able to extract the needed infos from its childs?

ANSWER:

Unfortunately, i wasn't able to use the XMLFeedSpider (which is supposed to be the way to parse XML with scrapy), but thanks to simplebias' answer, i have been able to figure a way to achieve this "the old-school way". I came up with the following code (which works, this time!):

class SitemapSpider(BaseSpider):
    name = 'sitemap'
    namespaces = {
        'sitemap': 'http://www.sitemaps.org/schemas/sitemap/0.9',
        'video': 'http://www.sitemaps.org/schemas/sitemap-video/1.1',
    }

    def parse(self, response):
        xxs = XmlXPathSelector(response)
        for namespace, schema in self.namespaces.iteritems():
            xxs.register_namespace(namespace, schema)
        for urlnode in xxs.select('//sitemap:url'):
            extract_datas_here()

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

情深已缘浅 2024-11-07 09:25:32

Scrapy 在底层使用 lxml / libxml2，最终调用 node.xpath() 方法来执行选择。 xpath 表达式中带有命名空间的任何元素都必须带有前缀，并且必须传递映射来告诉选择器每个前缀解析到哪个命名空间。

下面是一个示例，说明如何在使用 node.xpath() 方法时将前缀映射到名称空间：

doc = '<root xmlns="chaos"><bar /></root>'
tree = lxml.etree.fromstring(doc)
tree.xpath('//bar')
[]
tree.xpath('//x:bar', namespaces={'x': 'chaos'})
[<Element {chaos}bar at 7fa40f9c50a8>]

如果没有使用这个 scrapy XMLFeedSpider 类，我猜您的名称空间映射和 itertag 需要遵循相同的方案：

class SitemapSpider(XMLFeedSpider):
    namespaces = [
        ('sm', 'http://www.sitemaps.org/schemas/sitemap/0.9'),
        ]
     itertag = 'sm:url'

Scrapy uses lxml / libxml2 under the hood, eventually invoking the node.xpath() method to perform the selection. Any elements in your xpath expression which are namespaced must be prefixed, and you must pass a mapping to tell the selector which namespace each prefix resolves to.

Here is an example to illustrate how to map prefixes to namespaces when using the node.xpath() method:

doc = '<root xmlns="chaos"><bar /></root>'
tree = lxml.etree.fromstring(doc)
tree.xpath('//bar')
[]
tree.xpath('//x:bar', namespaces={'x': 'chaos'})
[<Element {chaos}bar at 7fa40f9c50a8>]

Without having used this scrapy XMLFeedSpider class, I'm guessing your namespace map and itertag need to follow the same scheme:

class SitemapSpider(XMLFeedSpider):
    namespaces = [
        ('sm', 'http://www.sitemaps.org/schemas/sitemap/0.9'),
        ]
     itertag = 'sm:url'

回复收藏 0 原文