How do I parse a sitemap.xml file using Scrapy's XMLFeedSpider?
I am trying to parse sitemap.xml files using Scrapy. The sitemap files look like the following one, only with many more url nodes.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.sitemaps.org/schemas/sitemap-video/1.1">
    <url>
        <loc>
            http://www.site.com/page.html
        </loc>
        <video:video>
            <video:thumbnail_loc>
                http://www.site.com/thumb.jpg
            </video:thumbnail_loc>
            <video:content_loc>http://www.example.com/video123.flv</video:content_loc>
            <video:player_loc allow_embed="yes" autoplay="ap=1">
                http://www.example.com/videoplayer.swf?video=123
            </video:player_loc>
            <video:title>here is the page title</video:title>
            <video:description>and an awesome description</video:description>
            <video:duration>302</video:duration>
            <video:publication_date>2011-02-24T02:03:43+02:00</video:publication_date>
            <video:tag>w00t</video:tag>
            <video:tag>awesome</video:tag>
            <video:tag>omgwtfbbq</video:tag>
            <video:tag>kthxby</video:tag>
        </video:video>
    </url>
</urlset>
I looked at the relevant Scrapy documentation and wrote the following snippet to see whether I was doing it the right way (it seems I'm not ^^):
# Assumed import path for the old Scrapy release shown in the traceback.
from scrapy.contrib.spiders import XMLFeedSpider


class SitemapSpider(XMLFeedSpider):
    name = "sitemap"
    namespaces = [
        ('', 'http://www.sitemaps.org/schemas/sitemap/0.9'),
        ('video', 'http://www.sitemaps.org/schemas/sitemap-video/1.1'),
    ]
    start_urls = ["http://example.com/sitemap.xml"]
    itertag = 'url'

    def parse_node(self, response, node):
        print "Parsing: %s" % str(node)
But when I run the spider, I get this error:
File "/.../python2.7/site-packages/scrapy/utils/iterators.py", line 32, in xmliter
yield XmlXPathSelector(text=nodetext).select('//' + nodename)[0]
exceptions.IndexError: list index out of range
I think I'm not defining the "default" namespace (http://www.sitemaps.org/schemas/sitemap/0.9) properly, but I can't find out how to do this.
What's the correct way to iterate over the url nodes and then extract the needed information from their children?
ANSWER:
Unfortunately, I wasn't able to use XMLFeedSpider (which is supposed to be the way to parse XML with Scrapy), but thanks to simplebias' answer I was able to work out a way to do this the "old-school way". I came up with the following code (which works, this time!):
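# Assumed import paths for the old Scrapy release used in the question.
from scrapy.selector import XmlXPathSelector
from scrapy.spider import BaseSpider


class SitemapSpider(BaseSpider):
    name = 'sitemap'
    namespaces = {
        'sitemap': 'http://www.sitemaps.org/schemas/sitemap/0.9',
        'video': 'http://www.sitemaps.org/schemas/sitemap-video/1.1',
    }

    def parse(self, response):
        xxs = XmlXPathSelector(response)
        # Register every prefix so it can be used in XPath expressions.
        for namespace, schema in self.namespaces.iteritems():
            xxs.register_namespace(namespace, schema)
        # The default sitemap namespace is reached through the 'sitemap' prefix.
        for urlnode in xxs.select('//sitemap:url'):
            extract_datas_here()  # placeholder for the per-node extraction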
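
As a rough illustration of what the extraction placeholder might do, here is a minimal sketch that uses only Python's standard-library xml.etree.ElementTree instead of Scrapy, so it can be run on its own. The names `extract_url_data`, `parse_sitemap`, `SAMPLE` and the keys of the returned dict are hypothetical and chosen just for this example, and the sample is a shortened copy of the sitemap from the question. The prefixed paths should carry over to the selector calls inside the loop above, but that part is not tested here.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace map, mirroring the spider's `namespaces` dict.
NAMESPACES = {
    "sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "video": "http://www.sitemaps.org/schemas/sitemap-video/1.1",
}

# Shortened copy of the sitemap from the question (assumed sample data).
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.sitemaps.org/schemas/sitemap-video/1.1">
  <url>
    <loc>
      http://www.site.com/page.html
    </loc>
    <video:video>
      <video:title>here is the page title</video:title>
      <video:duration>302</video:duration>
      <video:tag>w00t</video:tag>
      <video:tag>awesome</video:tag>
    </video:video>
  </url>
</urlset>
"""


def extract_url_data(url_node):
    """Collect a few fields from one <url> element into a plain dict."""

    def text(path):
        # findtext returns None when no element matches the path.
        value = url_node.findtext(path, namespaces=NAMESPACES)
        return value.strip() if value is not None else None

    return {
        "loc": text("sitemap:loc"),
        "title": text("video:video/video:title"),
        "duration": text("video:video/video:duration"),
        "tags": [
            tag.text
            for tag in url_node.findall("video:video/video:tag", NAMESPACES)
        ],
    }


def parse_sitemap(xml_text):
    """Return one dict per <url> element found directly under <urlset>."""
    root = ET.fromstring(xml_text)
    return [
        extract_url_data(node)
        for node in root.findall("sitemap:url", NAMESPACES)
    ]


if __name__ == "__main__":
    for entry in parse_sitemap(SAMPLE):
        print(entry)
```

Note that the loc text is stripped, because the sitemap in the question puts the URL on its own line surrounded by whitespace.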
Comments (2):
Scrapy uses lxml / libxml2 under the hood, eventually invoking the node.xpath() method to perform the selection. Any namespaced elements in your XPath expression must be prefixed, and you must pass a mapping that tells the selector which namespace each prefix resolves to. The same prefix-to-namespace mapping is needed when calling node.xpath() directly.

Without having used Scrapy's XMLFeedSpider class, I'm guessing your namespace map and itertag need to follow the same scheme.
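
To make the prefix mapping concrete, here is a small sketch. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra dependencies; lxml's xpath() takes a comparable mapping through its `namespaces=` keyword argument. The prefix `sm`, the `DOC` string and the variable names are arbitrary choices made for this example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Tiny document that declares the sitemap namespace as its default namespace.
DOC = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>http://www.site.com/page.html</loc></url>"
    "</urlset>"
)

root = ET.fromstring(DOC)

# The default namespace applies to every unprefixed element, so the real
# tag name is the namespace-qualified one.
qualified_tag = root.tag

# A bare name carries no namespace, so it matches nothing in this document.
unprefixed = root.findall(".//url")

# With a prefix and a mapping, the same query finds the element.
# "sm" is an arbitrary prefix; only the URI it maps to matters.
prefixed = root.findall(".//sm:url", {"sm": SITEMAP_NS})

print(qualified_tag)
print(len(unprefixed), len(prefixed))
```

An empty result for the unprefixed query is probably what sits behind the IndexError in the question, since the traceback shows the first item being taken from the selection. By analogy, the spider's namespaces list would likely need a non-empty prefix such as ('sm', 'http://www.sitemaps.org/schemas/sitemap/0.9') together with itertag = 'sm:url'; that part is an untested guess here.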
I found the difference between hxs and xxs helpful. I found it difficult to locate the xxs object. I had been trying to use one form when others worked far better for what I needed.