Extracting data from an HTML path with Scrapy for Python
Overview of my project:
I'm trying to create a simple script in Python 2.6 that will get traffic time data from Bing Maps. I'm using the Scrapy library (scrapy.org/) to crawl each site and extract the data from Bing Maps.
The picture above shows what I want (the highlighted data for now, but ultimately the time below it will be needed too).
I first did a test to see if the start URL would go through, and used the output log to print the URL when it did. Once that worked, my next step was to try to extract the data I need from the webpage.
I have been using the Firebug, XPather, and XPath Firefox add-ons to find the HTML path of the data I want to extract. This link has been pretty helpful in guiding me toward correctly coding the paths (doc.scrapy.org/topics/selectors.html). From looking at Firebug, this is what I want to extract...
<span class="time">22 min</span>
and XPather shows this as the path for this particular item. ...
/div[@id='TaskHost_DrivingDirectionsSummaryContainer']/div[1]/span[3]
When I run the program in cmd with the path above, the extracted data prints out as [], and when I add /class='time' to the end of the span, it prints out as [u'False']. Looking a bit closer in Firebug's DOM window, I noticed that class="time" shows false for isID, and that the childNode held the data I needed. How do I extract the data from the childNode?
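(Note: in XPath, appending /class='time' to the span step selects a child element named class, which does not exist, and compares it to the string 'time'; the expression therefore evaluates to a boolean, which is why extract() returns [u'False']. The class is an attribute, reached with @class, and the "22 min" string is the span's text node. A sketch of how the selector could be written against the markup shown above, assuming that markup is actually present in the downloaded HTML:

x = HtmlXPathSelector(response)
# match the span by its class attribute and extract its text node ("22 min")
x.select("//div[@id='TaskHost_DrivingDirectionsSummaryContainer']/div[1]/span[@class='time']/text()").extract()
# or keep the positional path from XPather and just append text()
x.select("//div[@id='TaskHost_DrivingDirectionsSummaryContainer']/div[1]/span[3]/text()").extract()

As the first answer below points out, even these return [] if the span is only created by JavaScript after the page loads.)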
Below is my code so far
from scrapy import log # This module is useful for printing out debug information
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector, XPathSelectorList, XmlXPathSelector
import html5lib

class BingSpider(BaseSpider):
    name = 'bing.com/maps'
    allowed_domains = ["bing.com/maps"]
    start_urls = [
        "http://www.bing.com/maps/?FORM=Z9LH4#Y3A9NDAuNjM2MDAxNTg1OTk5OTh+LTc0LjkxMTAwMzExMiZsdmw9OCZzdHk9ciZydHA9cG9zLjQwLjcxNDU0OF8tNzQuMDA3MTI1X05ldyUyMFlvcmslMkMlMjBOWV9fX2VffnBvcy40MC43MzE5N18tNzQuMTc0MTg1MDAwMDAwMDRfTmV3YXJrJTJDJTIwTkpfX19lXyZtb2RlPUQmcnRvcD0wfjB+MH4="
    ]

    def parse(self, response):
        self.log('A response from %s just arrived!' % response.url)
        x = HtmlXPathSelector(response)
        time = x.select("//div[@id='TaskHost_DrivingDirectionsSummaryContainer']/div[1]/span[3]").extract()
        print time
CMD output
2011-09-05 17:43:01-0400 [scrapy] DEBUG: Enabled item pipelines:
2011-09-05 17:43:01-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2011-09-05 17:43:01-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-09-05 17:43:01-0400 [bing.com] INFO: Spider opened
2011-09-05 17:43:02-0400 [bing.com] DEBUG: Crawled (200) <GET http://www.bing.com/maps/#Y3A9NDAuNzIzMjYwOTYzMTUwMDl+LTc0LjA5MDY1NSZsdmw9MTImc3R5PXImcnRwPXBvcy40MC43MzE5N18tNzQuMTc0MTg1X05ld2FyayUyQyUyME5KX19fZV9+cG9zLjQwLjcxNDU0OF8tNzQuMDA3MTI0OTk5OTk5OTdfTmV3JTIwWW9yayUyQyUyME5ZX19fZV8mbW9kZT1EJnJ0b3A9MH4wfjB+> (referer: None)
2011-09-05 17:43:02-0400 [bing.com] DEBUG: A response from http://www.bing.com/maps/ just arrived!
[]
2011-09-05 17:43:02-0400 [bing.com] INFO: Closing spider (finished)
2011-09-05 17:43:02-0400 [bing.com] INFO: Spider closed (finished)
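For iterating on the selector without re-running the spider, Scrapy's interactive shell can help. A sketch, assuming the 0.x-era shell that exposes the downloaded response as an hxs selector (newer releases use response.xpath instead); the URL fragment below is a placeholder for the encoded route from the start URL above:

scrapy shell "http://www.bing.com/maps/?FORM=Z9LH4#<encoded-route-from-start-url>"
>>> hxs.select("//div[@id='TaskHost_DrivingDirectionsSummaryContainer']").extract()
>>> hxs.select("//span[@class='time']/text()").extract()

If the first select already returns [], the container div is not in the raw HTML that Scrapy downloads, and no XPath will find the time inside it.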
Answers (2)
When a website relies heavily on JavaScript, you cannot trust an XPath taken from the live page, because it describes the DOM after the JavaScript has run, and Scrapy does not run JavaScript.
You should:
Open the Network tab of your browser's developer tools.
Perform the steps on the website that lead to the desired data, while watching the corresponding requests the site makes in the Network tab.
Try to reproduce those steps (requests) with Scrapy; see the sketch below.
See also Debugging Spiders.
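A minimal sketch of that last step, assuming the Network tab reveals a background request that returns the directions data directly. The endpoint URL here is a placeholder and must be replaced with whatever request the browser actually makes; the BaseSpider import matches the Scrapy version used in the question:

import json
from scrapy.spider import BaseSpider

class BingDirectionsSpider(BaseSpider):
    name = 'bing_directions'
    # Placeholder: copy the real URL (plus any required headers or parameters)
    # from the request observed in the browser's Network tab.
    start_urls = ["http://www.bing.com/REPLACE-WITH-OBSERVED-REQUEST"]

    def parse(self, response):
        # If the observed request returns JSON, parse it directly instead of
        # running XPath against a page that needs JavaScript to render.
        data = json.loads(response.body)
        self.log('Top-level keys: %r' % list(data))
        # ...pull the travel-time field out of `data` here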
For all scraping purposes, use BeautifulSoup.
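A minimal sketch of that approach, assuming the target markup is present in the raw HTML (it will not be for content that only appears after JavaScript runs, as the previous answer notes). It uses the bs4 package and Python 2 to match the question; the URL is just the question's start page:

import urllib2
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = urllib2.urlopen('http://www.bing.com/maps/').read()
soup = BeautifulSoup(html, 'html.parser')
span = soup.find('span', {'class': 'time'})
if span is not None:
    print span.get_text()
else:
    print 'span.time not found in the raw HTML'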