使用 Scrapy for Python 从 html 路径中提取数据

发布于 2024-11-29 22:18:31 字数 2919 浏览 1 评论 0原文

我的项目概述:

我正在尝试在 python 2.6 中创建一个简单的脚本,该脚本将从 Bing 地图获取交通时间数据。我使用 Scrapy 库模块包 (scrapy.org/) 来爬行每个网站并从 Bing 地图中提取数据。

http://dl.dropbox.com/u/15387358/bing%20info.jpg

上面的图片显示了我想要的。 (现在突出显示的数据部分,但最终也需要下面的时间。)

我首先做了一个测试,看看起始网址是否会通过。然后使用输出日志打印 url 的输出(如果成功通过)。一旦成功,我的下一步就是尝试从网页中提取我需要的数据。

我一直在使用 Firebug、XPather 和 XPath Firefox Add-ons 来查找我想要提取的数据的 html 路径。此链接对于指导我正确编码路径非常有帮助(doc.scrapy.org/topics/selectors.html)。通过查看 firebug,这就是我想要提取的内容...

<span class="time">22 min</span>

并且 XPather 将其显示为该特定项目的路径。 ...

/div[@id='TaskHost_DrivingDirectionsSummaryContainer']/div[1]/span[3]

当我使用上面给定的路径在 cmd 中运行程序时,提取的数据打印为 [ ] ,当我将 /class='time' 添加到 span 末尾时,数据打印输出为 [u'False' ]。当仔细查看 firebug 的 DOM 窗口时,我注意到 class="time" 对于 get isID 为 false,并且 childNode 保存了我需要的数据。如何从childNode中提取数据?

http://dl.dropbox.com/u/15387358/Firebug%20info.jpg

下面是我到目前为止的代码

from scrapy import log # This module is useful for printing out debug information
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector, XPathSelectorList, XmlXPathSelector
import html5lib

    class BingSpider(BaseSpider):
        name = 'bing.com/maps'
        allowed_domains = ["bing.com/maps"]
        start_urls = [
                "http://www.bing.com/maps/?FORM=Z9LH4#Y3A9NDAuNjM2MDAxNTg1OTk5OTh+LTc0LjkxMTAwMzExMiZsdmw9OCZzdHk9ciZydHA9cG9zLjQwLjcxNDU0OF8tNzQuMDA3MTI1X05ldyUyMFlvcmslMkMlMjBOWV9fX2VffnBvcy40MC43MzE5N18tNzQuMTc0MTg1MDAwMDAwMDRfTmV3YXJrJTJDJTIwTkpfX19lXyZtb2RlPUQmcnRvcD0wfjB+MH4="
                     ]

    def parse(self, response):
        self.log('A response from %s just arrived!' % response.url)
        x = HtmlXPathSelector(response)
        time=x.select("//div[@id='TaskHost_DrivingDirectionsSummaryContainer']/div[1]/span[3]").extract()
        print time

CMD输出

2011-09-05 17:43:01-0400 [scrapy] DEBUG: Enabled item pipelines:
2011-09-05 17:43:01-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:602
3
2011-09-05 17:43:01-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-09-05 17:43:01-0400 [bing.com] INFO: Spider opened
2011-09-05 17:43:02-0400 [bing.com] DEBUG: Crawled (200) <GET http://www.bing.co
m/maps/#Y3A9NDAuNzIzMjYwOTYzMTUwMDl+LTc0LjA5MDY1NSZsdmw9MTImc3R5PXImcnRwPXBvcy40
MC43MzE5N18tNzQuMTc0MTg1X05ld2FyayUyQyUyME5KX19fZV9+cG9zLjQwLjcxNDU0OF8tNzQuMDA3
MTI0OTk5OTk5OTdfTmV3JTIwWW9yayUyQyUyME5ZX19fZV8mbW9kZT1EJnJ0b3A9MH4wfjB+> (refer
er: None)
2011-09-05 17:43:02-0400 [bing.com] DEBUG: A response from http://www.bing.com/m
aps/ just arrived!
[]
2011-09-05 17:43:02-0400 [bing.com] INFO: Closing spider (finished)
2011-09-05 17:43:02-0400 [bing.com] INFO: Spider closed (finished)

Overview of my project:

I'm trying to create a simple script in python 2.6 that will get traffic time data from Bing Maps. The Scrapy library module package (scrapy.org/) is what I'm using to crawl through each website and extract data from Bing maps.

http://dl.dropbox.com/u/15387358/bing%20info.jpg

The picture above shows what i want. (the highlighted data part for now but ultimately the time below will be needed too.)

I first did a test to see if the start url would go though. and then used an output log to print the output of the url if it successfully went through. Once that worked, my next step was to try and extract the data i need from the webpage.

I have been using Firebug, XPather, and XPath Firefox Add-ons to find the html path of the data I want to extract. This link has been pretty helpful in guiding me in correctly coding the path's (doc.scrapy.org/topics/selectors.html). From looking at firebug, this is what i want to extract...

<span class="time">22 min</span>

and XPather shows this as the path for this particular item. ...

/div[@id='TaskHost_DrivingDirectionsSummaryContainer']/div[1]/span[3]

When i run the program in cmd with the given path above, the extracted data prints out as [ ] and when i add /class='time' to the end of span, the data print out is [u'False']. When looking at a bit closer in the DOM window of firebug, I noticed that class="time" is false for get isID and the the the childNode held the data i needed. How do i extract the data from the childNode?

http://dl.dropbox.com/u/15387358/Firebug%20info.jpg

Below is my code so far

from scrapy import log # This module is useful for printing out debug information
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector, XPathSelectorList, XmlXPathSelector
import html5lib

    class BingSpider(BaseSpider):
        name = 'bing.com/maps'
        allowed_domains = ["bing.com/maps"]
        start_urls = [
                "http://www.bing.com/maps/?FORM=Z9LH4#Y3A9NDAuNjM2MDAxNTg1OTk5OTh+LTc0LjkxMTAwMzExMiZsdmw9OCZzdHk9ciZydHA9cG9zLjQwLjcxNDU0OF8tNzQuMDA3MTI1X05ldyUyMFlvcmslMkMlMjBOWV9fX2VffnBvcy40MC43MzE5N18tNzQuMTc0MTg1MDAwMDAwMDRfTmV3YXJrJTJDJTIwTkpfX19lXyZtb2RlPUQmcnRvcD0wfjB+MH4="
                     ]

    def parse(self, response):
        self.log('A response from %s just arrived!' % response.url)
        x = HtmlXPathSelector(response)
        time=x.select("//div[@id='TaskHost_DrivingDirectionsSummaryContainer']/div[1]/span[3]").extract()
        print time

CMD output

2011-09-05 17:43:01-0400 [scrapy] DEBUG: Enabled item pipelines:
2011-09-05 17:43:01-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:602
3
2011-09-05 17:43:01-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-09-05 17:43:01-0400 [bing.com] INFO: Spider opened
2011-09-05 17:43:02-0400 [bing.com] DEBUG: Crawled (200) <GET http://www.bing.co
m/maps/#Y3A9NDAuNzIzMjYwOTYzMTUwMDl+LTc0LjA5MDY1NSZsdmw9MTImc3R5PXImcnRwPXBvcy40
MC43MzE5N18tNzQuMTc0MTg1X05ld2FyayUyQyUyME5KX19fZV9+cG9zLjQwLjcxNDU0OF8tNzQuMDA3
MTI0OTk5OTk5OTdfTmV3JTIwWW9yayUyQyUyME5ZX19fZV8mbW9kZT1EJnJ0b3A9MH4wfjB+> (refer
er: None)
2011-09-05 17:43:02-0400 [bing.com] DEBUG: A response from http://www.bing.com/m
aps/ just arrived!
[]
2011-09-05 17:43:02-0400 [bing.com] INFO: Closing spider (finished)
2011-09-05 17:43:02-0400 [bing.com] INFO: Spider closed (finished)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

回首观望 2024-12-06 22:18:31

当网站大量使用 JavaScript 时,您不能信任运行时获得的 XPath,因为那是 JavaScript 代码运行后获得的 XPath,而 Scrapy 不运行 JavaScript 代码。

您应该:

  1. 打开网络浏览器开发人员工具的“网络”选项卡。

  2. 在网站上执行获取所需数据的步骤,同时您在“网络”选项卡上观看网站执行的相应请求。

  3. 尝试使用 Scrapy 重现这些步骤(请求)。

另请参阅调试蜘蛛

When a website uses JavaScript in a significant way, you cannot trust the XPath you get at runtime, because that is the XPath you get after the JavaScript code has run, and Scrapy does not run JavaScript code.

You should:

  1. Open the Network tab of the developer tools of your web browser.

  2. Perform on the website the steps to get to the desired data, while you watch the corresponding requests performed by the website on the Network tab.

  3. Try to reproduce those steps (requests) with Scrapy.

See also Debugging Spiders.

无尽的现实 2024-12-06 22:18:31

对于所有抓取目的,请使用 BeautifulSoup

soup.find('span', class="time")

For all scrapping purposes use BeautifulSoup

soup.find('span', class="time")
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文