Getting non-HTML data from a website using Python
I'm trying to get the current contract prices on this page to a string: http://www.cmegroup.com/trading/equity-index/us-index/e-mini-sandp500.html
I would really like a Python 2.6 solution.
It was easy to get the page HTML using urllib, but it seems like this number is live and not in the HTML. I inspected the element in Chrome and it's some td class thing.
But I don't know how to get at this with Python. I tried BeautifulSoup (but after several attempts gave up on getting a tar.gz to work on my Windows x64 system), and then ElementTree, but really my programming interest is data analysis. I'm not a website designer and don't really want to become one, so it's all kind of a foreign language. Is this live price XML?
Any assistance gratefully received. Ideally a simple to install module and some actual code, but all hints and tips very welcome.
Comments (3)
It looks like the numbers in the table are filled in by JavaScript, so just fetching the HTML with urllib or another library won't be enough, since they don't run the JavaScript. You'll need to use a library like PyQt to simulate the browser rendering the page and executing the JS that fills in the numbers, then scrape the resulting HTML.
See this blog post on working with PyQt: http://blog.motane.lu/2009/07/07/downloading-a-pages-content-with-python-and-webkit/
If you look at that website with something like Firebug, you can see the AJAX calls it's making. For instance, the initial values are being filled in with an AJAX call (at least for me) to:
http://www.cmegroup.com/CmeWS/md/MDServer/V1/Venue/G/Exchange/XCME/FOI/FUT/Product/ES?currentTime=1292780678142&contractCDs=,ESH1,ESM1,ESU1,ESZ1,ESH2,ESH1,ESM1,ESU1,ESZ1,ESH2
This returns a JSON response, which is then parsed by JavaScript to fill in the table. It would be pretty simple to do that yourself with urllib and then use simplejson to parse the response.
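As a rough sketch of that approach: the code below rebuilds the request URL shown above and decodes the JSON body. It uses the standard-library `json` module (available since Python 2.6, with the same `loads` interface as simplejson) instead of simplejson. The helper names `build_url`, `parse_json`, and `fetch_quotes` are made up for illustration, and treating `currentTime` as a millisecond Unix timestamp is an assumption based on the captured value. The endpoint may have changed or disappeared since this was written.

```python
import json
import time

try:
    from urllib2 import urlopen  # Python 2.6 / 2.7
except ImportError:
    from urllib.request import urlopen  # Python 3

# Endpoint copied from the AJAX call captured in the answer above.
BASE_URL = ("http://www.cmegroup.com/CmeWS/md/MDServer/V1/Venue/G/"
            "Exchange/XCME/FOI/FUT/Product/ES")


def build_url(contract_codes, now_ms=None):
    """Rebuild the query string seen in the captured request."""
    if now_ms is None:
        # Assumption: currentTime is a millisecond Unix timestamp.
        now_ms = int(time.time() * 1000)
    # The captured URL has a comma before the first contract code.
    return "%s?currentTime=%d&contractCDs=,%s" % (
        BASE_URL, now_ms, ",".join(contract_codes))


def parse_json(raw):
    """Decode a response body (bytes or text) into Python objects."""
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    return json.loads(raw)


def fetch_quotes(contract_codes):
    """Fetch the JSON document for the given contract codes."""
    response = urlopen(build_url(contract_codes))
    try:
        return parse_json(response.read())
    finally:
        response.close()
```

Calling `fetch_quotes(["ESH1", "ESM1"])` and printing the result is probably the quickest way to see which keys hold the prices, since the layout of the response is not shown here.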
Also, you should read this disclaimer very carefully. What you are trying to do is probably not cool with the owners of the web-site.
It's hard to know what to tell you without knowing where the number is coming from. It could also be PHP or ASP, so you are going to have to figure out which language the number is in.