Download the HTML of a URL with Python, but with JavaScript enabled

Posted 2024-11-19 13:22:30

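I am trying to download this page so that I can scrape the search results. However, when I download the page and try to process it with BeautifulSoup, I find that parts of the page (for example, the search results) are missing because the site detects that JavaScript is not enabled.

Is there a way to download the HTML of a URL with JavaScript enabled in Python?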

锦爱 2024-11-26 13:22:30

@kstruct: My preferred way, instead of writing a full browser with QtWebKit and PyQt4, is to use one already written. There's the PhantomJS (C++) project, or PyPhantomJS (Python). Basically the Python one is QtWebKit and Python.

They're both headless browsers that you control directly from JavaScript. The Python version also has a plug-in system that lets you extend the core with additional functionality should you need it.

Here's an example script for PyPhantomJS (with the saveToFile plugin):

// create new webpage
var page = new WebPage();

// open page, set callback
page.open('url', function(status) {
    // exit if page couldn't load
    if (status !== 'success') {
        console.log('FAIL to load!');
        phantom.exit(1);
        return; // skip the save below after a failed load
    }

    // save page content to file
    phantom.saveToFile(page.content, 'myfile.txt');
    phantom.exit();
});

Useful links:
API reference | How to write plugins

花开雨落又逢春i 2024-11-26 13:22:30

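I would consider using the QtWebKit module from the PyQt4 library. It runs the JavaScript on the page, and once that has finished you can save the HTML using, I believe, the standard methods.

Otherwise, Selenium is the way to go. It lets you drive a web browser from a Python script to load the page and then extract the full DOM content.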
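A minimal sketch of the Selenium route, assuming Selenium 4 and a local Chrome installation; neither is specified in this thread. The helper names `fetch_rendered_html` and `save_rendered_page`, the fixed `settle_seconds` pause, and the output file name are all hypothetical choices made for illustration.

```python
import time


def fetch_rendered_html(driver, url, settle_seconds=2.0):
    """Load `url` in a browser session and return the HTML of the live DOM.

    `driver` is any object with Selenium's WebDriver interface
    (a `get()` method and a `page_source` attribute).
    """
    driver.get(url)             # navigate and wait for the page to load
    time.sleep(settle_seconds)  # crude pause for content that scripts add afterwards
    return driver.page_source   # serialized DOM after JavaScript has run


def save_rendered_page(url, path="page.html"):
    """Assumed usage: headless Chrome via Selenium 4 (not confirmed by the thread)."""
    # Imported here so the helper above does not require Selenium to be installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # Chrome flag: run without a window
    driver = webdriver.Chrome(options=options)
    try:
        html = fetch_rendered_html(driver, url)
    finally:
        driver.quit()  # always close the browser, even if loading failed

    with open(path, "w", encoding="utf-8") as fh:
        fh.write(html)
    return html
```

Calling `save_rendered_page(url)` with the address of the search page should write the rendered HTML to `page.html`, which can then be handed to BeautifulSoup as usual. A fixed sleep is the simplest way to wait; an explicit wait for a specific element would likely be more reliable on slow pages.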

你的往事 2024-11-26 13:22:30

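Once you want JavaScript enabled, what you are asking for is very close to a browser. You could use Jython together with HtmlUnit, a Java-based headless browser. It is fairly fast but not very stable (because it imitates a browser rather than being a real one). I think the quickest and easiest way is to use Selenium (IDE, or preferably RC). Selenium lets you control your favorite browser (FF, IE, Chrome...). Although it is intended for testing, it may work for you. It is stable and fairly fast (I think it is even faster than HtmlUnit).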

陪你到最终 2024-11-26 13:22:30

You can use htql at http://htql.net.

import time

import htql

browser = htql.Browser(2)
page, url = browser.goUrl('http://docs.python.org/search.html?q=chdir&check_keywords=yes&area=default')
time.sleep(2)  # give the page's JavaScript time to run
page, url = browser.getUpdatedPage()

BTW, you will need to install IRobot from http://irobotsoft.com/
