Click a button in Scrapy

Posted on 2024-11-24 03:13:06

I'm using Scrapy to crawl a webpage. Some of the information I need only appears when you click a certain button (after the click it is, of course, also present in the HTML).

I found out that Scrapy can handle forms (like logins) as shown here. But the problem is that there is no form to fill out, so it's not exactly what I need.

How can I simply click a button, which then shows the information I need?

Do I have to use an external library like mechanize or lxml?

Comments (4)

羁客 2024-12-01 03:13:06

Scrapy cannot interpret JavaScript.

If you absolutely must interact with the JavaScript on the page, use Selenium.

If using Scrapy, the solution to the problem depends on what the button is doing.

If it's just showing content that was previously hidden, you can scrape the data without a problem. It doesn't matter that the content isn't visible in the browser; it is still in the HTML.
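
As a minimal sketch of the hidden-content case, the snippet below uses only Python's standard library, not Scrapy itself, to show that text inside an element hidden with CSS is still present in the raw HTML. The markup, the id `details`, and the names `HiddenTextParser` and `extract_text` are made up for illustration; in a real spider you would run a selector on the response instead.

```python
from html.parser import HTMLParser

# Hypothetical page: the button only toggles the visibility of #details.
PAGE = """
<html><body>
  <button id="show">Show details</button>
  <div id="details" style="display:none">Price: 42 EUR</div>
</body></html>
"""


class HiddenTextParser(HTMLParser):
    """Collects the text inside the element with a given id.

    Sketch only: void tags such as <br> inside the target are not handled.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0  # greater than 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_text(html, target_id):
    """Return the text of the element with target_id, hidden or not."""
    parser = HiddenTextParser(target_id)
    parser.feed(html)
    return "".join(parser.chunks).strip()


print(extract_text(PAGE, "details"))
```

The `display:none` style plays no role in parsing, which is why no click is needed in this case.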

If it's fetching the content dynamically via AJAX when the button is pressed, the best thing to do is to inspect the HTTP request that goes out when you press the button, using a tool like Firebug or your browser's developer tools. You can then request the data directly from that URL.
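
The AJAX case can be sketched in a similar way. The example below, using only the standard library, builds the kind of request a button's JavaScript might send and decodes a JSON reply. The endpoint `https://www.example.org/api/items`, the `page` parameter, the `X-Requested-With` header, and the `items`/`name` response shape are all assumptions; copy the real values from the request you observe in the network panel.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, as it might appear in the browser's network panel.
AJAX_URL = "https://www.example.org/api/items"


def build_ajax_request(page):
    """Build the GET request the button's JavaScript is assumed to send."""
    query = urlencode({"page": page})
    return Request(
        f"{AJAX_URL}?{query}",
        headers={
            # Some servers only answer requests that identify as AJAX calls.
            "X-Requested-With": "XMLHttpRequest",
            "Accept": "application/json",
        },
    )


def parse_items(body):
    """Extract item names from a JSON body shaped like {"items": [{"name": ...}]}."""
    payload = json.loads(body)
    return [item["name"] for item in payload.get("items", [])]


request = build_ajax_request(page=2)
print(request.full_url)

# Sample reply standing in for the server's response (no network access here).
sample_body = '{"items": [{"name": "first"}, {"name": "second"}]}'
print(parse_items(sample_body))
```

In a Scrapy spider the equivalent step would be to yield a `scrapy.Request` for that URL and parse `response.text` in the callback.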

Do I have to use an external library like mechanize or lxml?

If you want to interpret JavaScript, yes, you need to use a different library, although neither of those two fits the bill. Neither of them knows anything about JavaScript. Selenium is the way to go.

If you can give the URL of the page you're working on scraping I can take a look.

与他有关 2024-12-01 03:13:06

Selenium provides a very nice solution. Here is an example (install it with pip install -U selenium):

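from scrapy import Request, Spider
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By


class northshoreSpider(Spider):
    name = 'xxx'
    allowed_domains = ['www.example.org']
    start_urls = ['https://www.example.org']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Opens a real Firefox window; requires Firefox and its driver.
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get('https://www.example.org/abc')

        while True:
            try:
                # find_element(By.XPATH, ...) replaces find_element_by_xpath,
                # which was removed in Selenium 4.
                next_button = self.driver.find_element(By.XPATH, '//*[@id="BTN_NEXT"]')
                # Placeholder URL. Scrapy's duplicate filter drops repeated
                # requests for the same URL, so use a distinct URL per page.
                url = 'http://www.example.org/abcd'
                yield Request(url, callback=self.parse2)
                next_button.click()
            except WebDriverException:
                # No "next" button left, or it can no longer be clicked.
                break

        # quit() ends the whole browser session, not just the current window.
        self.driver.quit()

    def parse2(self, response):
        print('you are here!')
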
余生一个溪 2024-12-01 03:13:06

Although this is an old thread, I've found Helium (built on top of Selenium) quite useful for this purpose, and far simpler than using Selenium directly. It would look something like the following:

from helium import *

start_firefox('your_url')
s = S('path_to_your_button')
click(s)
...

昔梦 2024-12-01 03:13:06

To run JavaScript properly and fully you need a full browser engine, and this is possible only with tools such as Watir, WatiN, or Selenium.
