Can scrapy be used to scrape dynamic content from websites that are using AJAX?
I have recently been learning Python and am dipping my hand into building a web-scraper. It's nothing fancy at all; its only purpose is to get the data off of a betting website and have this data put into Excel.
Most of the issues are solvable and I'm having a good little mess around. However, I'm hitting a massive hurdle over one issue. If a site loads a table of horses and lists current betting prices, this information is not in any source file. The clue is that this data is live sometimes, with the numbers being updated obviously from some remote server. The HTML on my PC simply has a hole where their servers are pushing through all the interesting data that I need.
Now my experience with dynamic web content is low, so this thing is something I'm having trouble getting my head around.
I think Java or Javascript is key; this pops up often.
The scraper is simply an odds comparison engine. Some sites have APIs, but I need this for those that don't. I'm using the scrapy library with Python 2.7.
I do apologize if this question is too open-ended. In short, my question is: how can scrapy be used to scrape this dynamic data so that I can use it? So that I can scrape this betting odds data in real-time?
See also: How can I scrape a page with dynamic content (created by JavaScript) in Python? for the general case.
Here is a simple example of scrapy with an AJAX request. Let's look at the site rubin-kazan.ru. All messages are loaded with an AJAX request. My goal is to fetch these messages with all their attributes (author, date, ...):
When I analyze the source code of the page I can't see all these messages, because the web page uses AJAX technology. But I can use Firebug from Mozilla Firefox (or an equivalent tool in other browsers) to analyze the HTTP requests that generate the messages on the web page:
It doesn't reload the whole page, but only the parts of the page that contain messages. For this purpose I click an arbitrary page number at the bottom:
And I observe the HTTP request that is responsible for the message body:
After that, I analyze the headers of the request (note that I'll extract this URL from the source page's var section, see the code below):
And the form data content of the request (the HTTP method is "Post"):
And the content of the response, which is a JSON file:
Which presents all the information I'm looking for.
From now on, I must implement all this knowledge in scrapy. Let's define the spider for this purpose:
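A minimal sketch of such a spider, replaying the AJAX POST request directly; the endpoint URL, form fields, and JSON field names below are illustrative assumptions rather than the site's exact ones:

```python
import json

import scrapy


class RubiGuessItem(scrapy.Item):
    # Fields chosen for illustration; the real item would mirror the JSON.
    author = scrapy.Field()
    date = scrapy.Field()
    text = scrapy.Field()


class RubinSpider(scrapy.Spider):
    name = "rubin"
    start_urls = ["http://www.rubin-kazan.ru/guestbook/"]

    def parse(self, response):
        # In practice the AJAX URL is extracted from a "var" section of the
        # page source; it is hard-coded here only to keep the sketch short.
        ajax_url = "http://www.rubin-kazan.ru/guestbook/messages/"
        yield scrapy.FormRequest(
            ajax_url,
            formdata={"page": "1"},
            callback=self.parse_messages,
        )

    def parse_messages(self, response):
        data = json.loads(response.text)
        for msg in data.get("messages", []):
            item = RubiGuessItem()
            item["author"] = msg.get("author")
            item["date"] = msg.get("date")
            item["text"] = msg.get("text")
            yield item
```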
In the parse function I have the response for the first request. In RubiGuessItem I have the JSON file with all the information.
Webkit-based browsers (like Google Chrome or Safari) have built-in developer tools. In Chrome you can open them via Menu -> Tools -> Developer Tools. The Network tab allows you to see all information about every request and response. At the bottom of the picture you can see that I've filtered requests down to XHR - these are requests made by javascript code.

Tip: the log is cleared every time you load a page; the black dot button at the bottom of the picture will preserve the log.
After analyzing requests and responses you can simulate these requests from your web-crawler and extract valuable data. In many cases it will be easier to get your data than parsing HTML, because that data does not contain presentation logic and is formatted to be accessed by javascript code.
Firefox has a similar extension, called Firebug. Some will argue that Firebug is even more powerful, but I like the simplicity of webkit.
Many times when crawling we run into problems where content that is rendered on the page is generated with Javascript and therefore scrapy is unable to crawl for it (eg. ajax requests, jQuery craziness).
However, if you use Scrapy along with the web testing framework Selenium then we are able to crawl anything displayed in a normal web browser.
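A rough sketch of such a template crawler. The original snippet used the older Selenium RC API; this sketch swaps in the modern Selenium WebDriver API instead, and the start URL and selectors are placeholders:

```python
import scrapy
from selenium import webdriver


class SeleniumSpider(scrapy.Spider):
    name = "selenium_spider"
    start_urls = ["http://example.com/page-with-js"]  # placeholder URL

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # Scrapy has already downloaded the page once; Selenium now fetches
        # it a second time so the JavaScript gets executed (hence the note
        # below about two requests per URL).
        self.driver.get(response.url)
        rendered = scrapy.Selector(text=self.driver.page_source)
        for row in rendered.css("table.odds tr"):  # placeholder selector
            yield {"cells": row.css("td::text").getall()}

    def closed(self, reason):
        self.driver.quit()
```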
Some things to note:
You must have the Python version of Selenium RC installed for this to work, and you must have set up Selenium properly. Also this is just a template crawler. You could get much crazier and more advanced with things but I just wanted to show the basic idea. As the code stands now you will be doing two requests for any given url. One request is made by Scrapy and the other is made by Selenium. I am sure there are ways around this so that you could possibly just make Selenium do the one and only request but I did not bother to implement that and by doing two requests you get to crawl the page with Scrapy too.
This is quite powerful because now you have the entire rendered DOM available for you to crawl and you can still use all the nice crawling features in Scrapy. This will make for slower crawling of course but depending on how much you need the rendered DOM it might be worth the wait.
Reference: http://snipplr.com/view/66998/
Another solution would be to implement a download handler or download handler middleware (see the scrapy docs for more information on downloader middleware). The following is an example class using selenium with the headless phantomjs webdriver:

1) Define the class within the middlewares.py script.

2) Add the JsDownload() class to the DOWNLOADER_MIDDLEWARES variable within settings.py.

3) Integrate the HTMLResponse within your_spider.py. Decoding the response body will get you the desired output.
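A sketch of what steps 1) and 2) could look like. The JsDownload name follows the text above, while the PhantomJS call, the module path, and the priority value are assumptions (PhantomJS support also requires an older Selenium release):

```python
# middlewares.py
from scrapy.http import HtmlResponse
from selenium import webdriver


class JsDownload(object):
    """Downloader middleware that renders each page with headless PhantomJS."""

    def process_request(self, request, spider):
        driver = webdriver.PhantomJS()
        try:
            driver.get(request.url)
            body = driver.page_source.encode("utf-8")
        finally:
            driver.quit()
        # Returning a Response here short-circuits the normal download,
        # so the spider receives the already-rendered HTML.
        return HtmlResponse(request.url, body=body, encoding="utf-8", request=request)
```

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    "your_project.middlewares.JsDownload": 543,  # module path and priority are placeholders
}
```

For step 3), the spider then receives that HTMLResponse like any other response, so the rendered markup is available through response.body or the usual selectors.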
Optional addon: I wanted the ability to tell different spiders which middleware to use, so I implemented this wrapper. For the wrapper to work, every spider must declare a middleware set at minimum, and a spider opts in to a middleware by including its class in that set; a sketch of both pieces follows below.
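One way such an opt-in wrapper could look, with the decorator applied to JsDownload.process_request; the decorator name and attribute convention are assumptions:

```python
import functools


def check_spider_middleware(method):
    """Run the wrapped middleware method only for spiders that opted in."""
    @functools.wraps(method)
    def wrapper(self, request, spider):
        if self.__class__ in getattr(spider, "middleware", set()):
            return method(self, request, spider)
        return None  # fall through to the normal downloader
    return wrapper


# Minimum every spider needs for the wrapper to work:
#     middleware = set()
#
# A spider that wants JS rendering includes the middleware class:
#     middleware = {your_project.middlewares.JsDownload}
```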
Advantage:
The main advantage to implementing it this way rather than in the spider is that you only end up making one request. In A T's solution, for example, the download handler processes the request and then hands off the response to the spider. The spider then makes a brand new request in its parse_page function -- that's two requests for the same content.
I was using a custom downloader middleware, but wasn't very happy with it, as I didn't manage to make the cache work with it.
A better approach was to implement a custom download handler.
There is a working example here. It looks like this:
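A rough sketch of what such a download handler can look like, using selenium with PhantomJS; the class name and details are assumptions, and the linked working example should be preferred for the full implementation:

```python
# handlers.py
from scrapy.http import HtmlResponse
from selenium import webdriver
from twisted.internet import threads


class PhantomJSDownloadHandler(object):
    """Download handler that renders every request with headless PhantomJS."""

    def __init__(self, settings):
        self.settings = settings

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def download_request(self, request, spider):
        # Run the blocking selenium work in a thread pool so the Twisted
        # reactor keeps serving other requests.
        return threads.deferToThread(self._render, request)

    def _render(self, request):
        driver = webdriver.PhantomJS()
        try:
            driver.get(request.url)
            body = driver.page_source.encode("utf-8")
        finally:
            driver.quit()
        return HtmlResponse(request.url, body=body, encoding="utf-8", request=request)
```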
Suppose your scraper is called "scraper". If you put the mentioned code inside a file called handlers.py on the root of the "scraper" folder, then you could add to your settings.py:
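The corresponding settings.py entry could look roughly like this, assuming the handler class is named PhantomJSDownloadHandler as in the sketch above (the name is a guess, not the one from the linked example):

```python
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scraper.handlers.PhantomJSDownloadHandler",
    "https": "scraper.handlers.PhantomJSDownloadHandler",
}
```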
And voilà, the JS parsed DOM, with scrapy cache, retries, etc.
I wonder why no one has posted the solution using Scrapy only.
Check out the blog post from the Scrapy team, SCRAPING INFINITE SCROLLING PAGES. The example scrapes the http://spidyquotes.herokuapp.com/scroll website, which uses infinite scrolling.
The idea is to use Developer Tools of your browser and notice the AJAX requests, then based on that information create the requests for Scrapy.
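A sketch of that idea for the infinite-scroll demo; the JSON endpoint and field names are assumptions about what the page's AJAX requests look like, so verify them in the Network tab first:

```python
import json

import scrapy


class ScrollSpider(scrapy.Spider):
    name = "scroll"
    api_url = "http://spidyquotes.herokuapp.com/api/quotes?page={}"
    start_urls = [api_url.format(1)]

    def parse(self, response):
        data = json.loads(response.text)
        for quote in data.get("quotes", []):
            yield {
                "text": quote.get("text"),
                "author": quote.get("author", {}).get("name"),
            }
        # Keep requesting pages while the API reports more data.
        if data.get("has_next"):
            next_page = data.get("page", 0) + 1
            yield scrapy.Request(self.api_url.format(next_page))
```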
The data is generated from an external URL, an API that returns the HTML response to a POST request.
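A small sketch of that idea, calling an external API with a POST request and parsing the HTML it returns; the endpoint and form fields are invented for illustration:

```python
import scrapy


class ApiPostSpider(scrapy.Spider):
    name = "api_post"

    def start_requests(self):
        yield scrapy.FormRequest(
            "https://example.com/api/odds",   # hypothetical API endpoint
            formdata={"market": "horses"},    # hypothetical form data
            callback=self.parse_api,
        )

    def parse_api(self, response):
        # The API answers with an HTML fragment, so normal selectors apply.
        for row in response.css("tr"):
            yield {"cells": row.css("td::text").getall()}
```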
There are a few more modern alternatives in 2022 that I think should be mentioned, and I would like to list some pros and cons for the methods discussed in the more popular answers to this question.
The top answer and several others discuss using the browser's dev tools or packet-capturing software to try to identify patterns in the response URLs, and re-construct them to use as scrapy.Requests.

Pros: This is still the best option in my opinion, and when it is available it is quick and often simpler than even the traditional approach, i.e. extracting content from the HTML using xpath and css selectors.

Cons: Unfortunately this is only available on a fraction of dynamic sites, and frequently websites have security measures in place that make using this strategy difficult.
Using Selenium Webdriver is the other approach mentioned a lot in previous answers.

Pros: It's easy to implement and integrate into the scrapy workflow. Additionally, there are a ton of examples, and it requires very little configuration if you use 3rd-party extensions like scrapy-selenium.

Cons: It's slow! One of scrapy's key features is its asynchronous workflow, which makes it easy to crawl dozens or even hundreds of pages in seconds. Using selenium cuts this down significantly.
There are two new methods that are definitely worth consideration: scrapy-splash and scrapy-playwright.

scrapy-splash: installed from pypi with pip3 install scrapy-splash, while splash needs to run in its own process and is easiest to run from a docker container.

scrapy-playwright: works much like selenium, but without the crippling decrease in speed that comes with using selenium. Playwright has no issues fitting into the asynchronous scrapy workflow, making sending requests just as quick as using scrapy alone. It is also much easier to install and integrate than selenium. The scrapy-playwright plugin is maintained by the developers of scrapy as well, and after installing via pypi with pip3 install scrapy-playwright, getting set up is as easy as running playwright install in the terminal.

More details and many examples can be found at each of the plugin's github pages: https://github.com/scrapy-plugins/scrapy-playwright and https://github.com/scrapy-plugins/scrapy-splash.
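For scrapy-playwright, getting a rendered page is roughly as follows; the handler and reactor settings follow the plugin's documented setup, while the spider and target site are just an illustration:

```python
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

```python
# spider
import scrapy


class JsQuotesSpider(scrapy.Spider):
    name = "js_quotes"

    def start_requests(self):
        yield scrapy.Request(
            "http://quotes.toscrape.com/js/",  # a JavaScript-rendered demo page
            meta={"playwright": True},         # ask playwright to render it
        )

    def parse(self, response):
        for text in response.css("div.quote span.text::text").getall():
            yield {"quote": text}
```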
P.S. Both projects tend to work better in a Linux environment in my experience. For Windows users I recommend using them with the Windows Subsystem for Linux (WSL).
Yes, Scrapy can scrape dynamic websites, i.e. websites that are rendered through JavaScript.
There are two approaches to scrape these kinds of websites.
You can use splash to render the Javascript code and then parse the rendered HTML. You can find the doc and project here: Scrapy splash, git.
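A minimal scrapy-splash sketch, assuming a Splash instance is already running (for example via docker) and the scrapy-splash middleware is enabled in settings.py; the target site is just a JavaScript-rendered demo page:

```python
import scrapy
from scrapy_splash import SplashRequest


class SplashQuotesSpider(scrapy.Spider):
    name = "splash_quotes"

    def start_requests(self):
        # Render the page in Splash before it reaches the spider.
        yield SplashRequest(
            "http://quotes.toscrape.com/js/",
            callback=self.parse,
            args={"wait": 0.5},  # give the page's JavaScript time to run
        )

    def parse(self, response):
        for text in response.css("div.quote span.text::text").getall():
            yield {"quote": text}
```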
As previously stated, by monitoring the network calls you can find the API call that fetches the data, and mocking that call in your scrapy spider might help you get the desired data.
I handle the ajax request by using Selenium and the Firefox web driver. It is not that fast if you need the crawler as a daemon, but much better than any manual solution.