Web scraping in a Google Chrome extension (JavaScript + Chrome APIs)

Posted on 2025-02-07 16:09:13

What are the best options for scraping a page that is not in a currently open tab from within a Google Chrome extension, using JavaScript and whatever other technologies are available? Other JavaScript libraries are also accepted.

The important thing is to mask the scraping so that it behaves like a normal web request: no indications of AJAX or XMLHttpRequest, such as the X-Requested-With: XMLHttpRequest or Origin headers.

The scraped content must be accessible from JavaScript for further manipulation and presentation within the extension, most probably as a string.

Are there any hooks in any WebKit/Chrome-specific APIs that can be used to make a normal web request and get the results for manipulation?

var pageContent = getPageContent(url); // TODO: Implement
var items = $(pageContent).find('.item');
// Display items with further selections

Bonus points for making this work from a local file on disk, for initial debugging. But if that is the only thing stopping a solution, then disregard the bonus points.

Comments (7)

怼怹恏 2025-02-14 16:09:13

Attempt to use XHR2 responseType = "document" and fall back on (new DOMParser).parseFromString(responseText, getResponseHeader("Content-Type")) with my text/html patch. See https://gist.github.com/1138724 for an example of how I detect responseType = "document" support (synchronously checking response === null on an object URL created from a text/html blob).
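
A minimal sketch of that idea, not the code from the gist. The names fetchDocument, XhrCtor and ParserCtor are made up for illustration, and the constructors are injectable only so the logic can be tried outside a browser. It assumes an engine that does not support responseType = "document" simply ignores the assignment, and it strips any "; charset=..." suffix because DOMParser only accepts a bare MIME type.

```javascript
// Fetch a URL and hand a parsed Document to onDone(err, doc).
// fetchDocument, XhrCtor and ParserCtor are illustrative names (assumptions).
function fetchDocument(url, onDone, XhrCtor, ParserCtor) {
  var Xhr = XhrCtor || XMLHttpRequest;
  var xhr = new Xhr();
  xhr.open('GET', url, true);
  // XHR2: ask the browser to parse the body. Engines without support are
  // assumed to ignore this assignment and leave responseType unchanged.
  xhr.responseType = 'document';

  xhr.onload = function () {
    if (xhr.responseType === 'document') {
      // Supported path: response is a Document, or null if parsing failed.
      onDone(xhr.response ? null : new Error('Response was not parseable'), xhr.response);
      return;
    }
    // Fallback path: parse the raw text ourselves.
    var Parser = ParserCtor || DOMParser;
    var header = xhr.getResponseHeader('Content-Type') || 'text/html';
    var mimeType = header.split(';')[0].trim(); // drop "; charset=..."
    onDone(null, new Parser().parseFromString(xhr.responseText, mimeType));
  };
  xhr.onerror = function () {
    onDone(new Error('Network error while fetching ' + url));
  };
  xhr.send();
}
```

In an extension page this could be called as fetchDocument(url, function (err, doc) { var items = $(doc).find('.item'); }), provided the manifest grants host permissions for the target site.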

Use the Chrome webRequest API to hide headers such as X-Requested-With.
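
A rough sketch of the header hiding, assuming a Manifest V2 extension with the "webRequest" and "webRequestBlocking" permissions plus host permissions for the target site. The header list, the helper name stripHeaders and the example.com URL pattern are placeholders of mine. Recent Chrome versions may also need "extraHeaders" in the last argument before headers like Origin can be touched, and Manifest V3 restricts blocking webRequest listeners, so treat this as a starting point only.

```javascript
// Header names to drop, compared case-insensitively (this list is an assumption).
var HEADERS_TO_STRIP = ['x-requested-with', 'origin'];

// Pure helper: returns a copy of the header list without the unwanted names.
function stripHeaders(requestHeaders, namesToStrip) {
  return requestHeaders.filter(function (header) {
    return namesToStrip.indexOf(header.name.toLowerCase()) === -1;
  });
}

// Register the listener only where the webRequest API is actually available.
if (typeof chrome !== 'undefined' && chrome.webRequest) {
  chrome.webRequest.onBeforeSendHeaders.addListener(
    function (details) {
      // Returning requestHeaders replaces the headers that get sent.
      return { requestHeaders: stripHeaders(details.requestHeaders, HEADERS_TO_STRIP) };
    },
    { urls: ['https://example.com/*'] }, // placeholder URL pattern
    ['blocking', 'requestHeaders']
  );
}
```

Keeping the filtering in a plain function means it can be checked without loading the extension.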

盗琴音 2025-02-14 16:09:13

A lot of tools have been released since this question was asked.

artoo.js is one of them. It's a piece of JavaScript code meant to be run in your browser's console to provide you with some scraping utilities. It can also be used as a Chrome extension.

是伱的 2025-02-14 16:09:13

If you are fine looking at something beyond a Google Chrome plugin, look at PhantomJS, which uses Qt WebKit in the background and runs just like a browser, including making AJAX requests. You can call it a headless browser, as it doesn't display the output on a screen and can quietly work in the background while you are doing other stuff. If you want, you can export images or PDFs of the pages it fetches. It provides a JS interface for loading pages, clicking on buttons, etc., much like you have in a browser. You can also inject custom JS, for example jQuery, into any of the pages you want to scrape and use it to access the DOM and export the desired data. As it uses WebKit, its rendering behaviour closely matches Google Chrome's.
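
For a feel of what that looks like, here is a small sketch of a PhantomJS script. The URL is a placeholder, the '.item' selector is borrowed from the question, and collectItemTexts is an illustrative name. It uses plain DOM calls instead of injecting jQuery to keep the example short. Note that PhantomJS is no longer actively maintained, so this is mostly of historical interest.

```javascript
// Runs inside the loaded page (PhantomJS serializes the function), so it may
// only rely on page globals such as `document`.
function collectItemTexts() {
  var nodes = document.querySelectorAll('.item');
  return Array.prototype.map.call(nodes, function (node) {
    return node.textContent;
  });
}

// Only drive a page when this file is executed by the phantomjs binary.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  page.open('https://example.com/', function (status) { // placeholder URL
    if (status === 'success') {
      // evaluate() runs the function in the page and returns its result.
      console.log(JSON.stringify(page.evaluate(collectItemTexts)));
    } else {
      console.log('Failed to load page');
    }
    phantom.exit();
  });
}
```

Saved as, say, scrape.js, it would be run with: phantomjs scrape.js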

Another option would be to use Aptana Jaxer, which is based on the Mozilla engine and is a very good concept in itself. It can be used as a simple scraping tool as well.

梦中的蝴蝶 2025-02-14 16:09:13

Web scraping is kind of convoluted in a Chrome Extension. Some points:

  • You run content scripts for access to the DOM.
  • Background pages (one per browser) can exchange messages with content scripts. That is, you can run a content script that sets up an RPC endpoint and fires a specified callback in the context of the background page as a response.
  • You can execute content scripts in all frames of a webpage, then stitch the document tree (composed of the 1..N frames that the page contains) together.
  • As S.K. suggested, your background page can send the data as an XMLHttpRequest to some kind of lightweight HTTP server that listens locally.
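
The content script / background page round trip from the points above might look roughly like this. The message shape ({ type: 'scrape', selector: ... }) and the function names are my own invention, and in a real extension the two halves would live in separate files (content script and background page); they are shown together here only for brevity.

```javascript
// --- content script half: has access to the page DOM ---

// Collect the text of every element matching the selector in a document.
function collectItems(doc, selector) {
  var nodes = doc.querySelectorAll(selector);
  return Array.prototype.map.call(nodes, function (node) {
    return node.textContent;
  });
}

// Answer 'scrape' messages from the background page (hypothetical message type).
if (typeof chrome !== 'undefined' && chrome.runtime && chrome.runtime.onMessage) {
  chrome.runtime.onMessage.addListener(function (message, sender, sendResponse) {
    if (message && message.type === 'scrape') {
      sendResponse({ items: collectItems(document, message.selector) });
    }
  });
}

// --- background page half: asks a tab's content script for its items ---
function requestItems(tabId, selector, onItems) {
  chrome.tabs.sendMessage(tabId, { type: 'scrape', selector: selector }, function (response) {
    // response is undefined when no content script answered in that tab.
    onItems(response ? response.items : []);
  });
}
```

To cover all frames, the content script would be declared with "all_frames": true in the manifest, and each frame would answer for its own document.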

万水千山粽是情ミ 2025-02-14 16:09:13

I'm not sure it's entirely possible with just JavaScript, but if you can set up a dedicated PHP script for your extension that uses cURL to fetch the HTML for a page, the PHP script could scrape the page for you and your extension could read it in through an AJAX request.

The actual page being scraped wouldn't know it's an AJAX request, however, because it is being accessed through cURL.
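
On the extension side, that could be as small as the sketch below. The proxy endpoint and its "url" query parameter are hypothetical; the PHP script would have to be written to accept them.

```javascript
// Build the proxy request URL; the "url" parameter name is an assumption
// about how the (hypothetical) PHP script expects its input.
function buildProxyUrl(proxyEndpoint, targetUrl) {
  return proxyEndpoint + '?url=' + encodeURIComponent(targetUrl);
}

// Ask the proxy for the page and pass the HTML string to onDone(err, html).
function fetchViaProxy(proxyEndpoint, targetUrl, onDone) {
  var xhr = new XMLHttpRequest();
  xhr.open('GET', buildProxyUrl(proxyEndpoint, targetUrl), true);
  xhr.onload = function () {
    onDone(null, xhr.responseText);
  };
  xhr.onerror = function () {
    onDone(new Error('Proxy request failed'));
  };
  xhr.send();
}
```

The returned string can then be fed to $(html).find('.item') as in the question.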

给不了的爱 2025-02-14 16:09:13

I think you can start from this example.

So basically you can try using an extension + plugin combination. The extension would have access to the DOM (including the plugin) and drive the process, and the plugin would send the actual HTTP requests.

I can recommend using Firebreath as a cross-platform Chrome/Firefox plugin platform; in particular, take a look at this example: Firebreath - Making+HTTP+Requests+with+SimpleStreamsHelper

日记撕了你也走了 2025-02-14 16:09:13

Couldn't you just do some iframe trickery? If you load the URL into a dedicated frame, you have the DOM in a document object and can do your jQuery selections, no?
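
Something along these lines, as a sketch only. loadInFrame is a made-up helper name and the doc parameter exists just so the function can be tried outside a browser. One caveat: contentDocument is null when the frame is cross-origin, and a site can refuse framing altogether (for example via X-Frame-Options), so for other people's sites you would probably still need a content script injected into the frame.

```javascript
// Load a URL into a hidden iframe and call onLoaded(frameDocument, frame)
// once it has finished loading. frameDocument is null for cross-origin frames.
function loadInFrame(url, onLoaded, doc) {
  var hostDoc = doc || document;
  var frame = hostDoc.createElement('iframe');
  frame.style.display = 'none'; // keep the scraping frame out of sight
  frame.onload = function () {
    onLoaded(frame.contentDocument || null, frame);
  };
  frame.src = url;
  hostDoc.body.appendChild(frame);
  return frame;
}
```

Usage would be something like loadInFrame(url, function (frameDoc) { if (frameDoc) { var items = $(frameDoc).find('.item'); } });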
