Web Scraping in a Google Chrome Extension (JavaScript + Chrome APIs)
What are the best options for performing web scraping of a tab that is not currently open from within a Google Chrome extension, using JavaScript and whatever other technologies are available? Other JavaScript libraries are also accepted.
The important thing is to mask the scraping so that it behaves like a normal web request: no indications of AJAX or XMLHttpRequest, such as the X-Requested-With: XMLHttpRequest or Origin headers.
The scraped content must be accessible from JavaScript for further manipulation and presentation within the extension, most probably as a string.
Are there any hooks in any WebKit/Chrome-specific APIs that can be used to make a normal web request and get the results for manipulation?
var pageContent = getPageContent(url); // TODO: Implement
var items = $(pageContent).find('.item');
// Display items with further selections
Bonus points for making this work from a local file on disk, for initial debugging. But if that is the only thing stopping a solution, then disregard the bonus points.
Attempt to use XHR2 responseType = "document" and fall back on (new DOMParser).parseFromString(responseText, getResponseHeader("Content-Type")) with my text/html patch. See https://gist.github.com/1138724 for an example of how I detect responseType = "document" support (synchronously checking response === null on an object URL created from a text/html blob).
Use the Chrome WebRequest API to hide X-Requested-With, etc. headers.
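A minimal sketch of both suggestions, not the code from the gist. It assumes a Manifest V2 extension with the webRequest and webRequestBlocking permissions; the names stripHeaders, getPageDocument and HIDDEN_HEADERS are made up for illustration. Newer Chrome versions may also require "extraHeaders" in the listener options before some headers, such as Origin, can be touched.

```javascript
// Headers to hide; this list is an assumption based on the question.
const HIDDEN_HEADERS = ["x-requested-with", "origin"];

// Pure helper: drop unwanted entries from a webRequest-style header array,
// i.e. an array of { name, value } objects. Header names are case-insensitive.
function stripHeaders(requestHeaders, hidden = HIDDEN_HEADERS) {
  return requestHeaders.filter(
    (header) => !hidden.includes(header.name.toLowerCase())
  );
}

// Fetch a page as a Document. Browsers that do not know the "document"
// response type ignore the assignment, so responseType stays "" and the
// text can be parsed with DOMParser instead.
function getPageDocument(url) {
  return new Promise((resolve, reject) => {
    const xhr = new XMLHttpRequest();
    xhr.open("GET", url);
    xhr.responseType = "document";
    const nativeDocument = xhr.responseType === "document";
    xhr.onload = () => {
      if (nativeDocument) {
        if (xhr.response) {
          resolve(xhr.response);
        } else {
          reject(new Error("Response could not be parsed as a document"));
        }
      } else {
        resolve(new DOMParser().parseFromString(xhr.responseText, "text/html"));
      }
    };
    xhr.onerror = () => reject(new Error("Network error while fetching " + url));
    xhr.send();
  });
}

// Register the header filter only when running inside an extension.
if (typeof chrome !== "undefined" && chrome.webRequest) {
  chrome.webRequest.onBeforeSendHeaders.addListener(
    (details) => ({ requestHeaders: stripHeaders(details.requestHeaders) }),
    { urls: ["<all_urls>"] },
    ["blocking", "requestHeaders"]
  );
}
```

From a background page this could then be used as getPageDocument(url).then((doc) => doc.querySelectorAll(".item")).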
A lot of tools have been released since this question was asked.
artoo.js is one of them. It's a piece of JavaScript code meant to be run in your browser's console to provide you with some scraping utilities. It can also be used as a Chrome extension.
If you are fine with looking at something beyond a Google Chrome plugin, look at PhantomJS, which uses Qt WebKit in the background and runs just like a browser, including making AJAX requests. You can call it a headless browser, as it doesn't display its output on a screen and can quietly work in the background while you are doing other stuff. If you want, you can export images or PDFs of the pages it fetches. It provides a JS interface for loading pages, clicking buttons, etc., much like you have in a browser. You can also inject custom JS, for example jQuery, into any of the pages you want to scrape and use it to access the DOM and export the desired data. As it uses WebKit, its rendering behaviour closely matches Google Chrome's.
Another option would be to use Aptana Jaxer, which is based on the Mozilla engine and is a very good concept in itself. It can be used as a simple scraping tool as well.
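For the PhantomJS route, a script might look roughly like the sketch below. It is written in ES5 because PhantomJS ships an older JavaScript engine, and it would be run with the phantomjs binary rather than inside an extension. The URL, the .item selector and the helper name collectItemText are placeholders for illustration.

```javascript
// Runs inside the loaded page (via page.evaluate), so it only uses DOM APIs.
function collectItemText(selector) {
  var nodes = document.querySelectorAll(selector);
  var texts = [];
  for (var i = 0; i < nodes.length; i++) {
    texts.push(nodes[i].textContent);
  }
  return texts;
}

// Only drive a page when the script is actually executed by PhantomJS.
if (typeof phantom !== "undefined") {
  var page = require("webpage").create();
  page.open("http://example.com/", function (status) {
    if (status !== "success") {
      console.log("Failed to load page");
      phantom.exit(1);
      return;
    }
    // Extra arguments to evaluate() are passed on to the function in the page.
    var items = page.evaluate(collectItemText, ".item");
    console.log(JSON.stringify(items));
    phantom.exit();
  });
}
```

The function handed to page.evaluate is sandboxed in the page, so it cannot see variables from the outer script; anything it needs has to be passed as an argument.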
Web scraping is kind of convoluted in a Chrome Extension. Some points:
I'm not sure it's entirely possible with just JavaScript, but if you can set up a dedicated PHP script for your extension that uses cURL to fetch the HTML for a page, the PHP script could scrape the page for you and your extension could read it in through an AJAX request.
The actual page being scraped wouldn't know it's an AJAX request, however, because it is being accessed through cURL.
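On the extension side, that idea could look something like the sketch below. The endpoint https://example.com/scrape.php and its url query parameter are hypothetical; the PHP script behind it would have to be written to match.

```javascript
// Hypothetical address of the PHP/cURL proxy script.
const PROXY_ENDPOINT = "https://example.com/scrape.php";

// Build the proxy request URL, passing the target page as an encoded parameter.
function buildProxyUrl(targetUrl, endpoint = PROXY_ENDPOINT) {
  const url = new URL(endpoint);
  url.searchParams.set("url", targetUrl);
  return url.toString();
}

// Ask the proxy for the page and return its HTML as a string.
async function getPageContent(targetUrl) {
  const response = await fetch(buildProxyUrl(targetUrl));
  if (!response.ok) {
    throw new Error("Proxy returned HTTP " + response.status);
  }
  return response.text();
}
```

The string that getPageContent resolves to can then be handed to jQuery or DOMParser for the .item selection from the question.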
I think you can start from this example.
So basically you can try using an extension + plugin combination. The extension would have access to the DOM (including the plugin) and drive the process, and the plugin would send the actual HTTP requests.
I can recommend using Firebreath as a cross-platform Chrome/Firefox plugin platform; in particular, take a look at this example: Firebreath - Making+HTTP+Requests+with+SimpleStreamsHelper
Couldn't you just do some iframe trickery? If you load the URL into a dedicated frame, you have the DOM in a document object and can do your jQuery selections, no?
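Something along these lines, as a rough sketch. The function name loadInHiddenFrame is made up, and the doc parameter exists only so the helper can be exercised outside a browser. Note that contentDocument is null for cross-origin frames, so for other sites the DOM would probably have to be read by a content script injected into the frame, and pages that send X-Frame-Options will refuse to load in a frame at all.

```javascript
// Load a URL into a hidden iframe and resolve with the frame's document.
function loadInHiddenFrame(url, doc = document) {
  return new Promise((resolve, reject) => {
    const frame = doc.createElement("iframe");
    frame.style.display = "none";
    frame.addEventListener("load", () => {
      // Cross-origin frames expose no document to the embedding page.
      const frameDoc = frame.contentDocument;
      if (frameDoc) {
        resolve(frameDoc);
      } else {
        reject(new Error("Frame document is not accessible (cross-origin?)"));
      }
    });
    frame.src = url;
    doc.body.appendChild(frame);
  });
}
```

Usage would then be loadInHiddenFrame(url).then((frameDoc) => $(frameDoc).find(".item")).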