Client-side JavaScript to extract patterns from an online PDF document
I am trying to extract patterns from online PDFs using a client-side script (Tampermonkey / Greasemonkey, in Firefox or Chrome). The implementation can be browser-specific; I would like to get it working in either one.
I am able to use JS to extract the content and match on it manually in Firefox (which loads pdf.js automatically). E.g. on a PDF URL:
var matchList = document.body.innerText.match(/my_regex/gi);
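For illustration, the manual matching step could be wrapped in a small helper. This is only a sketch: `extractPatterns` is a hypothetical name, and the invoice-number pattern stands in for `my_regex`. One reason to wrap the call is that `String.prototype.match` returns `null` when nothing matches, whereas this version always returns an array.

```javascript
// Hypothetical helper: collect every match of a pattern in a block of text.
function extractPatterns(text, pattern) {
  // matchAll requires the global flag, so add it if the caller left it out.
  const flags = pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g";
  const globalPattern = new RegExp(pattern.source, flags);
  // Unlike String.prototype.match, this yields [] rather than null on no match.
  return Array.from(text.matchAll(globalPattern), (m) => m[0]);
}

// Illustrative input and pattern (assumptions, standing in for the PDF text and my_regex).
const sampleText = "Paid INV-1001 and inv-2002; pending: INV-3003.";
const sampleMatches = extractPatterns(sampleText, /INV-\d{4}/i);
console.log(sampleMatches);
```

In the Firefox console, the same helper could be called as `extractPatterns(document.body.innerText, /my_regex/i)`.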
I am now trying to port this into a Greasemonkey userscript:
// ==UserScript==
// @name MyExtractor
// @version 1
// @grant none
// @include *.pdf
// ==/UserScript==
console.log("User script");
console.log(document.body.innerText); // this JS executed manually logs the PDF to text, but
alert("HI");
The script doesn't load. Is it possible to get a Greasemonkey script to execute on a PDF URL in Firefox?
In Chrome, the PDF document seems to be embedded, so even with JS run directly in the console I can't seem to get access to the content. E.g.:
> document.getElementsByTagName("embed")[0]
<embed name="some_id" style="position:absolute; left: 0; top: 0;" width="100%" height="100%" src="about:blank" type="application/pdf" internalid="some_id">
This is about as far as I have been able to get with Chrome - is there a way to get the PDF object based on the above element and extract text from it?
With regard to the JS, I do not necessarily need it to run directly on the PDF URL. I can also have it identify a page that contains an anchor whose href points to a PDF, and then fetch and parse that PDF on request, if possible. Is there a way to fetch and process it with a PDF library somehow?
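A rough sketch of that second route, assuming pdf.js is made available to the userscript as `pdfjsLib` (for example through an `@require` line, which is an assumption and not something tested here). `findPdfLinks`, `joinTextItems` and `extractPdfText` are hypothetical helper names; the pdf.js calls used are `getDocument(...).promise`, `numPages`, `getPage` and `getTextContent`. Depending on the pdf.js build, `pdfjsLib.GlobalWorkerOptions.workerSrc` may also need to be set, and cross-origin PDFs may be blocked unless the server allows the request.

```javascript
// Hypothetical helper: keep only hrefs that look like links to PDF files.
function findPdfLinks(hrefs) {
  return hrefs.filter((href) => /\.pdf($|[?#])/i.test(href));
}

// Join the text items pdf.js returns for one page into a single string.
function joinTextItems(items) {
  return items
    .filter((item) => typeof item.str === "string")
    .map((item) => item.str)
    .join(" ");
}

// Extract the text of every page. The library object is passed in so the
// function works with whichever pdf.js build the userscript has loaded.
async function extractPdfText(pdfjsLib, url) {
  const pdf = await pdfjsLib.getDocument(url).promise;
  const pages = [];
  for (let pageNumber = 1; pageNumber <= pdf.numPages; pageNumber++) {
    const page = await pdf.getPage(pageNumber);
    const content = await page.getTextContent();
    pages.push(joinTextItems(content.items));
  }
  return pages.join("\n");
}

// Possible usage on an ordinary HTML page (not executed in this sketch):
// const hrefs = Array.from(document.querySelectorAll("a[href]"), (a) => a.href);
// for (const url of findPdfLinks(hrefs)) {
//   extractPdfText(pdfjsLib, url).then((text) => console.log(text.match(/my_regex/gi)));
// }
```

Running the script on the linking page rather than on the PDF URL itself sidesteps the question of whether a userscript can execute inside the browser's built-in PDF viewer.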
References used so far:
- Execute a Greasemonkey script on every page, regardless of page-type (like foo.com/image.jpg)? Do I need to build an extension for this?
- Extract text from pdf file using javascript (and followed some of the links). Specifically, I have tried to follow "How to extract text from PDF in JavaScript", but have not been able to create a reference to the PDF source or add the library to Greasemonkey and execute it as expected. Is this a good path to follow to try to solve the problems I am running into?