Extracting patterns from online PDF documents with client-side JavaScript

Posted on 2025-02-11 13:03:16

I am trying to extract patterns from online PDFs using a client-side script (Tampermonkey / Greasemonkey, in Firefox or Chrome). The implementation can be browser-specific; I would like to get it working in either one.

I am able to use JS to extract the content and match on it manually in Firefox (which loads pdf.js automatically). E.g. on a PDF URL:

var matchList = document.body.innerText.match(/my_regex/gi);
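
For reference, a minimal sketch of that manual matching step wrapped in a reusable function. The name `extractPatterns` is hypothetical, and `/my_regex/gi` remains a placeholder for the real pattern. The sketch guards against the `null` that `String.prototype.match` returns when nothing matches, and drops exact duplicate matches:

```javascript
// Hypothetical helper: collect the distinct matches of a pattern in a block of text.
function extractPatterns(text, pattern) {
  // Rebuild the regex with the global flag so that every match is returned,
  // not only the first one.
  const flags = pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g";
  const matches = text.match(new RegExp(pattern.source, flags));
  // match() returns null (not an empty array) when there is no match.
  if (matches === null) {
    return [];
  }
  // A Set keeps the first occurrence of each exact string, in order.
  return Array.from(new Set(matches));
}
```

In the Firefox console on a PDF URL this would be called as `extractPatterns(document.body.innerText, /my_regex/gi)`.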

I am now trying to port this into Greasemonkey for a user-script:

// ==UserScript==
// @name     MyExtractor
// @version  1
// @grant    none
// @include  *.pdf
// ==/UserScript==

console.log("User script");
console.log(document.body.innerText); // this JS executed manually logs the PDF to text, but 
alert("HI");

The script doesn't load. Is it possible to get a GM script to execute on a PDF URL in Firefox?

In Chrome, the PDF document seems to be embedded, so even with direct console JS I can't seem to get access to the content. E.g.

> document.getElementsByTagName("embed")[0]
<embed name="some_id" style="position:absolute; left: 0; top: 0;" width="100%" height="100%" src="about:blank" type="application/pdf" internalid="some_id">

This is about as far as I have been able to get with Chrome - is there a way to get the PDF object based on the above element and extract text from it?

With regard to the JS, I do not necessarily need it to run directly on the PDF URL. I could also have it identify a page that contains an anchor whose href points to a PDF, and then fetch and parse that file on request, if possible. Is there a way to fetch and process it with a PDF library?
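
To make that second idea concrete, here is a rough sketch of what I have in mind, not a working solution. It assumes the pdf.js library object (`pdfjsLib`) has somehow been made available to the user script, for example via a `@require` line, and that its worker is configured (pdf.js typically needs `GlobalWorkerOptions.workerSrc` set). The helper names `findPdfLinks`, `extractPdfText` and `matchPatternsInLinkedPdfs` are hypothetical. The pdf.js calls used are `getDocument(url).promise`, `numPages`, `getPage(n)` and `getTextContent()`. Cross-origin PDFs may still be blocked by the browser, which this sketch does not address.

```javascript
// Hypothetical helper: return absolute URLs of anchors whose path ends in ".pdf".
function findPdfLinks(anchors, baseUrl) {
  const links = [];
  for (const anchor of Array.from(anchors)) {
    const href = anchor.getAttribute("href");
    if (!href) {
      continue; // anchor without a usable href
    }
    let url;
    try {
      url = new URL(href, baseUrl); // resolve relative links against the page
    } catch (err) {
      continue; // skip hrefs that cannot be parsed as URLs
    }
    if (url.pathname.toLowerCase().endsWith(".pdf")) {
      links.push(url.href);
    }
  }
  return links;
}

// Hypothetical helper: concatenate the text of every page using the pdf.js API.
// pdfjsLib is passed in as a parameter (assumption: it was loaded elsewhere).
async function extractPdfText(pdfjsLib, url) {
  const pdf = await pdfjsLib.getDocument(url).promise;
  const pages = [];
  for (let pageNumber = 1; pageNumber <= pdf.numPages; pageNumber++) {
    const page = await pdf.getPage(pageNumber);
    const content = await page.getTextContent();
    // Each text item carries its string in the `str` property.
    pages.push(content.items.map((item) => item.str).join(" "));
  }
  return pages.join("\n");
}

// Hypothetical helper: map each linked PDF URL to the pattern matches found in it.
// The pattern is expected to carry the "g" flag so that all matches are returned.
async function matchPatternsInLinkedPdfs(pdfjsLib, anchors, baseUrl, pattern) {
  const results = {};
  for (const url of findPdfLinks(anchors, baseUrl)) {
    const text = await extractPdfText(pdfjsLib, url);
    results[url] = text.match(pattern) || [];
  }
  return results;
}
```

In a user script this would be invoked roughly as `matchPatternsInLinkedPdfs(pdfjsLib, document.querySelectorAll("a[href]"), location.href, /my_regex/gi)`. Passing `pdfjsLib` in as a parameter keeps the logic independent of how the library ends up being loaded, which is exactly the part I have not been able to get working.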

References used so far:

Execute a Greasemonkey script on every page, regardless of page-type (like foo.com/image.jpg)? Do I need to build an extension for this?
Extract text from pdf file using javascript (and followed some of the links). Specifically, I have tried to follow this: How to extract text from PDF in JavaScript. However, I have not been able to create a reference to the PDF source, or to add the library to GM and execute it as expected. Is this a good path to follow to try to solve the problems I am running into?
