How to verify the links in a PDF file
I have a PDF file and want to verify that the links in it are proper, in the sense that every URL specified points to a live web page and nothing is broken. I am looking for a simple utility or script that can do this easily.
Example:
$ testlinks my.pdf
There are 2348 links in this pdf.
2322 links are proper.
Remaining broken links and page numbers in which it appears are logged in brokenlinks.txt
I have no idea whether something like this exists, so I googled and also searched Stack Overflow, but have not found anything useful yet. I would like to know whether anyone has any ideas about it.
Updated: to make the question clear.
Comments (6)
You can use pdf-link-checker. To install it with pip, run pip install pdf-link-checker. Unfortunately, one dependency (pdfminer) is broken and has to be fixed before the tool will run.
I suggest first using the Linux command-line utility pdftotext; see the pdftotext man page.
The utility is part of the Xpdf collection of PDF processing tools, available on most Linux distributions. See http://foolabs.com/xpdf/download.html.
Once installed, you can run the PDF file through pdftotext to get a plain-text version.
A simple Perl script can then search the resulting text file for http URLs and retrieve them using LWP::Simple. Its get('http://...') function returns undef when the fetch fails, which lets you validate each URL.
That would accomplish what you want to do, I think. There are plenty of resources on how to write regular expressions that match http URLs, but a very simple one amounts to "http followed by one or more non-space characters", assuming the URLs are properly URL encoded.
There are two lines of enquiry in your question.
Are you looking for regex verification that each link contains key information such as http:// and a valid TLD? If so, I'm sure a regex expert will drop by, or have a look at regexlib.com, which contains lots of existing regexes for dealing with URLs.
Or do you want to verify that each website actually exists? In that case I would recommend Python + Requests, since you can script checks to see whether each site exists and does not return an error code.
It's a task I'm currently undertaking at work for pretty much the same purpose. We have about 54k links to process automatically.
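A minimal sketch of the second line of enquiry, assuming the Requests library is installed. The names check_links and broken_links are hypothetical, and treating every status of 400 or above as broken is a simplification: some servers reject HEAD requests, so a fallback to GET may be needed in practice.

```python
def check_links(urls, head=None, timeout=10):
    """Map each URL to its HTTP status code, or None if the request failed."""
    if head is None:
        import requests  # third-party: pip install requests

        head = requests.head
    statuses = {}
    for url in urls:
        try:
            response = head(url, allow_redirects=True, timeout=timeout)
            statuses[url] = response.status_code
        except OSError:
            # requests.RequestException derives from OSError (IOError),
            # so connection errors and timeouts end up here.
            statuses[url] = None
    return statuses


def broken_links(statuses):
    """Return URLs that failed outright or answered with a 4xx/5xx code."""
    return [url for url, code in statuses.items() if code is None or code >= 400]
```

The head parameter exists only so the function can be exercised without network access; left at its default it uses requests.head.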
Collect links by:
enumerating links using an API, or dumping the PDF as text and linkifying the result, or saving it as HTML with PDFMiner.
Make requests to check them:
there is a plethora of options, depending on your needs.
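One of the many possible ways to make the requests is to run them in a thread pool, which helps when there are thousands of links. The sketch below assumes a caller-supplied is_ok(url) function that returns True for a working link; both is_ok and find_broken are hypothetical names, not part of any library.

```python
from concurrent.futures import ThreadPoolExecutor


def find_broken(urls, is_ok, max_workers=16):
    """Run is_ok(url) for every URL in a thread pool; return those that fail."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        results = list(pool.map(is_ok, urls))
    return [url for url, ok in zip(urls, results) if not ok]
```

If is_ok raises an exception, it propagates out of find_broken, so the checker should catch network errors itself and return False.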
The advice in https://stackoverflow.com/a/42178474/1587329 was the inspiration for writing this simple tool (see the gist at https://gist.github.com/serv-inc/0405594483a4115233f47ab19cfbf3b2).
Save it to filename.py and run it as python filename.py pdfname.pdf.
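The gist's code is not reproduced here. As an independent illustration of what such a tool might look like, the following sketch enumerates link annotations with the pypdf library (an assumption, not necessarily what the gist uses) and builds a summary in the style asked for in the question. The names iter_pdf_links, build_report and is_ok are hypothetical.

```python
def iter_pdf_links(pdf_path):
    """Yield (page_number, uri) for every URI link annotation in the PDF."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    for page_number, page in enumerate(reader.pages, start=1):
        if "/Annots" not in page:
            continue
        for ref in page["/Annots"]:
            annotation = ref.get_object()
            if "/A" not in annotation:
                continue  # not an annotation with an action
            action = annotation["/A"]
            if "/URI" in action:
                yield page_number, str(action["/URI"])


def build_report(links, is_ok):
    """Return (summary, broken) for an iterable of (page_number, url) pairs."""
    links = list(links)
    # Check each distinct URL once, even if it appears on several pages.
    verdicts = {url: is_ok(url) for url in {url for _, url in links}}
    broken = [(page, url) for page, url in links if not verdicts[url]]
    summary = (
        f"There are {len(links)} links in this pdf.\n"
        f"{len(links) - len(broken)} links are proper."
    )
    return summary, broken
```

With some checker is_ok, build_report(iter_pdf_links("my.pdf"), is_ok) returns the summary text plus the (page, url) pairs that could be written to brokenlinks.txt.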
There is a tool called pdf_link_check.py which does this and worked fine for me. It actually ran correctly, unlike the pdf-link-checker that you get when running pip install pdf-link-checker.