How to verify the links in a PDF file

Posted 2024-12-15 02:27:13

I have a PDF file and I want to verify whether the links in it are correct, in the sense that every URL it contains points to a web page and none of them are broken. I am looking for a simple utility or script that can do this easily.

Example:

$ testlinks my.pdf
There are 2348 links in this pdf.
2322 links are proper.
The remaining broken links, and the pages on which they appear, are logged in brokenlinks.txt

I have no idea whether something like that exists, so I googled and also searched Stack Overflow, but have not found anything useful yet. I would be glad to hear any ideas.

Updated: to make the question clear.

Comments (6)

甲如呢乙后呢 2024-12-22 02:27:13

You can use pdf-link-checker

pdf-link-checker is a simple tool that parses a PDF document and checks for broken hyperlinks. It does this by sending simple HTTP requests to each link found in a given document.

To install it with pip:

pip install pdf-link-checker

Unfortunately, one dependency (pdfminer) is broken. To fix it:

pip uninstall pdfminer
pip install pdfminer==20110515
我的奇迹 2024-12-22 02:27:13

I suggest first using the Linux command-line utility 'pdftotext'; you can find the man page here:

pdftotext man page

The utility is part of the Xpdf collection of PDF processing tools, available on most Linux distributions. See http://foolabs.com/xpdf/download.html.

Once installed, you could process the PDF file through pdftotext:

pdftotext file.pdf file.txt

Once the file is processed, a simple Perl script can search the resulting text file for http URLs and retrieve them using LWP::Simple. The get() function from LWP::Simple will let you validate each URL with a code snippet such as:

use LWP::Simple;
$content = get("http://www.sn.no/");
die "Couldn't get it!" unless defined $content;

That would accomplish what you want to do, I think. There are plenty of resources on how to write regular expressions to match http URLs, but a very simple one would look like this:

m/http[^\s]+/i

"http followed by one or more not-space characters" - assuming the URLs are property URL encoded.

说不完的你爱 2024-12-22 02:27:13

There are two lines of enquiry with your question.

Are you looking for regex verification that the link contains key information such as http:// and a valid TLD code? If so, I'm sure a regex expert will drop by, or have a look at regexlib.com, which contains lots of existing regexes for dealing with URLs.

Or, if you want to verify that a website actually exists, then I would recommend Python + Requests, since you could script checks to see whether each site exists and doesn't return an error code (a short sketch follows below).

It's a task which I'm currently undertaking for pretty much the same purpose at work. We have about 54k links to get processed automatically.
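
For the second approach, a minimal sketch with Requests could look like the following (illustrative only, with a placeholder URL; a link counts as broken if the request fails or the server answers with a 4xx/5xx status):

import requests

def link_is_ok(url, timeout=10):
    '''Return True if the URL responds with a non-error status code.'''
    try:
        response = requests.head(url, allow_redirects=True, timeout=timeout)
        if response.status_code >= 400:
            # some servers reject HEAD requests; retry with GET before giving up
            response = requests.get(url, allow_redirects=True, timeout=timeout, stream=True)
        return response.status_code < 400
    except requests.RequestException:
        return False

print(link_is_ok("https://example.com"))   # placeholder URL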

水中月 2024-12-22 02:27:13
  1. Collect links by:
    enumerating links using an API, dumping as text and linkifying the result, or saving as HTML (e.g. with PDFMiner); a rough sketch follows below.

  2. Make requests to check them:
    there are plenty of options depending on your needs.
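
As an illustration of step 1, here is a rough sketch using pdfminer.six; the exact handling of the annotation objects is an assumption and may need adjusting for particular PDFs:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdftypes import resolve1

def iter_pdf_uris(path):
    '''Yield (page_number, uri) for every /URI link annotation in the PDF.'''
    with open(path, 'rb') as fh:
        doc = PDFDocument(PDFParser(fh))
        for pageno, page in enumerate(PDFPage.create_pages(doc), start=1):
            for ref in resolve1(page.annots) or []:
                annot = resolve1(ref)
                action = resolve1(annot.get('A')) if annot else None
                if action and 'URI' in action:
                    uri = resolve1(action['URI'])
                    if isinstance(uri, bytes):
                        uri = uri.decode('utf-8', 'replace')
                    yield pageno, uri

for pageno, uri in iter_pdf_uris('my.pdf'):   # 'my.pdf' is a placeholder
    print(pageno, uri)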

岁月流歌 2024-12-22 02:27:13

The advice in https://stackoverflow.com/a/42178474/1587329 was the inspiration for writing this simple tool (see the gist):

'''loads pdf file in sys.argv[1], extracts URLs, tries to load each URL'''
import urllib
import sys

import PyPDF2

# credits to stackoverflow.com/questions/27744210
def extract_urls(filename):
    '''extracts all urls from filename'''
    PDFFile = open(filename,'rb')
    PDF = PyPDF2.PdfFileReader(PDFFile)
    pages = PDF.getNumPages()

    key = '/Annots'
    uri = '/URI'
    ank = '/A'

    for page in range(pages):
        pageSliced = PDF.getPage(page)
        pageObject = pageSliced.getObject()
        if pageObject.has_key(key):
            ann = pageObject[key]
            for a in ann:
                u = a.getObject()
                if u[ank].has_key(uri):
                    yield u[ank][uri]


def check_http_url(url):
    urllib.urlopen(url)


if __name__ == "__main__":
    for url in extract_urls(sys.argv[1]):
        check_http_url(url)

Save to filename.py, run as python filename.py pdfname.pdf.
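
Note that the script above targets Python 2 and the legacy PyPDF2 API (has_key, urllib.urlopen, PdfFileReader). A rough Python 3 equivalent using the newer pypdf package might look like this; it is a sketch, not part of the original gist:

import sys
import urllib.request

from pypdf import PdfReader   # pip install pypdf

def extract_urls(filename):
    '''Yield every /URI link annotation found in the PDF.'''
    reader = PdfReader(filename)
    for page in reader.pages:
        for annot in page.get("/Annots") or []:
            obj = annot.get_object()
            action = obj.get("/A")
            if action and "/URI" in action:
                yield str(action["/URI"])

if __name__ == "__main__":
    for url in extract_urls(sys.argv[1]):
        try:
            urllib.request.urlopen(url, timeout=10)
        except Exception as exc:
            print("BROKEN:", url, exc)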

情绪操控生活 2024-12-22 02:27:13

There is a tool called pdf_link_check.py which does this and worked fine for me. It actually ran correctly, unlike the pdf-link-checker that you get when running pip install pdf-link-checker.
