Extracting URLs from a PDF: anchor text and URL do not match

Posted on 2025-01-15 18:11:03, 907 characters, 0 views, 0 comments

I'm using the following code to extract URLs from a PDF. It works fine for extracting the anchor, but it does not work when the anchor text differs from the URL behind it.
For example, 'www.page.com/A' is used as a short URL in the text, but the actual URL behind it is a longer (full) version.

The code I'm using is:

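import urllib.request
import PyPDF2  # legacy camelCase API (PdfFileReader, getPage) of older PyPDF2 releases

# `url` must already hold the address of the PDF to download
urllib.request.urlretrieve(url, "remoteFile")
pdfFile = PyPDF2.PdfFileReader("remoteFile", strict=False)

key = "/Annots"  # page entry that lists the annotations
uri = "/URI"     # target address inside a link action
ank = "/A"       # action dictionary of an annotation
mylist = []

for page_no in range(pdfFile.numPages):
    page = pdfFile.getPage(page_no)
    text = page.extractText()
    pageObject = page.getObject()
    if key in pageObject.keys():
        # iterate over the annotations themselves, not over the page's keys
        ann = pageObject[key]
        for a in ann:
            try:
                u = a.getObject()
                if uri in u[ank].keys():
                    mylist.append(u[ank][uri])
                    print(u[ank][uri])
            except KeyError:
                # annotation has no /A entry, so it is not a URI link
                pass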

As I said, it works fine if the anchor and the link are the same. If the link is different, it saves the anchor. Ideally I would save both (or just the link).
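
One possible way to get "just the link" is to read the target from the link annotation itself rather than from the page text. In the PDF format, a link annotation (`/Subtype /Link`) stores its clickable area in `/Rect` and, for web links, its target under `/A` as `/URI`; the visible anchor text is ordinary page content and is not part of the annotation. The sketch below is only an illustration under that assumption: it treats annotations as dictionary-like objects so that it also runs on plain `dict` data, and the helper names `_resolve` and `collect_uri_links` as well as the sample URL are made up for this example.

```python
def _resolve(obj):
    """Follow an indirect reference if the object offers one; plain dicts pass through."""
    # Assumption: PDF libraries expose either get_object() or the older getObject().
    for name in ("get_object", "getObject"):
        method = getattr(obj, name, None)
        if callable(method):
            return method()
    return obj


def collect_uri_links(annotations):
    """Return (rect, uri) pairs for every URI link in a list of annotation dictionaries."""
    links = []
    for annot in annotations or []:
        annot = _resolve(annot)
        # Only link annotations carry a clickable target.
        if annot.get("/Subtype") != "/Link":
            continue
        action = _resolve(annot.get("/A"))
        # Internal jumps (/GoTo) and links without an action have no /URI entry.
        if not action or "/URI" not in action:
            continue
        rect = tuple(float(v) for v in annot.get("/Rect", ()))
        links.append((rect, str(action["/URI"])))
    return links


# Hypothetical annotation data shaped like a page's /Annots array.
sample = [
    {"/Subtype": "/Link", "/Rect": [72, 700, 200, 712],
     "/A": {"/S": "/URI", "/URI": "https://www.page.com/A/full/path"}},
    {"/Subtype": "/Link", "/Rect": [72, 650, 200, 662],
     "/A": {"/S": "/GoTo", "/D": "chapter1"}},
    {"/Subtype": "/Text", "/Rect": [0, 0, 10, 10]},
]

for rect, target in collect_uri_links(sample):
    print(rect, target)
```

With the code from the question, this could be called as `collect_uri_links(pageObject[key])` for each page that has `/Annots`. Saving the anchor text as well would additionally require text positions to compare against `/Rect`, which `extractText()` does not return.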
