How to extract highlighted parts from a PDF file

Posted on 2025-01-02 02:17:59, 86 words, 0 views, 0 comments

Is there any way to extract highlighted text from a PDF file programmatically? Any language is welcome. I have found several libraries for Python, Java, and PHP, but none of them does the job.

Comments (2)

多彩岁月 2025-01-09 02:17:59

To extract the highlighted parts, you can use PyMuPDF. Here is an example that works with this PDF file (direct download):

# Based on https://stackoverflow.com/a/62859169/562769

from typing import List, Tuple

import fitz  # install with 'pip install pymupdf'

# One entry of page.get_text("words"):
# (x0, y0, x1, y1, word, block_no, line_no, word_no)
Word = Tuple[float, float, float, float, str, int, int, int]


def _parse_highlight(annot: fitz.Annot, wordlist: List[Word]) -> str:
    # A highlight stores four vertices per highlighted quad
    points = annot.vertices
    quad_count = len(points) // 4
    sentences = []
    for i in range(quad_count):
        # bounding rectangle of the i-th highlighted quad
        r = fitz.Quad(points[i * 4 : i * 4 + 4]).rect

        # keep every word whose box overlaps that rectangle
        words = [w for w in wordlist if fitz.Rect(w[:4]).intersects(r)]
        sentences.append(" ".join(w[4] for w in words))
    return " ".join(sentences)


def handle_page(page: fitz.Page) -> List[str]:
    wordlist = page.get_text("words")  # list of words on page
    wordlist.sort(key=lambda w: (w[3], w[0]))  # ascending y, then x

    highlights = []
    annot = page.first_annot
    while annot:
        if annot.type[0] == 8:  # 8 is the type code of a highlight annotation
            highlights.append(_parse_highlight(annot, wordlist))
        annot = annot.next
    return highlights


def main(filepath: str) -> List[str]:
    highlights = []
    with fitz.open(filepath) as doc:  # closes the file when done
        for page in doc:
            highlights += handle_page(page)
    return highlights


if __name__ == "__main__":
    print(main("PDF-export-example-with-notes.pdf"))

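
One thing to watch for: the `intersects` check above accepts any word whose box touches the highlight rectangle at all, so words from the line above or below may slip in when the highlight is a little taller than the text. A possible workaround, sketched here on plain `(x0, y0, x1, y1)` tuples so it runs without PyMuPDF, is to keep a word only if a large enough share of its box lies inside the highlight. The names `overlap_fraction` and `words_in_highlight` and the 0.5 threshold are my own assumptions, not part of PyMuPDF, and the threshold will likely need tuning per document.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def overlap_fraction(word: Box, highlight: Box) -> float:
    """Return the share of the word's area that lies inside the highlight."""
    word_area = (word[2] - word[0]) * (word[3] - word[1])
    if word_area <= 0:
        return 0.0
    # corners of the intersection rectangle
    x0 = max(word[0], highlight[0])
    y0 = max(word[1], highlight[1])
    x1 = min(word[2], highlight[2])
    y1 = min(word[3], highlight[3])
    if x1 <= x0 or y1 <= y0:
        return 0.0  # the boxes do not overlap
    return (x1 - x0) * (y1 - y0) / word_area


def words_in_highlight(
    words: Sequence[tuple], highlight: Box, threshold: float = 0.5
) -> List[str]:
    """Keep words whose box is covered by the highlight to at least `threshold`.

    `words` uses the same tuple layout as page.get_text("words"):
    the first four items are the box, the fifth is the text.
    """
    return [
        w[4] for w in words
        if overlap_fraction(tuple(w[:4]), highlight) >= threshold
    ]


if __name__ == "__main__":
    sample_words = [
        (10, 10, 50, 20, "Hello", 0, 0, 0),
        (55, 10, 90, 20, "world", 0, 0, 1),
        (10, 21, 50, 31, "below", 0, 1, 0),  # next line, barely touched
    ]
    print(words_in_highlight(sample_words, (8, 9, 92, 22)))
```

To try it in the script above, you could replace the `intersects` filter in `_parse_highlight` with a call to `words_in_highlight(wordlist, tuple(r))`, since a `fitz.Rect` unpacks into its four coordinates.
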
仙女 2025-01-09 02:17:59

OK, after looking around I found a solution for exporting highlighted text from a PDF to a text file. It is not very hard:

  1. First, highlight your text with whatever tool you like (in my case, I highlight while reading on an iPad using the GoodReader app).

  2. Transfer your PDF to a computer and open it with Skim (a PDF reader that is free and easy to find on the web).

  3. On FILE, choose CONVERT NOTES and convert all the notes in your document to SKIM NOTES.

  4. That's all: simply go to EXPORT and choose EXPORT SKIM NOTES. It will export a list of your highlighted text. Once opened, this list can be exported again to a .txt file.

Not much work to do, and the result is fantastic.
