
How can I get the position of specific text using xpdf or MuPDF?

Posted on 2024-12-06 02:21:51

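I want to extract certain specific text from a PDF file, together with the position of that text on the page.

I know that xpdf and MuPDF can both parse PDF files, so I think they can help me with this task.

But how do I use these two libraries to get the text position?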


断肠人 2024-12-13 02:21:51

If you don't mind using a Python binding for MuPDF, here is a solution using PyMuPDF (I am one of its developers):

import json                     # needed to parse the JSON output below
import fitz                     # the PyMuPDF module

doc = fitz.open("input.pdf")    # PDF input file
n = 0                           # page number (0-based); first page as an example
page = doc[n]

# All words on the page, each with its position. Every item is
# (x0, y0, x1, y1, word, block_no, line_no, word_no).
# Older PyMuPDF releases spelled this page.getTextWords().
wordlist = page.get_text("words")

# Or, if you are only interested in blocks of lines that belong together
# (formerly page.getTextBlocks()):
blocklist = page.get_text("blocks")

# If you need more detail, use the JSON output, which also gives images and
# their positions, as well as font information for the text.
tdict = json.loads(page.get_text("json"))

We are on GitHub if you are interested.
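The question asks for the position of specific text rather than of every word, so here is a minimal sketch of one way to filter the word list for a phrase. It assumes the word tuples have the shape PyMuPDF documents for its word extraction (`page.get_text("words")` in current releases, `getTextWords()` in older ones). The helper `find_phrase` and the sample data are hypothetical and not part of PyMuPDF.

```python
from typing import List, Sequence, Tuple

BBox = Tuple[float, float, float, float]


def find_phrase(words: Sequence[tuple], phrase: str) -> List[BBox]:
    """Return one bounding box per occurrence of `phrase` among `words`.

    `words` is assumed to hold tuples shaped like PyMuPDF's word list:
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    Matching is exact and case-sensitive on whitespace-separated tokens.
    """
    targets = phrase.split()
    hits: List[BBox] = []
    if not targets:
        return hits
    # Slide a window of len(targets) consecutive words over the list.
    for i in range(len(words) - len(targets) + 1):
        window = words[i:i + len(targets)]
        if [w[4] for w in window] == targets:
            # Union of the matched words' rectangles.
            hits.append((
                min(w[0] for w in window),
                min(w[1] for w in window),
                max(w[2] for w in window),
                max(w[3] for w in window),
            ))
    return hits


# Hand-made sample standing in for a real page's word list.
sample_words = [
    (72.0, 100.0, 110.0, 112.0, "Invoice", 0, 0, 0),
    (114.0, 100.0, 150.0, 112.0, "number", 0, 0, 1),
    (154.0, 100.0, 190.0, 112.0, "12345", 0, 0, 2),
]

print(find_phrase(sample_words, "Invoice number"))
# → [(72.0, 100.0, 150.0, 112.0)]
```

With a real document you would pass the page's word list in place of `sample_words`. Note that a phrase broken across two lines yields a single rectangle covering both lines, and that punctuation attached to a word (for example "number:") prevents an exact match. PyMuPDF also offers `page.search_for("...")`, which returns the matching rectangles directly and may be simpler when no custom matching is needed.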
