Using PHP to pattern-match text in the body of a PDF and add hyperlinks
The situation is as follows: I have a series of big, fat PDF files, full of imagery and randomly distributed text. These are the sections of a huge promotional pricelist for a vast array of products. What I need is to pattern-match all the catalogue codes in the text of each PDF file and wrap each one in a hyperlink that points to the respective page in an online store.
So the task is very simple: scan a PDF file for all plain-text 10-digit sequences, and convert those into links whose href is http://something?code=[match].
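The matching half of the task can be sketched on already-extracted text. This is a minimal illustration in Python (the question allows any language), assuming the page text has been pulled out of the PDF by some other tool; `find_codes` is a hypothetical helper name, and the URL is the placeholder from the question.

```python
import re

# Exactly ten ASCII digits, not embedded in a longer run of digits.
CODE_RE = re.compile(r"(?<![0-9])[0-9]{10}(?![0-9])")

# Placeholder URL prefix taken from the question.
BASE_URL = "http://something?code="


def find_codes(text):
    """Return (code, url) pairs for every standalone 10-digit sequence."""
    return [(m.group(), BASE_URL + m.group()) for m in CODE_RE.finditer(text)]


if __name__ == "__main__":
    sample = "Lamp 1234567890 now 19.99, ref 123456789012, chair 0987654321"
    for code, url in find_codes(sample):
        print(code, url)
```

The lookarounds keep a 12-digit reference from being read as a 10-digit code. This says nothing yet about where each code sits on the page, which is the harder part.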
I would also prefer to put this together in a PHP script if possible, but any language would do. I have a gut feeling that maybe even Flash could be an option.
Any ideas? Thanks in advance.
EDIT:
Some of the answers coming in are teaching me PCRE syntax. The problem here is that I need to search and replace in a PDF file, so the problem is twofold. Say we do this in PHP:
- How do you read / write to a PDF in PHP?
- As PDFs aren't plaintext files, I can't just regex against them, and I also believe that PDF links are not bundled together with the text but come separate as regions. Which also means that I could maybe overlay an active rectangle over the coordinates of the catalogue code's characters, if I only knew where a matched code resides on a page.
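The overlay idea can be sketched as follows. In the PDF format a link is indeed a separate annotation object with a rectangle (`/Subtype /Link`, `/Rect`, and a `/URI` action), so no text has to be replaced. The sketch below is in Python and assumes a text extractor has already produced word boxes as `(x0, y0, x1, y1, text)` tuples in PDF user-space points; that input format and the name `link_annotations` are assumptions for illustration, not any particular library's API. It only builds the annotation dictionaries as strings; writing them into a page's `/Annots` array would still need a PDF library.

```python
import re

# A word counts as a catalogue code only if it is exactly ten ASCII digits.
CODE_RE = re.compile(r"[0-9]{10}")

# Placeholder URL prefix taken from the question.
BASE_URL = "http://something?code="


def link_annotations(words):
    """Build one PDF link annotation dictionary per matched code.

    words: iterable of (x0, y0, x1, y1, text), an assumed extractor output.
    Returns a list of annotation dictionaries in PDF syntax, as strings.
    """
    annots = []
    for x0, y0, x1, y1, text in words:
        if CODE_RE.fullmatch(text):
            annots.append(
                "<< /Type /Annot /Subtype /Link "
                f"/Rect [{x0:g} {y0:g} {x1:g} {y1:g}] /Border [0 0 0] "
                f"/A << /S /URI /URI ({BASE_URL}{text}) >> >>"
            )
    return annots


if __name__ == "__main__":
    page_words = [
        (72, 700, 130.5, 712, "1234567890"),
        (140, 700, 160, 712, "19.99"),
    ]
    for annot in link_annotations(page_words):
        print(annot)
```

One caveat: a code that the PDF stores as several separate text fragments would not show up as a single word, so the fragments would have to be merged before matching.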
What do you think? Other languages are also an option.
Thanks.
Comments (2)
Replacing text in a PDF is difficult and none of the open source PDF solutions support this capability.
Apago (www.apago.com) has developed a commercial solution for replacing text in PDF files. It's used by greeting card manufacturers to modify pricing, "MADE IN" text, product numbers, etc.
output
3000 asdf text
5000 asdf