Is it possible to read a PDF file as txt?
I need to find a certain key in a PDF file. As far as I know, the only way to do that is to read the PDF as a text file. I want to do this in PHP without installing an add-on, framework, or similar.
Thanks
Comments (4)
You can certainly open a PDF file as text. The PDF format is essentially a collection of objects. The first line is a header that tells you the version. From there you go to the end of the file to find the offset of the start of the xref table, which records where all the objects are located. The contents of individual objects in the file, such as graphics, are often binary and compressed. The PDF 1.7 specification (ISO 32000-1) describes the format in full.
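The first two steps described above (header, then the xref offset near the end of the file) can be sketched in plain PHP with no extensions. This is only an illustration: the function name `pdfInspect` and the 1024-byte tail window are assumptions, not part of the original answer, and the sketch does not go on to parse the xref table itself.

```php
<?php
// Hypothetical helper (name chosen for illustration): reads the header
// version and the startxref offset from the raw bytes of a PDF.
function pdfInspect(string $bytes): array
{
    $version = null;
    // The header is expected at the very start, e.g. "%PDF-1.7".
    if (preg_match('/^%PDF-(\d+\.\d+)/', $bytes, $m)) {
        $version = $m[1];
    }

    $xrefOffset = null;
    // Near the end of the file: "startxref", the byte offset, then "%%EOF".
    // Assumption: the last 1024 bytes are enough to contain that trailer.
    $tail = substr($bytes, -1024);
    if (preg_match_all('/startxref\s+(\d+)\s+%%EOF/', $tail, $m)) {
        // Incrementally updated files can hold several; take the last one.
        $xrefOffset = (int) end($m[1]);
    }

    return ['version' => $version, 'xrefOffset' => $xrefOffset];
}

// Usage, with a hypothetical file name:
// $info = pdfInspect(file_get_contents('document.pdf'));
// echo $info['version'], ' / xref at byte ', $info['xrefOffset'], "\n";
```

The returned offset is where a fuller reader would seek to find the xref table (or, in newer files, a cross-reference stream) and from there the individual objects.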
I found this function; I hope it helps.
http://community.livejournal.com/php/295413.html
You can't just open the file as text, because it is a binary dump of the objects used to render the PDF, including encodings, fonts, text, and images. I wrote a blog post explaining how text is stored: http://pdf.jpedal.org/java-pdf-blog/bid/27187/Understanding-the-PDF-file-format-text-streams
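To illustrate why a plain text search usually misses the key, here is a rough PHP sketch that inflates each stream before searching it. It rests on several assumptions: the streams use FlateDecode (zlib), PHP's bundled zlib extension is enabled, and the key appears inside a single literal string such as `(some text) Tj`. The function name `pdfStreamsContain` is made up for this example. Hex strings, custom font encodings, nested parentheses, and text split across a `TJ` array with kerning values in between are not handled.

```php
<?php
// Hypothetical helper (name chosen for illustration): returns true if
// $needle appears in the literal strings of any stream in the PDF bytes.
function pdfStreamsContain(string $bytes, string $needle): bool
{
    // Grab the raw bytes between each "stream" / "endstream" pair.
    // A real parser would use the /Length entry of the stream dictionary.
    if (!preg_match_all('/stream\r?\n(.*?)\r?\nendstream/s', $bytes, $m)) {
        return false;
    }

    foreach ($m[1] as $raw) {
        // Assumption: FlateDecode. If inflating fails, search the raw bytes.
        $data = @gzuncompress($raw);
        if ($data === false) {
            $data = $raw;
        }

        // Collect literal strings "(...)", allowing backslash escapes,
        // and join them so adjacent text fragments can match together.
        preg_match_all('/\(((?:\\\\.|[^\\\\()])*)\)/s', $data, $parts);
        $text = implode('', $parts[1]);

        if (strpos($text, $needle) !== false) {
            return true;
        }
    }

    return false;
}

// Usage, with a hypothetical file name and key:
// var_dump(pdfStreamsContain(file_get_contents('document.pdf'), 'my-key'));
```

For large files the lazy regular expression may hit PCRE's backtracking limit, so treat this as a starting point for small documents only.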
Thank you all for your help. I owe you this piece of code: