Converting PDF to text
I need to create a C# or C++ (MFC) application that converts PDF files to txt. I need not only to convert the files but also to remove headers, footers, some garbage characters in the left margin, etc. So the application should allow the user to set page margins to cut off what is not needed. I have actually already created such an application using xpdf, but it gives me some problems when I try to insert custom tags into the extracted text to preserve italics and bold. Maybe somebody could suggest something useful?
Thanks.
Comments (2)
There are shareware and freeware utilities out there. Try getting their source code, or perhaps use them as they are.
A public version of the PDF specification can be found here: Adobe PDF Specification
Shareware PDF readers can be found here: PDF Reader source code @ SourceForge
Please look at Podofo. It's an LGPL-licensed library that has many powerful editing features. One of its examples, txt2pdf IIRC, is a good start: it shows basic text extraction; from there you can check whether pre-filtering (in the PDF engine) or post-filtering (in the text) suffices for your goals. I didn't get to use Pdf Hummus, but it's supposed to have these capabilities too, although it's less straightforward.
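To make the post-filtering idea a bit more concrete, here is a minimal C++ sketch. It assumes the extraction step can hand you each text run together with its position and font style. The `TextRun` and `Margins` structs, the function names, and the `<b>`/`<i>` tags are all hypothetical placeholders and not part of Podofo, xpdf, or any other library; getting real coordinates and font flags out of the PDF is the part the library would have to provide.

```cpp
#include <string>
#include <vector>

// Hypothetical record for one extracted text run. A real PDF library
// would have to supply the position and font-style information.
struct TextRun {
    double x;          // left edge of the run, in page units
    double y;          // baseline, measured from the bottom of the page
    std::string text;  // the extracted characters
    bool bold;
    bool italic;
};

// User-chosen margins, in the same units as the run coordinates.
struct Margins {
    double left, right, top, bottom;
};

// True when the run starts inside the area that remains after the
// margins have been trimmed off the page.
bool insideMargins(const TextRun& r, double pageW, double pageH,
                   const Margins& m) {
    return r.x >= m.left && r.x <= pageW - m.right &&
           r.y >= m.bottom && r.y <= pageH - m.top;
}

// Wrap the run's text in placeholder tags (assumed tag names).
std::string tagRun(const TextRun& r) {
    std::string out = r.text;
    if (r.italic) out = "<i>" + out + "</i>";
    if (r.bold) out = "<b>" + out + "</b>";
    return out;
}

// Drop every run that falls in a margin, tag the rest, and join them
// with single spaces in the order they were given.
std::string filterPage(const std::vector<TextRun>& runs, double pageW,
                       double pageH, const Margins& m) {
    std::string out;
    for (const TextRun& r : runs) {
        if (!insideMargins(r, pageW, pageH, m)) continue;
        if (!out.empty()) out += ' ';
        out += tagRun(r);
    }
    return out;
}
```

Calling `filterPage` once per page with the user's margins would discard headers, footers, and stray characters in the left margin before any tags are written, so the tagging logic never sees the junk. The sketch only checks where a run starts; runs that straddle a margin boundary would need a width field and an extra rule.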