Best way to extract only bold text from a PDF
iTextSharp is a great tool. I can use PdfTextExtractor.GetTextFromPage(reader, iPage) + " "; to pull the text from each page,
and it works great, but is there a way to extract only the bold text (e.g. the headlines) from the PDF, and not everything?
Any solution is useful, regardless of the programming language. Thank you.
Comments (2)
From within iText, you need to use the classes from the com.itextpdf.text.pdf.parser package.
Specifically, you'll need to use a PdfTextExtractor with a custom TextExtractionStrategy that checks the font name. Bold fonts USUALLY have the word "bold" in their name.
Potential Issues:
1) Not everything that looks like text is rendered with fonts and letters. It can be paths or a bitmap. The only way to extract such text is with OCR, and there's no way to get font info.
2) Font Encoding. The bytes that map to the glyphs you're seeing in the PDF may not have a map from those bytes to actual character information.
3) Not all bold-looking text is made with a bold font. Some bold text is made by stroking the text outline with a fairly thin line as well as the usual filling. In this case, the text render mode will be set to "stroke & fill" instead of the usual "fill". This is pretty rare, but it does happen from time to time.
An easy way to test for problems 1 and 2 is to attempt to copy and paste the text within Reader/Acrobat. If you can't select it, it's almost certainly paths or an image. If you can select it but the characters come out as random junk when pasted, then iText will come up with the same junk.
Problem 3 isn't that hard to test for programmatically, though you have to handle it on a case-by-case basis. You need to call TextRenderInfo.getTextRenderMode(). 0 is fill (the standard way of doing things), and 2 is "stroke and fill".
So your TextExtractionStrategy can stub out beginTextBlock, endTextBlock, and renderImage, and have getResultantText return whatever renderText collected. In your renderText implementation, you'll have to check the font name (for "bold", case insensitive) and the text render mode. If either check matches, the text is part of one of your headings.
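The check described above can be sketched in plain Java, kept separate from iText so it can be tried on its own. BoldHeuristic and looksBold are hypothetical names made up for this illustration, not part of iText; the only facts taken from the answer are the case-insensitive "bold" match on the font name and render mode 2 meaning "stroke and fill".

```java
import java.util.Locale;

// Hypothetical helper (not an iText class): decides whether a chunk of text
// "looks bold" from its font name and its PDF text render mode.
final class BoldHeuristic {
    // Text render mode values as described in the answer above.
    static final int FILL = 0;             // the usual mode
    static final int FILL_THEN_STROKE = 2; // "stroke and fill", used for fake bold

    private BoldHeuristic() {}

    static boolean looksBold(String fontName, int textRenderMode) {
        // Case-insensitive search for "bold" anywhere in the font name,
        // so subset-prefixed names such as "ABCDEF+Arial-BoldMT" also match.
        boolean boldName = fontName != null
                && fontName.toLowerCase(Locale.ROOT).contains("bold");
        // Outline stroked in addition to the fill: text drawn to look bold.
        boolean strokedFill = textRenderMode == FILL_THEN_STROKE;
        return boldName || strokedFill;
    }

    public static void main(String[] args) {
        // The font names below are made-up sample inputs.
        System.out.println(looksBold("ABCDEF+Arial-BoldMT", FILL));
        System.out.println(looksBold("TimesNewRomanPSMT", FILL_THEN_STROKE));
        System.out.println(looksBold("TimesNewRomanPSMT", FILL));
    }
}
```

Inside renderText you would feed this from the TextRenderInfo you are handed. Assuming iText 5.x, the inputs would presumably be renderInfo.getFont().getPostscriptFontName() and renderInfo.getTextRenderMode(); append renderInfo.getText() to your result only when the check passes. Note that a stroke-and-fill match can also catch decorative outlined text, so treat it as a hint rather than proof.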
All this is supposing that you are dealing with arbitrary PDF files. If all your PDFs come from the same source, you can start cutting corners. I'll leave that as an Exercise For The Reader.
One of your best bets for this job surely is TET by pdflib.com with its ability to extract to the TETML format. Available for Windows, Mac OS X, Linux, Solaris, AIX, HP-UX...
I'm not sure whether it recognizes "headlines" as such (PDF knows little about structural markup, only visual markup), but it can certainly tell you the exact position and font used by each string of characters.