Getting text coordinates with LocationTextExtractionStrategy in iTextSharp
My goal is to retrieve data from a PDF, which may be laid out as a table, and write it to an Excel file.
Using LocationTextExtractionStrategy with iTextSharp, we can get the page content as plain text, read from left to right.
How can I proceed so that, when calling
PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy())
the text retains its position in the resulting string?
For instance, if the first line in the PDF is right-aligned, the resulting string should contain the padding spaces needed to keep the content right-aligned.
Please give some suggestions on how I might achieve this.
Comments (1)
It's very important to understand that PDFs have no support for tables. Anything that looks like a table is really just a bunch of text placed at specific locations over a background of lines. Keep this in mind as you work on the problem.
That said, you need to subclass TextExtractionStrategy and pass that into GetTextFromPage(). See this post for a simple example of that. Then see this post for a more complex example of subclassing. The latter isn't completely relevant to your goal, but it does show some more complex things that you can do.
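As a rough illustration of the subclassing approach, here is a minimal sketch, assuming iTextSharp 5.x, where LocationTextExtractionStrategy exposes an overridable RenderText method. The class name PositionedTextStrategy, the method GetPositionedText, and the CharWidth and LineTolerance constants are hypothetical and not part of iTextSharp. The sketch records where each text chunk starts and then pads each line with spaces in proportion to the chunk's x coordinate, so the result only approximates the page layout.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using iTextSharp.text.pdf.parser;

// Hypothetical subclass: remembers where every text chunk starts on the page.
public class PositionedTextStrategy : LocationTextExtractionStrategy
{
    // Assumption: average glyph width in PDF units; tune this per document.
    private const float CharWidth = 6f;
    // Assumption: baselines closer together than this belong to the same line.
    private const float LineTolerance = 2f;

    // Each entry is (x, y, text) for one rendered chunk.
    private readonly List<Tuple<float, float, string>> chunks =
        new List<Tuple<float, float, string>>();

    public override void RenderText(TextRenderInfo renderInfo)
    {
        base.RenderText(renderInfo);
        // The start point of the baseline is the chunk's lower-left anchor.
        Vector start = renderInfo.GetBaseline().GetStartPoint();
        chunks.Add(Tuple.Create(start[Vector.I1], start[Vector.I2], renderInfo.GetText()));
    }

    // Builds a string in which horizontal position is approximated by padding.
    public string GetPositionedText()
    {
        // Group chunks into lines. PDF y grows upward, so sort descending.
        var lines = new List<List<Tuple<float, float, string>>>();
        foreach (var chunk in chunks.OrderByDescending(c => c.Item2))
        {
            var last = lines.LastOrDefault();
            if (last == null || Math.Abs(last[0].Item2 - chunk.Item2) > LineTolerance)
            {
                last = new List<Tuple<float, float, string>>();
                lines.Add(last);
            }
            last.Add(chunk);
        }

        var result = new StringBuilder();
        foreach (var line in lines)
        {
            var text = new StringBuilder();
            foreach (var chunk in line.OrderBy(c => c.Item1))
            {
                // Convert the x coordinate to a character column and pad up to it.
                int column = (int)(chunk.Item1 / CharWidth);
                if (text.Length < column)
                {
                    text.Append(' ', column - text.Length);
                }
                text.Append(chunk.Item3);
            }
            result.AppendLine(text.ToString());
        }
        return result.ToString();
    }
}
```

With a strategy like this, one would pass an instance to PdfTextExtractor.GetTextFromPage(reader, i, strategy) and then call strategy.GetPositionedText() instead of using the returned string. Because PDF fonts are usually proportional, a fixed CharWidth is only a rough guess. For table data headed to Excel, splitting into columns directly on the recorded x coordinates is likely to be more reliable than counting spaces.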