Extracting data from nested tables in a PDF

Posted on 2024-09-15 22:04:07

  1. I have a few PDF files that were created from Word or Excel files.

  2. I need to get the information that's in the tables.

  3. The text in the document is not an image, so I'm able to extract the text using tools such as PDFBox.

  4. Once I have the text, I have no way of knowing which cells in the table it belongs to, because I don't know where the table borders are.

  5. I've tried a few desktop tools such as ABBYY or Solid PDF Converter, and they are able to convert the files into nice Word documents, but this doesn't suit my needs as I want to be able to do this programmatically in C#.

  6. Some of the tables have nested tables, which I think makes this a little more difficult.

I appreciate your help

Comments (1)

起风了 2024-09-22 22:04:07

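The difficulty here is that the text in the PDF is not contained within any table structure. It might look like a table on the page, but underneath the surface the file typically only holds pieces of text placed at coordinates, with any borders drawn as separate lines.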
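To make that concrete, here is a minimal Python sketch of what you usually have to work with after extraction: a flat list of text fragments with positions and no row or cell information. The `Fragment` type, the sample values, and the `y_tolerance` parameter are all made up for illustration, and the code assumes PDF-style coordinates where y grows upward (some SDKs flip this). Rebuilding even plain text lines already means clustering by position:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """One piece of positioned text, as a hypothetical SDK might report it."""
    x: float  # left edge, in points
    y: float  # baseline, in points (assumed to grow upward, as in PDF user space)
    text: str


def group_into_lines(fragments, y_tolerance=2.0):
    """Cluster fragments into visual lines by comparing baselines.

    A fragment joins the current line when its baseline is within
    y_tolerance of the first fragment on that line.
    """
    lines = []
    # Top of the page first (largest y), then left to right.
    for frag in sorted(fragments, key=lambda f: (-f.y, f.x)):
        if lines and abs(lines[-1][0].y - frag.y) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within a line, order by x so the text reads left to right.
    return [[f.text for f in sorted(line, key=lambda f: f.x)] for line in lines]


# Fragments often arrive out of reading order and with slightly uneven baselines.
fragments = [
    Fragment(200.0, 700.0, "Price"),
    Fragment(72.0, 700.0, "Item"),
    Fragment(72.0, 685.5, "Widget"),
    Fragment(200.0, 686.0, "9.99"),
]
lines = group_into_lines(fragments)
print(lines)
```

Nothing in this output says where a cell starts or ends; that has to be inferred from the positions or supplied by you.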

So there are a couple of options that I can think of, but neither is going to be quite as satisfying as you'd probably like.

  1. There are some companies that offer SDKs for PDF to Excel/Word conversion. Investintech and Iceni are a couple of examples. But these solutions are not free.
  2. If you know the exact layout of the PDF files that you need to extract the table data from, then you can use any SDK that lets you extract text from a PDF and also tells you the exact coordinates of the extracted text. With this method you need to know in advance where the text is going to be, so that you can extract text from a specific area on the page. It obviously won't work if you need to process arbitrary documents.
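As a rough illustration of the second option, the sketch below assigns positioned text to cells using rectangles you define in advance. It is written in Python only to keep it short; the logic is plain arithmetic and should carry over to C#. The `Region` type, the region names, and all coordinates are hypothetical, and the fragments are assumed to come from whatever SDK you use as `(x, y, text)` tuples with y growing upward. A nested table is handled by listing its inner cells before the outer cell that contains them, so the first match wins:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A named rectangle on the page, known in advance from the layout."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x, y):
        return self.x0 <= x < self.x1 and self.y0 <= y < self.y1


def assign_to_cells(fragments, regions):
    """Map (x, y, text) fragments to the first region that contains them.

    Returns (cells, unassigned): a dict of region name to joined text,
    and a list of texts that fell outside every region.
    """
    cells = {r.name: [] for r in regions}
    unassigned = []
    # Reading order: top of the page first, then left to right.
    for x, y, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        for region in regions:
            if region.contains(x, y):
                cells[region.name].append(text)
                break
        else:
            unassigned.append(text)
    return {name: " ".join(parts) for name, parts in cells.items()}, unassigned


# Hypothetical layout: the "address" cell holds a nested two-cell table.
regions = [
    Region("address.street", 300, 650, 450, 670),  # nested cells come first
    Region("address.city", 450, 650, 560, 670),
    Region("address", 300, 640, 560, 700),         # outer cell as a fallback
    Region("name", 72, 640, 300, 700),
]
fragments = [
    (80, 680, "Jane"), (110, 680, "Doe"),
    (310, 685, "Address:"),
    (310, 655, "1 Main St"), (460, 655, "Springfield"),
    (80, 100, "Page 1"),
]
cells, unassigned = assign_to_cells(fragments, regions)
print(cells, unassigned)
```

PDFBox, which the question mentions, ships a region-based text stripper (`PDFTextStripperByArea`) that does the extract-from-a-rectangle part for you, so with it you would only need to supply the rectangles.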

It's a difficult task, but hopefully this will give you a starting point.
