Converting a PDF file into a nicely formatted table

Posted on 2024-10-25 12:56:15

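I have this PDF file, which is laid out in 5 columns.

I have searched Stack Overflow over and over (and Googled like crazy) and tried every solution I found, including Adobe Acrobat itself as a last resort.

However, for some reason I cannot get these 5 columns into CSV/XLS format, which I need because I have to rearrange them. Usually, when I export them, the formatting is terrible: all entries come out one per line, and some data is lost.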

http://www.2shared.com/document/PagE4A1T/ex1.html

Above is a link to an excerpt of the file. I am really frustrated and out of options.


娇女薄笑 2024-11-01 12:56:15

iText (or iTextSharp) can do this if you can give it the boundaries of those 5 columns and are willing to accept some overhead, namely reparsing the page's text once per column:

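// Rectangle2D is java.awt.geom.Rectangle2D in older iText 5 releases,
// com.itextpdf.awt.geom.Rectangle2D in newer ones; adjust the import to your version.
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;
import com.itextpdf.text.pdf.parser.*;

// buildColumnBoxes() is your own helper: it returns one bounding box per column.
Rectangle2D[] columnBoxArray = buildColumnBoxes();
List<String> columnTexts = new ArrayList<String>(columnBoxArray.length);
for (Rectangle2D columnBBox : columnBoxArray) {

  // Keep only the text that falls inside this column's box.
  FilteredTextRenderListener textInRectStrategy =
    new FilteredTextRenderListener(new LocationTextExtractionStrategy(),
      new RegionTextRenderFilter(columnBBox));

  // reader is an open PdfReader; pageNum is 1-based. Each call reparses the page.
  columnTexts.add(PdfTextExtractor.extractText(reader, pageNum, textInRectStrategy));
}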

Within each column's extracted string, lines of text should be separated by \n, so turning the columns into table rows becomes a simple matter of string parsing.
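As a rough illustration of that string-parsing step, here is a minimal sketch in plain Java that turns the per-column strings into CSV rows. The class and method names (ColumnsToCsv, toCsvRows, quote) are made up for this example and are not part of iText. It assumes every column yields the same number of lines and that line i of each column belongs to row i; an empty cell in the PDF would break that alignment and would need extra handling.

```java
import java.util.ArrayList;
import java.util.List;

class ColumnsToCsv {

    /** Joins the i-th line of every column into the i-th CSV row. */
    public static List<String> toCsvRows(List<String> columnTexts) {
        List<String[]> columns = new ArrayList<>();
        int rowCount = 0;
        for (String text : columnTexts) {
            String[] lines = text.split("\n");
            columns.add(lines);
            rowCount = Math.max(rowCount, lines.length);
        }
        List<String> rows = new ArrayList<>();
        for (int r = 0; r < rowCount; r++) {
            StringBuilder row = new StringBuilder();
            for (int c = 0; c < columns.size(); c++) {
                if (c > 0) row.append(',');
                String[] lines = columns.get(c);
                // Pad short columns with empty cells instead of failing.
                row.append(quote(r < lines.length ? lines[r].trim() : ""));
            }
            rows.add(row.toString());
        }
        return rows;
    }

    /** Quotes a field only when it contains a comma, a quote, or a line break. */
    public static String quote(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    public static void main(String[] args) {
        // Sample data standing in for the columnTexts list built above.
        List<String> columnTexts = List.of("Alice\nBob", "30\n41", "Paris, FR\nOslo");
        toCsvRows(columnTexts).forEach(System.out::println);
    }
}
```

Writing the returned rows to a file, one per line, gives something a spreadsheet can open directly.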

If you don't want to reparse the whole page for each column, you could probably write a custom implementation of FilteredTextRenderListener that takes multiple listener/filter pairs. You could then parse the page once rather than once per column.
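The single-pass idea can be illustrated without iText: visit every piece of text once and offer it to each filter/collector pair. In the sketch below, TextChunk, ColumnPair and SinglePassRouter are hypothetical stand-ins, not iText classes. In a real implementation the routing would presumably live in a RenderListener's renderText method, with RenderFilter.allowText acting as the filter and one text extraction strategy per column acting as the collector.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

class SinglePassRouter {

    /** Hypothetical stand-in for a positioned piece of page text. */
    static final class TextChunk {
        final String text;
        final double x;
        final double y;
        TextChunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** One filter/collector pair: a column box plus the text accepted so far. */
    static final class ColumnPair {
        final Rectangle2D box;
        final StringBuilder collected = new StringBuilder();
        ColumnPair(Rectangle2D box) {
            this.box = box;
        }

        void offer(TextChunk chunk) {
            if (box.contains(chunk.x, chunk.y)) {          // filter step
                if (collected.length() > 0) collected.append('\n');
                collected.append(chunk.text);              // collect step
            }
        }
    }

    /**
     * Visits each chunk exactly once and offers it to every pair.
     * Chunks are assumed to arrive in reading order; this sketch does not sort them.
     */
    static List<String> route(List<TextChunk> chunks, List<ColumnPair> pairs) {
        for (TextChunk chunk : chunks) {
            for (ColumnPair pair : pairs) {
                pair.offer(chunk);
            }
        }
        List<String> columnTexts = new ArrayList<>();
        for (ColumnPair pair : pairs) {
            columnTexts.add(pair.collected.toString());
        }
        return columnTexts;
    }

    public static void main(String[] args) {
        List<ColumnPair> pairs = List.of(
            new ColumnPair(new Rectangle2D.Double(0, 0, 100, 800)),
            new ColumnPair(new Rectangle2D.Double(100, 0, 100, 800)));
        List<TextChunk> chunks = List.of(
            new TextChunk("Alice", 10, 700), new TextChunk("30", 110, 700),
            new TextChunk("Bob", 10, 680), new TextChunk("41", 110, 680));
        System.out.println(route(chunks, pairs));
    }
}
```

The trade-off is the same as in the answer: the page content is walked once, at the cost of testing each chunk against every column box.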
