PDFBox PDFTextStripperByArea coordinates are shifted

Posted on 2024-12-24 22:57:17

I am having trouble with coordinates: the PDFTextStripperByArea region seems to be pushed too high.

Consider the following example snippet:

...
PDPage page = (PDPage) allPages.get(0);
PDFTextStripperByArea stripper = new PDFTextStripperByArea();

// define region for extraction -- the coordinates and dimensions are x, y, width, height
Rectangle2D.Float region = new Rectangle2D.Float(x, y, width, height);
stripper.addRegion("test region", region);

// overlay the region with a cyan rectangle to check if I got the coordinates and dimensions right 
PDPageContentStream contentStream = new PDPageContentStream(document, page, true, true);
contentStream.setNonStrokingColor( Color.CYAN );
contentStream.fillRect(x, y, width, height );
contentStream.close();

// extract the text from the defined region
stripper.extractRegions(page);
String content = stripper.getTextForRegion("test region"); 
... 
document.save(...); ...

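The cyan rectangle overlays the desired region nicely. The stripper, on the other hand, misses a couple of lines at the bottom of the rectangle and includes a couple of lines above it, so the region looks shifted upwards along the y axis. What is going on?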

以可爱出名 2024-12-31 22:57:17

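As Christian said in his comment, the problem is that the fillRect() method and PDFTextStripperByArea use different coordinate systems.

fillRect() expects the origin at the lower-left corner of the page, while PDFTextStripperByArea expects it at the upper-left corner.

So, to make it work, change the region passed to PDFTextStripperByArea to:

Rectangle2D.Float region = new Rectangle2D.Float(x, ph - y - height, width, height);

where ph is the page height:

float ph = page.getMediaBox().getUpperRightY();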
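
To make the flip easier to check in isolation, here is a minimal sketch of the same conversion using only java.awt.geom. The class name RegionFlip and the method toTopLeftOrigin are hypothetical helpers, not part of PDFBox, and the page height of 792 is just an assumed US Letter value. Note also that getUpperRightY() equals the page height only when the media box starts at y = 0; PDRectangle.getHeight() may be the safer choice otherwise.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox: converts a rectangle from
// bottom-left-origin coordinates (as used when drawing) to
// top-left-origin coordinates (as expected by the area stripper).
final class RegionFlip {

    static Rectangle2D.Float toTopLeftOrigin(float x, float y, float width,
                                             float height, float pageHeight) {
        // The top edge of the drawn rectangle sits at y + height above the
        // page bottom, i.e. pageHeight - (y + height) below the page top.
        return new Rectangle2D.Float(x, pageHeight - y - height, width, height);
    }

    public static void main(String[] args) {
        // Assumed page height of 792 pt; rectangle drawn at (50, 100), 200 x 80.
        Rectangle2D.Float region = toTopLeftOrigin(50f, 100f, 200f, 80f, 792f);
        System.out.println(region.y); // prints 612.0
    }
}
```

The x coordinate, width and height are unchanged; only y is mirrored, which is why the unconverted region appears shifted vertically rather than horizontally.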

PS: I know this is a very old question, but Google brought me here when I ran into the same problem, so I am adding my answer.

淡写薰衣草的香 2024-12-31 22:57:17

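Text is usually contained inside a positioning rectangle. Sometimes the text is not at the expected position inside that rectangle, and PDFBox uses the rectangle to guess where the text is located. So if a text box starts outside the capture area and its text flows into the area, that text might not be extracted.

Rough sketch: the text box starts outside the capture area, but the text flows inside it. It might not be captured.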

____________
|Page      |
|   _______|
|   |Area ||
|   |     ||
| ..|.....||
| ⁞ |Text⁞||
| ⁞ |____⁞||
| ⁞......⁞ |
|__________|
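
The effect sketched above can be illustrated without PDFBox. The following is a rough model only, assuming that membership in a region is decided by testing a single anchor point per piece of text against the rectangle; the class AnchorPointCheck, the method captured and all coordinates are hypothetical.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical model, not PDFBox code: a piece of text counts as "inside"
// a region only if its anchor point lies inside the rectangle.
final class AnchorPointCheck {

    static boolean captured(Rectangle2D region, float anchorX, float anchorY) {
        return region.contains(anchorX, anchorY);
    }

    public static void main(String[] args) {
        Rectangle2D region = new Rectangle2D.Float(100f, 100f, 200f, 50f);

        // Anchor lies left of the region, even if the visible text extends into it.
        System.out.println(captured(region, 90f, 120f));  // prints false

        // Anchor lies inside the region.
        System.out.println(captured(region, 110f, 120f)); // prints true
    }
}
```

Under this assumption, widening the capture region slightly so that it covers the start of the text box is one way to pick up text that would otherwise be missed.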

About the author

紙鸢
