PDFBox PDFTextStripperByArea coordinates are shifted
I am having issues with coordinates. The PDFTextStripperByArea region seems to be pushed too high.
Consider the following example snippet:
...
PDPage page = (PDPage) allPages.get(0);
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
// define region for extraction -- the coordinates and dimensions are x, y, width, height
Rectangle2D.Float region = new Rectangle2D.Float(x, y, width, height);
stripper.addRegion("test region", region);
// overlay the region with a cyan rectangle to check if I got the coordinates and dimensions right
PDPageContentStream contentStream = new PDPageContentStream(document, page, true, true);
contentStream.setNonStrokingColor( Color.CYAN );
contentStream.fillRect(x, y, width, height );
contentStream.close();
// extract the text from the defined region
stripper.extractRegions(page);
String content = stripper.getTextForRegion("test region");
...
document.save(...); ...
The cyan rectangle overlays the desired region nicely. The stripper, on the other hand, misses a couple of lines at the bottom of the rectangle and includes a couple of lines from above it; the extraction region appears to be shifted upwards along the y axis. What is going on?
Comments (2)
As Christian said in his comment, the problem is that the coordinate system for the fillRect() method and the one for the PDFTextStripperByArea are different.
The first expects the origin to be the lower-left corner of the page, while the second expects it to be the upper-left.
So, to make it work, change the y coordinate of the region given to the PDFTextStripperByArea from y to ph - y - height, where ph is the page height.
PS: I know this is a very old question, but Google brought me here when I faced the same problem, so I will add my answer.
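The coordinate flip described in this answer can be sketched as a small helper. This is only an illustration under stated assumptions: the class name `RegionFlip` and the method `toStripperRegion` are hypothetical, the page is assumed to be unrotated, and its media box is assumed to have its lower-left corner at (0, 0). The page height is passed in as a plain number so the sketch runs without PDFBox on the classpath.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox.
class RegionFlip {

    // Converts a rectangle given in the bottom-left-origin coordinates used for
    // drawing (x, y = lower-left corner) into the top-left-origin rectangle
    // that the text stripper region expects (x, y = upper-left corner).
    // Assumes an unrotated page whose media box starts at (0, 0).
    static Rectangle2D.Float toStripperRegion(float x, float y, float width,
                                              float height, float pageHeight) {
        // The top edge of the drawn rectangle is at y + height from the bottom,
        // so its distance from the top of the page is pageHeight - (y + height).
        return new Rectangle2D.Float(x, pageHeight - y - height, width, height);
    }

    public static void main(String[] args) {
        // Example values only; 792 is the height of a US Letter page in points.
        // With PDFBox, the height would typically be read from the page's
        // media box, e.g. page.getMediaBox().getHeight().
        Rectangle2D.Float region = toStripperRegion(50f, 100f, 200f, 80f, 792f);
        System.out.println(region.y); // prints 612.0
    }
}
```

The returned rectangle would then be passed to `stripper.addRegion(...)`, while the original x, y, width and height are still used for the `fillRect` overlay.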
Text is usually contained inside a positioning rectangle. Sometimes, the text is not at the expected position inside that rectangle, and PDFBox uses that rectangle to try and guess where the text is located. So if text starts outside the capture area and flows into it, it might not be extracted.
Rough sketch: Textbox starts outside the capture area but text flows inside it. It might not be captured.