Determining "boxes of interest" on a PDF page
I want to be able to determine the bounding box of areas of text, images and paths on a PDF page, similar to what is shown here:
http://www.windjack.com/products/screenshot/pdfcanscreenshot2.html
Looking at the PDF spec, I can see how to determine the bounding boxes of paths and images, but I can't see how to arrive at them for text. Do I have to calculate it by hand, determining the height and width of the glyphs from the font size, etc., or is there a more straightforward way?
Comments (1)
You may be able to start with the solution to "How do I get character offset information from a pdf document?" That gives you the x, y, width and height of characters and/or substrings in the document. From there, the harder part is to bound the groups of characters into spatially distinct regions. There's no guarantee that text grouped together spatially on a page will also be close together in the syntax of the file format...
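To illustrate the grouping step, here is a minimal sketch in Python. It assumes you already have one `(x, y, width, height)` tuple per character from whatever extraction method you use. The function names and the `gap` threshold are hypothetical, not part of any PDF library. Boxes are merged purely by proximity on the page, so their order in the file does not matter.

```python
def near(a, b, gap):
    """True if boxes a and b overlap or lie within `gap` units of each other."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return (ax - gap <= bx + bw and bx - gap <= ax + aw and
            ay - gap <= by + bh and by - gap <= ay + ah)


def merge(a, b):
    """Smallest (x, y, width, height) box enclosing both a and b."""
    x0, y0 = min(a[0], b[0]), min(a[1], b[1])
    x1 = max(a[0] + a[2], b[0] + b[2])
    y1 = max(a[1] + a[3], b[1] + b[3])
    return (x0, y0, x1 - x0, y1 - y0)


def group_regions(boxes, gap=2.0):
    """Group character boxes into spatially distinct regions.

    Uses union-find so that chains of neighbouring boxes end up in one
    region, regardless of the order in which the boxes are supplied.
    """
    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # O(n^2) pairwise comparison: fine for a sketch, slow for dense pages.
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if near(boxes[i], boxes[j], gap):
                parent[find(i)] = find(j)

    regions = {}
    for i, box in enumerate(boxes):
        root = find(i)
        regions[root] = merge(regions[root], box) if root in regions else box

    # Sort by (y, x) so the output order is deterministic.
    return sorted(regions.values(), key=lambda r: (r[1], r[0]))


if __name__ == "__main__":
    chars = [(0, 0, 5, 10), (6, 0, 5, 10), (100, 100, 5, 10)]
    print(group_regions(chars, gap=2))
```

A single fixed `gap` is a simplification. In practice the threshold probably needs to scale with font size, and you may want different horizontal and vertical tolerances so that lines are joined into paragraphs without merging adjacent columns.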