pdfbox: get coordinates of begin-text (BT ... ET) sections
Can someone please help me get the real pixel coordinates of PDF begin-text (BT) sections?
I am using pdfbox to retrieve text from PDF files, but now I need the rectangles surrounding those text sections/paragraphs.
// $page is a PDFBox PDPage object
$contents = $page->getContents();
$contentsStream = $contents->getStream();
$resources = $page->getResources();
$fonts = $resources->getFonts();
$xobjects = $resources->getImages();
// parsed operators and operands of the page content stream
$tokens = $contentsStream->getStreamTokens();
[PDFOperator{q}, COSFloat{690.48}, COSInt{0}, COSInt{0}, COSFloat{633.6}, COSInt{0}, COSInt{0}, PDFOperator{cm}, COSName{im1}, PDFOperator{Do}, PDFOperator{Q},
PDFOperator{BT}, COSInt{1}, COSInt{0}, COSInt{0}, COSInt{1}, COSFloat{25.92}, COSFloat{588.48}, PDFOperator{Tm}, COSInt{99}, PDFOperator{Tz}, COSName{F30}, COSInt{56}, PDFOperator{Tf}, COSInt{3}, PDFOperator{Tr}, COSFloat{0.334}, PDFOperator{Tc}, COSString{Pospremanj}, PDFOperator{Tj}, COSInt{0}, PDFOperator{Tc}, COSString{e}, PDFOperator{Tj}, COSFloat{9.533}, PDFOperator{Tw}, COSString{ i}, PDFOperator{Tj}, COSFloat{6.062}, PDFOperator{Tw}, COSFloat{0.95}, PDFOperator{Tc}, COSString{ ciscenj}, PDFOperator{Tj}, COSInt{0}, PDFOperator{Tc}, COSString{e }, PDFOperator{Tj}, COSInt{1}, COSInt{0}, COSInt{0}, COSInt{1}, COSFloat{55.68}, COSFloat{539.76}, PDFOperator{Tm}, COSInt{0}, PDFOperator{Tw}, COSFloat{0.262}, PDFOperator{Tc}, COSString{uoè}, PDFOperator{Tj}, COSInt{0}, PDFOperator{Tc}, COSString{i}, PDFOperator{Tj}, COSFloat{5.443}, PDFOperator{Tw}, COSFloat{-2.145}, PDFOperator{Tc}, COSString{ zimslco}, PDFOperator{Tj}, COSInt{0}, PDFOperator{Tc}, COSString{g}, PDFOperator{Tj}, COSFloat{7.202}, PDFOperator{Tw}, COSFloat{-0.148}, PDFOperator{Tc}, COSString{ odmor}, PDFOperator{Tj}, COSInt{0}, PDFOperator{Tc}, COSString{a }, PDFOperator{Tj}, PDFOperator{ET},
PDFOperator{BT}, COSInt{1}, COSInt{0}, COSInt{0}, COSInt{1}, COSFloat{6.72}, COSFloat{513.12}, PDFOperator{Tm}, COSInt{0}, PDFOperator{Tw}, COSName{F30}, COSInt{14}, PDFOperator{Tf}, COSString{}, PDFOperator{Tj}, COSFloat{2.751}, PDFOperator{Tw},
...
I would like to get output similar to what PrintTextLocations produces for every word/character.
I can get the bottom and left coordinates, but how do I get the width and the top coordinate?
PrintTextLocations:
string[25.92,45.119995 fs=56.0 xscale=55.440002 height=40.208004 space=15.412322 width=36.978485]p
string[63.22914,45.119995 fs=56.0 xscale=55.440002 height=40.208004 space=15.412322 width=33.87384]o
string[97.43364,45.119995 fs=56.0 xscale=55.440002 height=40.208004 space=15.412322 width=30.824646]s
string[128.58894,45.119995 fs=56.0 xscale=55.440002 height=42.168 space=15.412322 width=33.87384]p
string[162.79344,45.119995 fs=56.0 xscale=55.440002 height=42.168 space=15.412322 width=21.566162]r
string[184.69026,45.119995 fs=56.0 xscale=55.440002 height=42.168 space=15.412322 width=30.824646]e
string[215.84557,45.119995 fs=56.0 xscale=55.440002 height=42.168 space=15.412322 width=49.286148]m
...
Comments (1)
...as the BT section gives you the bottom-left coordinates, you need to parse through all words/letters contained in the current BT block to get all the other coordinates.
First word height + BT bottom = top, max(left coordinate + width) = right, last word bottom = bottom coordinate.
I hope this helps someone...
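The rule above could be sketched roughly as follows. This is only an illustration in Java (PDFBox's own language), not the function originally posted with this answer; the class TextBlockBounds and its methods are hypothetical names, not part of PDFBox. It assumes that the input lines have the PrintTextLocations format shown in the question, that the caller passes only the lines belonging to one BT block, and that y is measured downward from the top of the page, as the sample values suggest. Under that assumption top = y - height, which corresponds to "bottom + height" in PDF user space. The baseline is taken as the bottom edge, so descenders are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of PDFBox): parses PrintTextLocations-style
// lines and computes the rectangle surrounding all glyphs of one text block.
class TextBlockBounds {

    /** One glyph line; y is assumed to grow downward from the page top. */
    static final class Glyph {
        final double x, y, height, width;
        final String text;

        Glyph(double x, double y, double height, double width, String text) {
            this.x = x;
            this.y = y;
            this.height = height;
            this.width = width;
            this.text = text;
        }
    }

    /** Rectangle in the same top-down coordinates as the glyphs. */
    static final class Rect {
        final double left, top, right, bottom;

        Rect(double left, double top, double right, double bottom) {
            this.left = left;
            this.top = top;
            this.right = right;
            this.bottom = bottom;
        }
    }

    // Matches e.g. "string[25.92,45.11 fs=56.0 xscale=55.4 height=40.2 space=15.4 width=36.9]p"
    private static final Pattern GLYPH_LINE = Pattern.compile(
            "string\\[([-0-9.]+),([-0-9.]+) fs=[-0-9.]+ xscale=[-0-9.]+"
            + " height=([-0-9.]+) space=[-0-9.]+ width=([-0-9.]+)\\](.*)");

    /** Extracts every glyph line from the output; non-matching lines are skipped. */
    static List<Glyph> parse(String output) {
        List<Glyph> glyphs = new ArrayList<>();
        for (String line : output.split("\\R")) {
            Matcher m = GLYPH_LINE.matcher(line);
            if (m.find()) {
                glyphs.add(new Glyph(
                        Double.parseDouble(m.group(1)),
                        Double.parseDouble(m.group(2)),
                        Double.parseDouble(m.group(3)),
                        Double.parseDouble(m.group(4)),
                        m.group(5)));
            }
        }
        return glyphs;
    }

    /** Union of all glyph boxes: min left, max (left + width), min top, max baseline. */
    static Rect boundsOf(List<Glyph> glyphs) {
        if (glyphs.isEmpty()) {
            throw new IllegalArgumentException("no glyphs in block");
        }
        double left = Double.POSITIVE_INFINITY;
        double top = Double.POSITIVE_INFINITY;
        double right = Double.NEGATIVE_INFINITY;
        double bottom = Double.NEGATIVE_INFINITY;
        for (Glyph g : glyphs) {
            left = Math.min(left, g.x);
            right = Math.max(right, g.x + g.width);
            top = Math.min(top, g.y - g.height); // y points down, so the top edge is smaller
            bottom = Math.max(bottom, g.y);      // baseline of the lowest line
        }
        return new Rect(left, top, right, bottom);
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
                "string[25.92,45.119995 fs=56.0 xscale=55.440002 height=40.208004 space=15.412322 width=36.978485]p",
                "string[63.22914,45.119995 fs=56.0 xscale=55.440002 height=40.208004 space=15.412322 width=33.87384]o");
        Rect r = boundsOf(parse(sample));
        System.out.printf("left=%.3f top=%.3f right=%.3f bottom=%.3f%n",
                r.left, r.top, r.right, r.bottom);
    }
}
```

Taking the minimum of y - height over all glyphs, rather than only the first word's height, also covers blocks whose tallest glyph is not in the first word. To get the rectangle in PDF user space instead, subtract the top and bottom values from the page height.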
Example string for a single letter:
Extracted, parsed and prepared line:
Function:
-matija kancijan