pdfbox: getting the coordinates of begin-text (BT ... ET) sections

Posted on 2024-11-26 12:34:09

Can someone please help me get real pixel coordinates for PDF begin-text (BT) sections?
I am using pdfbox to retrieve text from PDF files, but now I need to get the rectangles surrounding those text sections/paragraphs.

$contents = $page->getContents();
$contentsStream = $page->getContents()->getStream();
$resources=$page->getResources();
$fonts = $resources->getFonts();
$xobjects = $resources->getImages();
$tokens=$contentsStream->getStreamTokens();
  • [PDFOperator{q}, COSFloat{690.48}, COSInt{0}, COSInt{0}, COSFloat{633.6}, COSInt{0}, COSInt{0}, PDFOperator{cm}, COSName{im1}, PDFOperator{Do}, PDFOperator{Q},

  • PDFOperator{BT}, COSInt{1}, COSInt{0}, COSInt{0}, COSInt{1}, COSFloat{25.92}, COSFloat{588.48}, PDFOperator{Tm}, COSInt{99}, PDFOperator{Tz}, COSName{F30}, COSInt{56}, PDFOperator{Tf}, COSInt{3}, PDFOperator{Tr}, COSFloat{0.334}, PDFOperator{Tc}, COSString{Pospremanj}, PDFOperator{Tj}, COSInt{0}, PDFOperator{Tc}, COSString{e}, PDFOperator{Tj}, COSFloat{9.533}, PDFOperator{Tw}, COSString{ i}, PDFOperator{Tj}, COSFloat{6.062}, PDFOperator{Tw}, COSFloat{0.95}, PDFOperator{Tc}, COSString{ ciscenj}, PDFOperator{Tj}, COSInt{0}, PDFOperator{Tc}, COSString{e }, PDFOperator{Tj}, COSInt{1}, COSInt{0}, COSInt{0}, COSInt{1}, COSFloat{55.68}, COSFloat{539.76}, PDFOperator{Tm}, COSInt{0}, PDFOperator{Tw}, COSFloat{0.262}, PDFOperator{Tc}, COSString{uoè}, PDFOperator{Tj}, COSInt{0}, PDFOperator{Tc}, COSString{i}, PDFOperator{Tj}, COSFloat{5.443}, PDFOperator{Tw}, COSFloat{-2.145}, PDFOperator{Tc}, COSString{ zimslco}, PDFOperator{Tj}, COSInt{0}, PDFOperator{Tc}, COSString{g}, PDFOperator{Tj}, COSFloat{7.202}, PDFOperator{Tw}, COSFloat{-0.148}, PDFOperator{Tc}, COSString{ odmor}, PDFOperator{Tj}, COSInt{0}, PDFOperator{Tc}, COSString{a }, PDFOperator{Tj}, PDFOperator{ET},

  • PDFOperator{BT}, COSInt{1}, COSInt{0}, COSInt{0}, COSInt{1}, COSFloat{6.72}, COSFloat{513.12}, PDFOperator{Tm}, COSInt{0}, PDFOperator{Tw}, COSName{F30}, COSInt{14}, PDFOperator{Tf}, COSString{}, PDFOperator{Tj}, COSFloat{2.751}, PDFOperator{Tw},
    ...

I would like to get output similar to what the PrintTextLocations function produces for every word/character.
I can get the bottom and left coordinates, but how do I get the width and the top coordinate?

PrintTextLocations:

  • string[25.92,45.119995 fs=56.0 xscale=55.440002 height=40.208004 space=15.412322 width=36.978485]p
    string[63.22914,45.119995 fs=56.0 xscale=55.440002 height=40.208004 space=15.412322 width=33.87384]o
    string[97.43364,45.119995 fs=56.0 xscale=55.440002 height=40.208004 space=15.412322 width=30.824646]s
    string[128.58894,45.119995 fs=56.0 xscale=55.440002 height=42.168 space=15.412322 width=33.87384]p
    string[162.79344,45.119995 fs=56.0 xscale=55.440002 height=42.168 space=15.412322 width=21.566162]r
    string[184.69026,45.119995 fs=56.0 xscale=55.440002 height=42.168 space=15.412322 width=30.824646]e
    string[215.84557,45.119995 fs=56.0 xscale=55.440002 height=42.168 space=15.412322 width=49.286148]m
    ...

っ左 2024-12-03 12:34:10

...since a BT section only gives you the bottom-left coordinates, you need to parse through all the words/letters contained in the current BT block to get the other coordinates.
First word's height + BT bottom = top, max(left coordinate + width) = right, and the last word's bottom = bottom coordinate.
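
The rule above can be sketched as a small aggregation over per-glyph boxes. This is only an illustration: `boundingRect` and the array keys are hypothetical names, not part of pdfbox. It assumes y is the baseline measured downward from the top of the page (as in the PrintTextLocations output), and it takes the min/max over all glyphs instead of relying on the first and last word.

```php
<?php
// Hypothetical helper (not part of pdfbox): merges per-glyph boxes into one
// enclosing rectangle. Each glyph is
//   array('x' => left, 'y' => baseline, 'width' => w, 'height' => h)
// with y measured downward from the top of the page (an assumption).
function boundingRect(array $glyphs): ?array
{
    if (count($glyphs) === 0) {
        return null; // nothing to enclose
    }

    $left = INF;
    $top = INF;
    $right = -INF;
    $bottom = -INF;

    foreach ($glyphs as $g) {
        $left   = min($left, $g['x']);                 // leftmost glyph start
        $right  = max($right, $g['x'] + $g['width']);  // max(left + width)
        $top    = min($top, $g['y'] - $g['height']);   // baseline minus height
        $bottom = max($bottom, $g['y']);               // lowest baseline
    }

    return array('left' => $left, 'top' => $top, 'right' => $right, 'bottom' => $bottom);
}

// Two glyphs taken from the PrintTextLocations output in the question.
print_r(boundingRect(array(
    array('x' => 25.92,    'y' => 45.119995, 'width' => 36.978485, 'height' => 40.208004),
    array('x' => 63.22914, 'y' => 45.119995, 'width' => 33.87384,  'height' => 40.208004),
)));
```

Because the bottom edge is a baseline, descenders may extend slightly below the resulting rectangle.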

I hope this helps someone...

Example string for a single letter:

string[32.94,35.099976 fs=8.0 xscale=1.0 height=4.4240003 space=2.2240002 width=3.959999]p

Extracted, parsed and prepared line:

32.94,35.099976 fs=8.0 xscale=1.0 height=4.4240003 space=2.2240002 width=3.959999
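
One way to get from the raw example string to the prepared line is a single regular expression. This is a minimal sketch, assuming every line has the shape `string[<metrics>]<text>`; the name `splitTextLocationLine` is hypothetical and not part of pdfbox.

```php
<?php
// Hypothetical helper (not part of pdfbox): splits one PrintTextLocations
// line into the bracketed metrics and the text that follows the bracket.
function splitTextLocationLine(string $line): ?array
{
    // Assumed shape: string[<metrics>]<text>
    if (!preg_match('/^\s*string\[([^\]]*)\](.*)$/u', $line, $m)) {
        return null; // line does not match the assumed format
    }

    return array('raw' => $m[1], 'elem' => $m[2]);
}

$parts = splitTextLocationLine(
    'string[32.94,35.099976 fs=8.0 xscale=1.0 height=4.4240003 space=2.2240002 width=3.959999]p'
);

echo $parts['raw'], "\n";  // the prepared line, without "string[" and "]"
echo $parts['elem'], "\n"; // the character the metrics belong to
```

The `raw` and `elem` values correspond to the first two arguments of a parser that expects the prepared line and the element of interest.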

Function:

/**
 * Parse a single word / letter element.
 *
 * @param string $str_raw  Extracted word string line (contents of the brackets).
 * @param string $str_elem Element of interest (word or character).
 * @param int    $pdf_w    PDF page width.
 * @param int    $pdf_h    PDF page height.
 * @param int    $pdf_d    Target resolution in DPI.
 * @param int    $pdf_r    PDF reference resolution (72 units per inch).
 *
 * @return array
 */
function createRealCoordinates($str_raw, $str_elem, $pdf_w, $pdf_h, $pdf_d = 400, $pdf_r = 72)
{
    $stringstrip = array('fs=', 'xscale=', 'height=', 'space=', 'width=');
    $string_info = str_replace($stringstrip, '', $str_raw);

    $coord_info = explode(' ', $string_info);
    $coord_xy   = explode(',', $coord_info[0]);

    $coord = array(
        'pdfWidth'  => $pdf_w,
        'pdfHeight' => $pdf_h,
        'pdfDpi'    => $pdf_d,
        'pdfRel'    => $pdf_r,
        'word'      => $str_elem,

        'x1' => null,
        'y1' => null,
        'x2' => null,
        'y2' => null,

        'fontSize'     => null,
        'xScale'       => null,
        'HeightDir'    => null,
        'WidthDir'     => null,
        'WidthOfSpace' => null,
    );

    // Left, Bottom coordinate.
    $coord['x1'] = ($coord_xy[0] / $pdf_r) * $pdf_d;
    $coord['y2'] = ($coord_xy[1] / $pdf_r) * $pdf_d;

    $coord['fontSize']     = $coord_info[1]; // font size.
    $coord['xScale']       = $coord_info[2]; // x size scale.
    $coord['HeightDir']    = $coord_info[3]; // height.
    $coord['WidthDir']     = $coord_info[5]; // word width.
    $coord['WidthOfSpace'] = ($coord_info[4] / $pdf_r) * $pdf_d; // width of space.

    // Right, Top coordinate.
    $coord['x2'] = $coord['x1'] + (($coord['WidthDir'] / $pdf_r) * $pdf_d);
    $coord['y1'] = $coord['y2'] - (($coord['HeightDir'] / $pdf_r) * $pdf_d);

    return $coord;
}

-matija kancijan
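
The scaling used throughout the function above is the usual points-to-pixels conversion: divide by 72 units per inch, then multiply by the target DPI. The sketch below isolates that step; `pointsToPixels` is a hypothetical name, and the defaults simply mirror the 400 and 72 used in the answer.

```php
<?php
// Hypothetical helper: converts a length in PDF points (1/72 inch) to pixels
// at a given raster resolution.
function pointsToPixels(float $points, float $dpi = 400.0, float $unitsPerInch = 72.0): float
{
    return ($points / $unitsPerInch) * $dpi;
}

echo pointsToPixels(72.0), "\n";        // one inch at 400 dpi: prints 400
echo pointsToPixels(36.0, 300.0), "\n"; // half an inch at 300 dpi: prints 150
```

Note that the function in the answer stores `HeightDir` and `WidthDir` unscaled (in points), while `x1`, `y1`, `x2`, `y2` and `WidthOfSpace` are converted to pixels.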
