Getting text positions when parsing a PDF with Quartz 2D

Posted 2024-09-17 18:41:17

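Another question about PDF parsing... I just read section 5.3.1, "Text-Positioning Operators", of the PDF Reference version 1.7, and I am a little confused.

I wrote some code to get the transformation matrix and the initial text position.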

    // Marked-content operators
    CGPDFOperatorTableSetCallback(table, "MP", &op_MP);   // Define marked-content point
    CGPDFOperatorTableSetCallback(table, "DP", &op_DP);   // Define marked-content point with property list
    CGPDFOperatorTableSetCallback(table, "BMC", &op_BMC); // Begin marked-content sequence
    CGPDFOperatorTableSetCallback(table, "BDC", &op_BDC); // Begin marked-content sequence with property list
    CGPDFOperatorTableSetCallback(table, "EMC", &op_EMC); // End marked-content sequence

    // Text state operators
    CGPDFOperatorTableSetCallback(table, "Tc", &op_Tc); // Character spacing
    CGPDFOperatorTableSetCallback(table, "Tw", &op_Tw); // Word spacing
    CGPDFOperatorTableSetCallback(table, "Tz", &op_Tz); // Horizontal scaling
    CGPDFOperatorTableSetCallback(table, "TL", &op_TL); // Leading
    CGPDFOperatorTableSetCallback(table, "Tf", &op_Tf); // Font and font size
    CGPDFOperatorTableSetCallback(table, "Tr", &op_Tr); // Rendering mode
    CGPDFOperatorTableSetCallback(table, "Ts", &op_Ts); // Text rise

    // Text-showing operators
    CGPDFOperatorTableSetCallback(table, "TJ", &op_TJ); // Show strings with individual position adjustments
    CGPDFOperatorTableSetCallback(table, "Tj", &op_Tj); // Show a string
    CGPDFOperatorTableSetCallback(table, "'", &op_apostrof);          // Move to next line and show a string
    CGPDFOperatorTableSetCallback(table, "\"", &op_double_apostrof);  // Set word and character spacing, move to next line, show a string

    // Text-positioning operators
    CGPDFOperatorTableSetCallback(table, "Td", &op_Td); // Move to the start of the next line, offset by (tx, ty)
    CGPDFOperatorTableSetCallback(table, "TD", &op_TD); // Same as Td, and also set the leading to -ty
    CGPDFOperatorTableSetCallback(table, "Tm", &op_Tm); // Set the text matrix and the text line matrix
    CGPDFOperatorTableSetCallback(table, "T*", &op_T);  // Move to the start of the next line

    // Text object operators
    CGPDFOperatorTableSetCallback(table, "BT", &op_BT); // Begin text object
    CGPDFOperatorTableSetCallback(table, "ET", &op_ET); // End text object

This is the output after launching the application:

    2010-09-02 15:09:23.041 testSearch[8251:207] op_BT begin
    Integer value: 0
    2010-09-02 15:09:23.043 testSearch[8251:207] op_BT end
    2010-09-02 15:09:23.043 testSearch[8251:207] op_Tf begin
    Integer value: 1
    2010-09-02 15:09:23.044 testSearch[8251:207] op_Tf end
    2010-09-02 15:09:23.044 testSearch[8251:207] op_Tm begin
    Float value: 557.364197
    2010-09-02 15:09:23.045 testSearch[8251:207] op_Tm end
    2010-09-02 15:09:23.045 testSearch[8251:207] op_TJ begin
    2010-09-02 15:09:23.046 testSearch[8251:207] Array string value [0]: F
    2010-09-02 15:09:23.046 testSearch[8251:207] Array integer value [1]: 94985208
    2010-09-02 15:09:23.047 testSearch[8251:207] Array string value [2]: r
    2010-09-02 15:09:23.047 testSearch[8251:207] Array integer value [3]: 94985208
    2010-09-02 15:09:23.048 testSearch[8251:207] Array string value [4]: o
    2010-09-02 15:09:23.048 testSearch[8251:207] Array integer value [5]: 94985208
    2010-09-02 15:09:23.049 testSearch[8251:207] Array string value [6]: m s
    2010-09-02 15:09:23.049 testSearch[8251:207] Array integer value [7]: 94985208
    2010-09-02 15:09:23.049 testSearch[8251:207] Array string value [8]: a
    2010-09-02 15:09:23.050 testSearch[8251:207] Array integer value [9]: 94985208
    2010-09-02 15:09:23.050 testSearch[8251:207] Array string value [10]: m
    2010-09-02 15:09:23.051 testSearch[8251:207] Array integer value [11]: 94985208
    2010-09-02 15:09:23.051 testSearch[8251:207] Array string value [12]: p
    2010-09-02 15:09:23.052 testSearch[8251:207] Array integer value [13]: 94985208
    2010-09-02 15:09:23.053 testSearch[8251:207] Array string value [14]: l
    2010-09-02 15:09:23.054 testSearch[8251:207] Array integer value [15]: 94985208
    2010-09-02 15:09:23.055 testSearch[8251:207] Array string value [16]: e t
    2010-09-02 15:09:23.055 testSearch[8251:207] Array integer value [17]: 94985208
    2010-09-02 15:09:23.057 testSearch[8251:207] Array string value [18]: o r
    2010-09-02 15:09:23.057 testSearch[8251:207] Array integer value [19]: 94985208
    2010-09-02 15:09:23.058 testSearch[8251:207] Array string value [20]: e
    2010-09-02 15:09:23.058 testSearch[8251:207] Array integer value [21]: 94985208
    2010-09-02 15:09:23.059 testSearch[8251:207] Array string value [22]: s
    2010-09-02 15:09:23.059 testSearch[8251:207] Array integer value [23]: 94985208
    2010-09-02 15:09:23.060 testSearch[8251:207] Array string value [24]: u
    2010-09-02 15:09:23.061 testSearch[8251:207] Array integer value [25]: 94985208
    2010-09-02 15:09:23.061 testSearch[8251:207] Array string value [26]: l
    2010-09-02 15:09:23.062 testSearch[8251:207] Array integer value [27]: 94985208
    2010-09-02 15:09:23.062 testSearch[8251:207] Array string value [28]: t
    2010-09-02 15:09:23.063 testSearch[8251:207] op_TJ end
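
A note on the `Tm` entry in the output above: `Tm` takes six operands (`a b c d e f`), but the log shows a single float, which suggests the callback pops only one value. In Quartz the operands are popped from the scanner's stack with `CGPDFScannerPopNumber`, last operand first, so the first value popped is `f` (probably the 557.364197 in the log). The following is a minimal sketch in plain C of that pop order; `OperandStack`, `push_number`, `pop_number` and `read_tm` are hypothetical stand-ins for the scanner, not part of Quartz.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the scanner's operand stack. In a real callback
   the values would come from CGPDFScannerPopNumber(scanner, &value). */
typedef struct {
    double values[16];
    size_t count;
} OperandStack;

/* The six numbers of a PDF matrix [a b c d e f]. */
typedef struct { double a, b, c, d, e, f; } TextMatrix;

/* Pushes one operand, as the content stream parser does before it reaches
   the operator. */
void push_number(OperandStack *s, double v) {
    assert(s->count < 16);
    s->values[s->count++] = v;
}

/* Pops the most recently pushed operand; returns 0 if the stack is empty. */
int pop_number(OperandStack *s, double *out) {
    if (s->count == 0) return 0;
    *out = s->values[--s->count];
    return 1;
}

/* Reads the operands of "a b c d e f Tm". They were pushed in the order
   a..f, so they come off the stack as f, e, d, c, b, a. */
int read_tm(OperandStack *s, TextMatrix *m) {
    return pop_number(s, &m->f) && pop_number(s, &m->e) &&
           pop_number(s, &m->d) && pop_number(s, &m->c) &&
           pop_number(s, &m->b) && pop_number(s, &m->a);
}
```

In a real `op_Tm` callback the six `pop_number` calls would presumably become six `CGPDFScannerPopNumber` calls in the same order.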

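If anyone is familiar with the text matrix and the text-positioning operators, it would be great to get an explanation of how all of this works.

How do I calculate the position of the text (or of each glyph?) using Tm (the transformation matrix) and the other data?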

Comments (1)

深海蓝天 2024-09-24 18:41:17

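@Koteg: Hi! Did you finally manage to get it working? For Tm I am able to get all six values, but I still can't see how to get the position of a word within a line...
I have an idea: with Tj, just take the spacing between letters (hoping it is the same every time) and combine it with Tm to get the position of a word.
With TJ it is considerably more complicated: each number in the array is a horizontal translation that has to be subtracted from the Tm position for the part of the array that follows it, and searching for a word in that array will be more complicated than with Tj.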
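
The idea above can be made concrete with the formulas from section 5.3.3 of the PDF Reference. What follows is only a sketch in plain C: `Matrix`, `TextState`, `glyph_advance` and the other names are illustrative assumptions, not Quartz API. It ignores text rise and vertical writing, and it assumes the glyph widths are available from the font's `Widths` array.

```c
#include <assert.h>

/* PDF matrix [a b c d e f]; a point (x, y) maps to
   (a*x + c*y + e, b*x + d*y + f). */
typedef struct { double a, b, c, d, e, f; } Matrix;

/* Text state values set by Tf, Tc, Tw and Tz (field names are illustrative). */
typedef struct {
    double font_size;     /* Tfs, from Tf */
    double char_spacing;  /* Tc */
    double word_spacing;  /* Tw, applies only to the single-byte code 32 */
    double h_scale;       /* Th = Tz / 100, so 1.0 by default */
} TextState;

/* Returns m1 x m2 in the PDF convention: apply m1 first, then m2. */
Matrix matrix_concat(Matrix m1, Matrix m2) {
    Matrix r;
    r.a = m1.a * m2.a + m1.b * m2.c;
    r.b = m1.a * m2.b + m1.b * m2.d;
    r.c = m1.c * m2.a + m1.d * m2.c;
    r.d = m1.c * m2.b + m1.d * m2.d;
    r.e = m1.e * m2.a + m1.f * m2.c + m2.e;
    r.f = m1.e * m2.b + m1.f * m2.d + m2.f;
    return r;
}

/* Horizontal displacement of one glyph in text space:
   tx = (w0 * Tfs + Tc + Tw) * Th, where w0 = width_1000 / 1000. */
double glyph_advance(const TextState *ts, double width_1000, int is_space) {
    double tw = is_space ? ts->word_spacing : 0.0;
    return (width_1000 / 1000.0 * ts->font_size + ts->char_spacing + tw)
           * ts->h_scale;
}

/* Displacement caused by a number in a TJ array. The number is subtracted,
   so a positive value moves the next glyph to the left. */
double tj_number_advance(const TextState *ts, double number) {
    return -number / 1000.0 * ts->font_size * ts->h_scale;
}

/* Moves the text matrix by tx along the text-space x axis:
   Tm' = [1 0 0 1 tx 0] x Tm. */
Matrix advance_text_matrix(Matrix tm, double tx) {
    Matrix t = { 1, 0, 0, 1, tx, 0 };
    return matrix_concat(t, tm);
}

/* Origin of the next glyph in user space: the text-space point (0, 0) mapped
   through Tm and then the CTM, i.e. the translation part of Tm x CTM. */
void glyph_origin(Matrix tm, Matrix ctm, double *x, double *y) {
    assert(x != 0 && y != 0);
    Matrix m = matrix_concat(tm, ctm);
    *x = m.e;
    *y = m.f;
}
```

After each glyph shown by `Tj` or `TJ`, the text matrix would be updated with `advance_text_matrix` using the value from `glyph_advance`; for each number in a `TJ` array, `tj_number_advance` would be used instead.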

By the way, for other people:

// A TJ array mixes strings and numbers in any order, so test each element
// instead of assuming that they alternate.
size_t count = CGPDFArrayGetCount(array);
for(size_t n = 0; n < count; n++)
{
    CGPDFStringRef string;
    if(CGPDFArrayGetString(array, n, &string))
    {
        // Copy rule: the caller owns the returned string and must release it.
        NSString *data = (NSString *)CGPDFStringCopyTextString(string);
        if(data != nil)
        {
            NSLog(@"array data : %@", data);
            [searcher.currentData appendFormat:@"%@", data];
            [data release];
        }
        continue;
    }

    // Position adjustment, in thousandths of a text-space unit.
    CGPDFReal real;
    if(CGPDFArrayGetNumber(array, n, &real))
    {
        NSLog(@"array real : %f", real);
    }
}
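
For the word search mentioned above, one possible approach is to concatenate the strings of the TJ array while recording the x offset of every character, and then search the concatenated text. The sketch below is plain C under loud assumptions: `TJElement`, `layout_tj`, `find_word_x` and the fixed-width `glyph_width_1000` are hypothetical, a real implementation would look widths up in the font, and character spacing, word spacing and horizontal scaling are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One element of a TJ array: either a string to show or a numeric
   adjustment in thousandths of a text-space unit. */
typedef struct {
    const char *text;   /* NULL when the element is a number */
    double adjust;      /* used only when text is NULL */
} TJElement;

/* Hypothetical width lookup. A real implementation would read the font's
   Widths array; here every glyph is 500/1000 of a text-space unit wide. */
double glyph_width_1000(char c) {
    (void)c;
    return 500.0;
}

/* Concatenates the strings of a TJ array into out_text and records, for each
   character, its x offset in text space relative to the start of the array.
   Both output buffers must hold cap entries. Returns the number of
   characters written, excluding the terminator. */
size_t layout_tj(const TJElement *elems, size_t n, double font_size,
                 char *out_text, double *out_x, size_t cap) {
    double x = 0.0;
    size_t len = 0;
    assert(cap > 0);
    for (size_t i = 0; i < n; i++) {
        if (elems[i].text == NULL) {
            /* A number is subtracted from the current position. */
            x -= elems[i].adjust / 1000.0 * font_size;
            continue;
        }
        for (const char *p = elems[i].text; *p != '\0' && len + 1 < cap; p++) {
            out_text[len] = *p;
            out_x[len] = x;
            len++;
            x += glyph_width_1000(*p) / 1000.0 * font_size;
        }
    }
    out_text[len] = '\0';
    return len;
}

/* Stores the x offset of the first occurrence of word in *x and returns 1,
   or returns 0 if the word does not occur in text. */
int find_word_x(const char *text, const double *xs, const char *word,
                double *x) {
    const char *hit = strstr(text, word);
    if (hit == NULL) return 0;
    *x = xs[hit - text];
    return 1;
}
```

The offset found this way is relative to the position set by Tm, so it would still have to be applied to the text matrix to get a position on the page.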

Thanks
