Getting text position while parsing a PDF with Quartz 2D
Another question regarding PDF parsing...
I just read section 5.3.1, "Text-Positioning Operators", of the PDF Reference, version 1.7, and I am a little confused.
I wrote some code to get the transformation matrix and the initial text position.
//marked-content operators
CGPDFOperatorTableSetCallback(table, "MP", &op_MP);//Define marked-content point
CGPDFOperatorTableSetCallback(table, "DP", &op_DP);//Define marked-content point with property list
CGPDFOperatorTableSetCallback(table, "BMC", &op_BMC);//Begin marked-content sequence
CGPDFOperatorTableSetCallback(table, "BDC", &op_BDC);//Begin marked-content sequence with property list
CGPDFOperatorTableSetCallback(table, "EMC", &op_EMC);//End marked-content sequence
//text state operators
CGPDFOperatorTableSetCallback(table, "Tc", &op_Tc);//Set character spacing
CGPDFOperatorTableSetCallback(table, "Tw", &op_Tw);//Set word spacing
CGPDFOperatorTableSetCallback(table, "Tz", &op_Tz);//Set horizontal scaling
CGPDFOperatorTableSetCallback(table, "TL", &op_TL);//Set text leading
CGPDFOperatorTableSetCallback(table, "Tf", &op_Tf);//Set font and font size
CGPDFOperatorTableSetCallback(table, "Tr", &op_Tr);//Set text rendering mode
CGPDFOperatorTableSetCallback(table, "Ts", &op_Ts);//Set text rise
//text showing operators
CGPDFOperatorTableSetCallback(table, "TJ", &op_TJ);//Show text with individual glyph positioning
CGPDFOperatorTableSetCallback(table, "Tj", &op_Tj);//Show text
CGPDFOperatorTableSetCallback(table, "'", &op_apostrof);//Move to next line and show text
CGPDFOperatorTableSetCallback(table, "\"", &op_double_apostrof);//Set word and character spacing, move to next line, show text
//text positioning operators
CGPDFOperatorTableSetCallback(table, "Td", &op_Td);//Move to start of next line, offset by (tx, ty)
CGPDFOperatorTableSetCallback(table, "TD", &op_TD);//Same as Td, and set leading to -ty
CGPDFOperatorTableSetCallback(table, "Tm", &op_Tm);//Set text matrix and text line matrix
CGPDFOperatorTableSetCallback(table, "T*", &op_T);//Move to start of next line
//text object operators
CGPDFOperatorTableSetCallback(table, "BT", &op_BT);//Begin text object
CGPDFOperatorTableSetCallback(table, "ET", &op_ET);//End text object
This is the output after launching the application:
2010-09-02 15:09:23.041 testSearch[8251:207] op_BT begin
Integer value: 0
2010-09-02 15:09:23.043 testSearch[8251:207] op_BT end
2010-09-02 15:09:23.043 testSearch[8251:207] op_Tf begin
Integer value: 1
2010-09-02 15:09:23.044 testSearch[8251:207] op_Tf end
2010-09-02 15:09:23.044 testSearch[8251:207] op_Tm begin
Float value: 557.364197
2010-09-02 15:09:23.045 testSearch[8251:207] op_Tm end
2010-09-02 15:09:23.045 testSearch[8251:207] op_TJ begin
2010-09-02 15:09:23.046 testSearch[8251:207] Array string value [0]: F
2010-09-02 15:09:23.046 testSearch[8251:207] Array integer value [1]: 94985208
2010-09-02 15:09:23.047 testSearch[8251:207] Array string value [2]: r
2010-09-02 15:09:23.047 testSearch[8251:207] Array integer value [3]: 94985208
2010-09-02 15:09:23.048 testSearch[8251:207] Array string value [4]: o
2010-09-02 15:09:23.048 testSearch[8251:207] Array integer value [5]: 94985208
2010-09-02 15:09:23.049 testSearch[8251:207] Array string value [6]: m s
2010-09-02 15:09:23.049 testSearch[8251:207] Array integer value [7]: 94985208
2010-09-02 15:09:23.049 testSearch[8251:207] Array string value [8]: a
2010-09-02 15:09:23.050 testSearch[8251:207] Array integer value [9]: 94985208
2010-09-02 15:09:23.050 testSearch[8251:207] Array string value [10]: m
2010-09-02 15:09:23.051 testSearch[8251:207] Array integer value [11]: 94985208
2010-09-02 15:09:23.051 testSearch[8251:207] Array string value [12]: p
2010-09-02 15:09:23.052 testSearch[8251:207] Array integer value [13]: 94985208
2010-09-02 15:09:23.053 testSearch[8251:207] Array string value [14]: l
2010-09-02 15:09:23.054 testSearch[8251:207] Array integer value [15]: 94985208
2010-09-02 15:09:23.055 testSearch[8251:207] Array string value [16]: e t
2010-09-02 15:09:23.055 testSearch[8251:207] Array integer value [17]: 94985208
2010-09-02 15:09:23.057 testSearch[8251:207] Array string value [18]: o r
2010-09-02 15:09:23.057 testSearch[8251:207] Array integer value [19]: 94985208
2010-09-02 15:09:23.058 testSearch[8251:207] Array string value [20]: e
2010-09-02 15:09:23.058 testSearch[8251:207] Array integer value [21]: 94985208
2010-09-02 15:09:23.059 testSearch[8251:207] Array string value [22]: s
2010-09-02 15:09:23.059 testSearch[8251:207] Array integer value [23]: 94985208
2010-09-02 15:09:23.060 testSearch[8251:207] Array string value [24]: u
2010-09-02 15:09:23.061 testSearch[8251:207] Array integer value [25]: 94985208
2010-09-02 15:09:23.061 testSearch[8251:207] Array string value [26]: l
2010-09-02 15:09:23.062 testSearch[8251:207] Array integer value [27]: 94985208
2010-09-02 15:09:23.062 testSearch[8251:207] Array string value [28]: t
2010-09-02 15:09:23.063 testSearch[8251:207] op_TJ end
If someone is familiar with the text matrix and the text-positioning operators, it would be nice to have an explanation of how all those things work.
How do I calculate the position of the text (or of a glyph?) using Tm (the transformation matrix) and the other data?
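To make the question concrete, here is a minimal sketch in plain C of how I understand section 5.3.1: BT resets the text matrix and text line matrix to identity, Tm replaces both with its six operands, and Td pre-multiplies the text line matrix by a translation. It does not use Quartz at all; the names `Matrix`, `TextState`, `text_begin`, `text_set_matrix`, `text_move` and `text_origin` are hypothetical and made up for illustration, and the CTM is assumed to be tracked separately by the caller.

```c
/* PDF matrix [a b c d e f], standing for the 3x3 matrix
 *   | a b 0 |
 *   | c d 0 |
 *   | e f 1 |
 * Points are row vectors: [x y 1] * M. */
typedef struct { double a, b, c, d, e, f; } Matrix;
typedef struct { double x, y; } Point;

/* Hypothetical state holder: text matrix (Tm) and text line matrix (Tlm). */
typedef struct { Matrix tm; Matrix tlm; } TextState;

/* Returns the product m * n. */
Matrix matrix_concat(Matrix m, Matrix n) {
    Matrix r;
    r.a = m.a * n.a + m.b * n.c;
    r.b = m.a * n.b + m.b * n.d;
    r.c = m.c * n.a + m.d * n.c;
    r.d = m.c * n.b + m.d * n.d;
    r.e = m.e * n.a + m.f * n.c + n.e;
    r.f = m.e * n.b + m.f * n.d + n.f;
    return r;
}

/* BT: both matrices start as the identity matrix. */
void text_begin(TextState *ts) {
    Matrix identity = {1, 0, 0, 1, 0, 0};
    ts->tm = identity;
    ts->tlm = identity;
}

/* a b c d e f Tm: replaces (does not concatenate) both matrices. */
void text_set_matrix(TextState *ts, Matrix m) {
    ts->tm = m;
    ts->tlm = m;
}

/* tx ty Td: Tlm = translation(tx, ty) * Tlm, then Tm = Tlm. */
void text_move(TextState *ts, double tx, double ty) {
    Matrix t = {1, 0, 0, 1, tx, ty};
    ts->tlm = matrix_concat(t, ts->tlm);
    ts->tm = ts->tlm;
}

/* Origin of the next glyph: the point (0, 0) of text space
 * mapped through Tm and then through the CTM. */
Point text_origin(const TextState *ts, Matrix ctm) {
    Matrix m = matrix_concat(ts->tm, ctm);
    Point p = { m.e, m.f };
    return p;
}
```

For example, after `2 0 0 2 100 200 Tm` followed by `10 5 Td`, with an identity CTM, `text_origin` gives (120, 210): the Td offset is scaled by the matrix that Tm installed. Font size, horizontal scaling and text rise are not part of this sketch; they only affect how each glyph is drawn and how far the text matrix advances after it.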
Comments (1)
@Koteg: Hi! Did you finally manage to get it working? For Tm, I'm able to get all six values, but for now I can't see how to get the position of a word within a line...
I have an idea: for Tj, just get the spacing between letters (hoping it is the same every time) and, together with Tm, work out the position of a word.
For TJ, this is considerably more complicated: each number in the array is a horizontal translation that has to be subtracted from the Tm matrix before the next part of the array is shown, and searching for a word in that array will be harder than for Tj.
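A rough sketch of that idea in plain C, assuming the displacement formula from the PDF Reference, tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th, a simple single-byte font, and horizontal writing. Everything here is hypothetical naming of mine (`TextParams`, `TJElement`, `tj_walk`, and so on); `fixed_width_500` is only a stand-in for a real lookup in the font's Widths array, and real PDF strings are not NUL-terminated C strings.

```c
#include <stddef.h>

/* Text state parameters that affect glyph displacement. */
typedef struct {
    double font_size;   /* Tfs, set by Tf */
    double char_space;  /* Tc */
    double word_space;  /* Tw, applied only to the single-byte code 32 */
    double h_scale;     /* Th = Tz operand / 100 */
} TextParams;

/* One element of a TJ array: a string, or (when text is NULL) a number. */
typedef struct {
    const char *text;
    double adjust;      /* in thousandths of a unit of text space */
} TJElement;

/* Glyph width in thousandths of a text space unit for a character code. */
typedef double (*GlyphWidthFn)(unsigned char code);

/* Stand-in width function: every glyph is 500/1000 wide (assumption). */
double fixed_width_500(unsigned char code) {
    (void)code;
    return 500.0;
}

/* Horizontal displacement caused by showing one glyph. */
double glyph_advance(const TextParams *p, double width1000, int is_space) {
    double tx = width1000 / 1000.0 * p->font_size + p->char_space;
    if (is_space) tx += p->word_space;
    return tx * p->h_scale;
}

/* Horizontal displacement caused by a number in a TJ array:
 * a positive number moves the next glyph to the left. */
double adjust_advance(const TextParams *p, double adjust) {
    return -adjust / 1000.0 * p->font_size * p->h_scale;
}

/* Walks a TJ array. offsets[i] receives the x offset, in text space and
 * relative to the text matrix at the start of TJ, at which element i
 * begins. Returns the total displacement of the whole array. */
double tj_walk(const TextParams *p, const TJElement *els, size_t n,
               GlyphWidthFn width_of, double *offsets) {
    double x = 0.0;
    for (size_t i = 0; i < n; i++) {
        offsets[i] = x;
        if (els[i].text != NULL) {
            const unsigned char *c = (const unsigned char *)els[i].text;
            for (; *c != '\0'; c++)
                x += glyph_advance(p, width_of(*c), *c == 32);
        } else {
            x += adjust_advance(p, els[i].adjust);
        }
    }
    return x;
}
```

A plain Tj is then just the one-string case of the same walk. To turn an offset x into a page position, map the point (x, 0) through the current Tm (and the CTM). To search for a word, the string parts would be concatenated first, keeping the offset at which each part starts.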
BTW, for other people:
Thanks