Extracting text from a PDF using the ToUnicode table

Posted on 2024-12-10 00:11:11

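I am working on a PDF project in which I need to extract all the text from a PDF. I am having trouble decoding an Identity-H font using the ToUnicode table that the PDF itself provides. The ToUnicode table maps character codes to Unicode hex values, but it does not appear to provide a mapping from the uppercase CID characters to Unicode. Is there a way to lowercase the input unichar before mapping it to Unicode with the table?

Can I use the offset between <000C> and <0042> to calculate the uppercase character?

The ToUnicode table: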

57 beginbfchar
<0001> <0020>
<0002> <0021>
<0003> <0026>
<0004> <2019>
<0005> <002C>
<0006> <002D>
<0007> <002E>
<0008> <003A>
<0009> <003F>
<000A> <0040>
<000B> <0041>
<000C> <0042>
<000D> <0043>
<000E> <0044>
<000F> <0045>
<0010> <0046>
<0011> <0047>
<0012> <0048>
<0013> <0049>
<0014> <004A>
<0015> <004B>
<0016> <004C>
<0017> <004D>
<0018> <004F>
<0019> <0050>
<001A> <0052>
<001B> <0053>
<001C> <0054>
<001D> <0055>
<001E> <0057>
<001F> <0059>
<0020> <2018>
<0021> <0061>
<0022> <0062>
<0023> <0063>
<0024> <0064>
<0025> <0065>
<0026> <0066>
<0027> <0067>
<0028> <0068>
<0029> <0069>
<002A> <006A>
<002B> <006B>
<002C> <006C>
<002D> <006D>
<002E> <006E>
<002F> <006F>
<0030> <0070>
<0031> <0072>
<0032> <0073>
<0033> <0074>
<0034> <0075>
<0035> <0077>
<0036> <0079>
<0037> <007A>
<0038> <FB01>
<0039> <00FC>
endbfchar

The table does not provide a glyph that maps to the uppercase characters. So how can I display those characters?
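On the offset question: going by the entries listed in the table above, the difference between a code and its Unicode value is not constant (<000C> maps to <0042>, a difference of 0x36, while <0018> maps to <004F>, a difference of 0x37), so a per-entry lookup looks safer than arithmetic. The following is a minimal sketch of such a lookup in plain C, which can also be compiled as part of an Objective-C source file. The names `BfChar`, `parse_bfchar`, and `lookup` are hypothetical, and the sketch assumes every destination is a single UTF-16 code unit, as in the table above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One bfchar entry: a character code and the UTF-16 code unit it maps to. */
typedef struct {
    unsigned code;
    unsigned unicode;
} BfChar;

/* Parse lines of the form "<000B> <0041>" from `text` into `out`.
   Lines that do not match (such as "57 beginbfchar") are skipped.
   Returns the number of entries stored, at most `cap`. */
static size_t parse_bfchar(const char *text, BfChar *out, size_t cap) {
    assert(text != NULL && out != NULL);
    size_t n = 0;
    const char *p = text;
    while (*p != '\0' && n < cap) {
        unsigned code = 0, uni = 0;
        int used = 0;
        if (sscanf(p, " <%x> <%x>%n", &code, &uni, &used) == 2 && used > 0) {
            out[n].code = code;
            out[n].unicode = uni;
            n++;
            p += used;
        } else {
            /* Not an entry: advance past the end of the current line. */
            while (*p != '\0' && *p != '\n') p++;
            if (*p == '\n') p++;
        }
    }
    return n;
}

/* Return the Unicode value for `code`, or U+FFFD if the table has no entry. */
static unsigned lookup(const BfChar *map, size_t n, unsigned code) {
    for (size_t i = 0; i < n; i++) {
        if (map[i].code == code) return map[i].unicode;
    }
    return 0xFFFD;
}
```

A linear search is enough for a table of 57 entries; a larger table might be sorted by code and searched with `bsearch` instead.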



迷鸟归林 2024-12-17 00:11:11

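I solved the problem. It was caused by CGPDFStringCopyTextString(): the string this function produces from a CGPDFStringRef contained some strange bytes that I did not want. So instead I read the bytes manually, like this: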

// `data` is assumed to be an NSData holding the raw bytes of the CGPDFStringRef.
NSMutableString *unicodeString = [NSMutableString string];
const unsigned char *bytes = [data bytes];
for (NSUInteger i = 0; i < [data length]; i++) {
    // Append each raw byte as a single character.
    [unicodeString appendFormat:@"%c", bytes[i]];
}
return unicodeString;
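For a font that uses the Identity-H encoding, each character code in a string is two bytes long, high byte first, so the raw bytes can also be paired into codes before they are looked up in the ToUnicode table. The following is only a sketch in plain C under that assumption, not the method used in the answer above. The name `bytes_to_codes` is hypothetical, and the input is assumed to come from something like CGPDFStringGetBytePtr() and CGPDFStringGetLength().

```c
#include <assert.h>
#include <stddef.h>

/* Split raw string bytes into 2-byte big-endian character codes.
   Writes at most `cap` codes into `out` and returns how many were written.
   A trailing odd byte, if any, is ignored. */
static size_t bytes_to_codes(const unsigned char *bytes, size_t len,
                             unsigned short *out, size_t cap) {
    assert(bytes != NULL && out != NULL);
    size_t n = 0;
    for (size_t i = 0; i + 1 < len && n < cap; i += 2) {
        /* High byte first, then low byte. */
        out[n++] = (unsigned short)((bytes[i] << 8) | bytes[i + 1]);
    }
    return n;
}
```

Each resulting code can then be matched against the left-hand column of the bfchar table to obtain the Unicode value in the right-hand column.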

