Unicode handling in ReportLab

Posted on 2024-08-08 03:19:50

I am trying to use ReportLab with Unicode characters, but it is not working. I tried tracing through the code until I reached the following line:

class TTFont:
    # ...
    def splitString(self, text, doc, encoding='utf-8'):
        # ...
        cur.append(n & 0xFF) # <-- here is the problem!
        # ...

(This code can be found in ReportLab's repository, in the file pdfbase/ttfonts.py. The code in question is on line 1059.)

Why is n's value being manipulated?

In the line shown above, n contains the code point of the character being processed (e.g. 65 for 'A', 97 for 'a', or 1588 for Arabic sheen 'ش'). cur is a list that is being filled with the characters to be sent to the final output (AFAIU). Before that line, everything was (apparently) working fine, but in this line, the value of n was manipulated, apparently reducing it to the extended ASCII range!

This causes non-ASCII, Unicode characters to lose their value. I cannot understand how this statement is useful, or why it is necessary!
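To make the effect concrete, here is a small sketch in plain Python, independent of ReportLab. The helper name `low_byte` is made up for illustration, and the sketch assumes that n really is the raw code point at that point in the code, which is my reading of it and may not be what the library intends:

```python
def low_byte(n):
    """Keep only the lowest 8 bits of n, as `n & 0xFF` does."""
    return n & 0xFF


# Code points used as examples in the question: 'A', 'a', Arabic sheen.
for ch in ("A", "a", "\u0634"):
    n = ord(ch)
    print(ch, n, low_byte(n))

# ASCII values pass through unchanged, but 1588 (0x0634) collapses to 52 (0x34),
# which on its own is the code for the digit '4'.
```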

So my question is: why is n's value being manipulated here, and how should I go about fixing this issue?

Edit:
In response to the comment regarding my code snippet, here is an example that causes this error:

my_doctemplate.build([Paragraph(bulletText = None, encoding = 'utf8',
    caseSensitive = 1, debug = 0,
    text = '\xd8\xa3\xd8\xa8\xd8\xb1\xd8\xa7\xd8\xac',
    frags = [ParaFrag(fontName = 'DejaVuSansMono-BoldOblique',
        text = '\xd8\xa3\xd8\xa8\xd8\xb1\xd8\xa7\xd8\xac',
        sub = 0, rise = 0, greek = 0, link = None, italic = 0, strike = 0,
        fontSize = 12.0, textColor = Color(0,0,0), super = 0, underline = 0,
        bold = 0)])])

In PDFTextObject._textOut, _formatText is called, which identifies the font as _dynamicFont, and accordingly calls font.splitString, which is causing the error described above.

Comments (2)

嘿哥们儿 2024-08-15 03:19:50

What do you mean, "not working"? You have misquoted the reportlab source code. What it is actually doing is that the lower and upper byte of each 16-bit unicode character are coded separately (the upper byte is only written out when it changes, which I assume is a PDF-specific optimization to make documents smaller).
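As a rough illustration of that scheme (a simplified sketch of the idea only, not ReportLab's actual implementation; the function name `group_by_high_byte` is hypothetical), the text can be split into runs that share the same upper byte, so the upper byte is recorded once per run and only the lower bytes are stored per character:

```python
def group_by_high_byte(text):
    """Split text into (high_byte, low_bytes) runs of 16-bit code points.

    A new run starts only when the high byte changes, so the high byte
    is written once per run rather than once per character.
    """
    runs = []
    for ch in text:
        n = ord(ch)
        if n > 0xFFFF:
            raise ValueError("this sketch only handles 16-bit code points")
        high, low = n >> 8, n & 0xFF
        if runs and runs[-1][0] == high:
            runs[-1][1].append(low)          # same high byte: extend the run
        else:
            runs.append((high, bytearray([low])))  # high byte changed: new run
    return [(high, bytes(lows)) for high, lows in runs]


# 'A' has high byte 0; the two Arabic letters share high byte 6.
print(group_by_high_byte("A\u0634\u0623"))
```

Seen this way, `n & 0xFF` does not throw the upper byte away; it is simply stored elsewhere, once per run.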

Please explain exactly what the problem is, not what you think the underlying reason is. Chances are the characters you want to display simply don't exist in the selected font ('DejaVuSansMono-BoldOblique').

黯然#的苍凉 2024-08-15 03:19:50

I'm pretty sure you'd need to change 0xFF to 0xFFFF to use 4-byte unicode characters, as ~unutbu suggested, hence using four bytes instead of two.
