Unicode handling in ReportLab
I am trying to use ReportLab with Unicode characters, but it is not working. I tried tracing through the code until I reached the following line:
class TTFont:
    # ...
    def splitString(self, text, doc, encoding='utf-8'):
        # ...
        cur.append(n & 0xFF)  # <-- here is the problem!
        # ...
(This code can be found in ReportLab's repository, in the file pdfbase/ttfonts.py. The code in question is in line 1059.)
Why is n's value being manipulated?
In the line shown above, n contains the code point of the character being processed (e.g. 65 for 'A', 97 for 'a', or 1588 for Arabic sheen 'ش'). cur is a list that is being filled with the characters to be sent to the final output (AFAIU). Before that line, everything was (apparently) working fine, but in this line, the value of n was manipulated, apparently reducing it to the extended ASCII range!
This causes non-ASCII, Unicode characters to lose their value. I cannot understand how this statement is useful, or why it is necessary!
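To make the arithmetic concrete, here is a small sketch of what n & 0xFF does to the values mentioned above. It is plain Python 3 and does not use ReportLab; the helper name split_bytes is hypothetical. The mask keeps only the low byte, while n >> 8 gives the high byte, so whether anything is really lost depends on whether the surrounding code also keeps the high byte, which the quoted line alone does not show.

```python
def split_bytes(n):
    """Return (high byte, low byte) of a 16-bit value. Hypothetical helper."""
    return n >> 8, n & 0xFF


for ch in ("A", "a", "\u0634"):  # U+0634 is Arabic sheen, code point 1588
    n = ord(ch)
    high, low = split_bytes(n)
    # The original value can be rebuilt only if both bytes are kept.
    print(n, high, low, (high << 8) | low == n)
```

For code points below 256 the high byte is 0, so the mask changes nothing; for 1588 the low byte alone is 52, and the high byte 6 is needed to recover the character.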
So my question is, why is n's value being manipulated here, and how should I go about fixing this issue?
Edit:
In response to the comment regarding my code snippet, here is an example that causes this error:
my_doctemplate.build([Paragraph(bulletText = None, encoding = 'utf8',
    caseSensitive = 1, debug = 0,
    text = '\xd8\xa3\xd8\xa8\xd8\xb1\xd8\xa7\xd8\xac',
    frags = [ParaFrag(fontName = 'DejaVuSansMono-BoldOblique',
        text = '\xd8\xa3\xd8\xa8\xd8\xb1\xd8\xa7\xd8\xac',
        sub = 0, rise = 0, greek = 0, link = None, italic = 0, strike = 0,
        fontSize = 12.0, textColor = Color(0,0,0), super = 0, underline = 0,
        bold = 0)])])
In PDFTextObject._textOut, _formatText is called, which identifies the font as _dynamicFont, and accordingly calls font.splitString, which is causing the error described above.
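For reference, the byte string passed as text in the example is UTF-8. A short Python 3 sketch (standard library only; the variable names raw and code_points are chosen here for illustration) shows which code points it decodes to, which are the values that would end up in n if n holds raw code points as the question assumes:

```python
# The UTF-8 bytes from the example, written as a bytes literal.
raw = b'\xd8\xa3\xd8\xa8\xd8\xb1\xd8\xa7\xd8\xac'

text = raw.decode('utf-8')
code_points = [ord(c) for c in text]

# Show each code point in hex next to its high and low byte.
for cp in code_points:
    print(hex(cp), cp >> 8, cp & 0xFF)
```

All five characters fall in the Arabic block, so they share the same high byte and differ only in the low byte.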
Comments (2)
What do you mean, "not working"? You have misquoted the reportlab source code. What it is actually doing is that the lower and upper byte of each 16-bit unicode character are coded separately (the upper byte is only written out when it changes, which I assume is a PDF-specific optimization to make documents smaller).
Please explain exactly what the problem is, not what you think the underlying reason is. Chances are the characters you want to display simply don't exist in the selected font ('DejaVuSansMono-BoldOblique').
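The scheme described in this comment can be sketched roughly as follows. This is a simplified illustration of the idea only, not ReportLab's actual splitString: it works directly on code points, whereas the real method may renumber characters before splitting. The function name split_by_high_byte is hypothetical.

```python
def split_by_high_byte(text):
    """Group 16-bit values into (high_byte, [low_bytes, ...]) runs.

    A new run is started only when the high byte changes, so the high
    byte is stored once per run rather than once per character.
    """
    runs = []
    cur_high = None
    cur = None
    for ch in text:
        n = ord(ch)
        if n > 0xFFFF:
            raise ValueError("this sketch handles 16-bit values only")
        high = n >> 8
        if high != cur_high:
            cur = []
            runs.append((high, cur))
            cur_high = high
        cur.append(n & 0xFF)  # low byte only; the high byte lives in the run
    return runs


runs = split_by_high_byte("A\u0623\u0628B")
print(runs)
```

In this reading, the masked value appended to cur is not the whole story: the high byte is carried alongside it, which is why the mask by itself would not destroy non-ASCII characters.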
I'm pretty sure you'd need to change 0xFF to 0xFFFF to use 4-byte unicode characters, as ~unutbu suggested, hence using four bytes instead of two.