How to control padding of Unicode strings containing East Asian characters
I have three UTF-8 strings:
hello, world
hello, 世界
hello, 世rld
I want only the first 10 ASCII-character widths of each string, so that the brackets line up in one column:
[hello, wor]
[hello, 世 ]
[hello, 世r]
In console:
width('世界')==width('worl')
width('世 ')==width('wor') # a space after '世'
One Chinese character is three bytes in UTF-8, but it is only 2 ASCII characters wide when displayed in the console:
>>> bytes("hello, 世界", encoding='utf-8')
b'hello, \xe4\xb8\x96\xe7\x95\x8c'
Python's format() doesn't help when such characters are mixed in:
>>> for s in ['[{0:<{1}.{1}}]'.format(s, 10) for s in ['hello, world', 'hello, 世界', 'hello, 世rld']]:
... print(s)
...
[hello, wor]
[hello, 世界 ]
[hello, 世rl]
It's not pretty:
-----------Songs-----------
| 1: 蝴蝶 |
| 2: 心之城 |
| 3: 支持你的爱人 |
| 4: 根生的种子 |
| 5: 鸽子歌(CUCURRUCUCU PALO|
| 6: 林地之间 |
| 7: 蓝光 |
| 8: 在你眼里 |
| 9: 肖邦离别曲 |
| 10: 西行( 魔戒王者再临主题曲)(INTO |
| X 11: 深陷爱河 |
| X 12: 钟爱大地(THE MO RUN AIR |
| X 13: 时光流逝 |
| X 14: 卡农 |
| X 15: 舒伯特小夜曲(SERENADE) |
| X 16: 甜蜜的摇篮曲(Sweet Lullaby|
---------------------------
So, I wonder if there is a standard way to do this kind of UTF-8 padding?
When trying to line up ASCII text with Chinese in fixed-width font, there is a set of full width versions of the printable ASCII characters. Below I made a translation table of ASCII to full width version:
Output
It's not overly pretty, but it lines up.
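A minimal sketch of such a translation table (not the answerer's original code), assuming Python 3. It relies on the fact that the full-width forms U+FF01 to U+FF5E mirror printable ASCII U+0021 to U+007E at a fixed offset of 0xFEE0, and that the ideographic space U+3000 is the full-width counterpart of the ASCII space. The names `FULLWIDTH` and `to_fullwidth` are made up for illustration.

```python
# Map printable ASCII (U+0021..U+007E) to the full-width forms
# (U+FF01..U+FF5E); the offset between the two ranges is 0xFEE0.
FULLWIDTH = {code: code + 0xFEE0 for code in range(0x21, 0x7F)}
# The ASCII space maps to the ideographic (full-width) space.
FULLWIDTH[0x20] = 0x3000


def to_fullwidth(text):
    """Replace every printable ASCII character with its full-width form."""
    return text.translate(FULLWIDTH)


for s in ['hello, world', 'hello, 世界', 'hello, 世rld']:
    # Clip to 10 characters and pad with the ideographic space, so every
    # cell between the brackets is one full-width character.
    print('[{0:\u3000<10.10}]'.format(to_fullwidth(s)))
```

Because every character between the brackets is now full width, counting characters and counting columns agree again, at the cost of the wide-looking Latin text.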
There seems to be no official support for this, but a built-in package may help:
The returned value represents the category of the code point. Specifically,
This answer to a similar question provided a quick solution. Note, however, that the displayed result depends on the exact monospaced font used. The default fonts used by IPython and PyDev don't work well, while the Windows console is OK.
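A rough sketch of how this could look, assuming the built-in package meant here is `unicodedata` and the function is `unicodedata.east_asian_width()`, which returns one of the categories `W`, `F`, `Na`, `H`, `N` or `A`. The helper names `char_width`, `display_width` and `fit` are hypothetical. Treating only `W` (wide) and `F` (full-width) as two columns is itself an assumption: ambiguous (`A`) characters and zero-width combining marks are not handled.

```python
import unicodedata


def char_width(ch):
    """Assumed column count: 2 for Wide/Fullwidth characters, otherwise 1."""
    return 2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1


def display_width(text):
    """Total number of terminal columns the text is assumed to occupy."""
    return sum(char_width(ch) for ch in text)


def fit(text, width, fill=' '):
    """Clip text to at most `width` columns, then pad to exactly `width`."""
    kept, used = [], 0
    for ch in text:
        w = char_width(ch)
        if used + w > width:
            break  # the next character would overflow the cell
        kept.append(ch)
        used += w
    return ''.join(kept) + fill * (width - used)


for s in ['hello, world', 'hello, 世界', 'hello, 世rld']:
    print('[' + fit(s, 10) + ']')
```

With these assumptions, the second string is clipped before '界' (which would need columns 10 and 11) and padded with one space, which is the layout asked for in the question.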
Take a look at kitchen. I think it might have what you want.
Firstly, it looks like you're using Python 3, so I'll respond accordingly.
Maybe I'm not understanding your question, but it looks like the output you are getting is exactly what you want, except that Chinese characters are wider in your font.
So UTF-8 is a red herring, since we are not talking about bytes, we are talking about characters. You are in Python 3, so all strings are Unicode. The underlying byte representation (where each of those Chinese characters is represented by three bytes) is irrelevant.
You want to clip or pad each string to exactly 10 characters, and that is working correctly:
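A small check of that claim, using the format string from the question and assuming Python 3, where `len()` counts code points rather than bytes or display columns. The variable name `results` is only for illustration.

```python
results = []
for s in ['hello, world', 'hello, 世界', 'hello, 世rld']:
    # Precision .10 clips to 10 characters; width 10 pads to 10 characters.
    clipped = '{0:<{1}.{1}}'.format(s, 10)
    results.append(clipped)
    print(repr(clipped), len(clipped))
```

Every result has a length of exactly 10 characters, even though the strings containing Chinese characters occupy more than 10 columns on screen.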
The only problem is that you are looking at it with what appears to be a monospaced font, but which actually isn't. Most monospaced fonts have this problem. All the normal Latin characters have exactly the same width in this font, but the Chinese characters are slightly wider. Therefore, the three characters "世界 " take up more horizontal space than the three characters "wor". There isn't much you can do about this, aside from either a) getting a font which is truly monospaced, or b) calculating precisely how wide each character is in your font, and adding a number of spaces which approximately takes you to the same horizontal position (this will never be accurate).
If you are working with English and Chinese characters, maybe this snippet can help you.
Output
Here is a script based on unicodedata for detecting East Asian characters and normalizing them into NFC form to ensure exact half-/full-width matching.
Normalization is required for Korean on macOS because macOS uses NFD forms, in which Korean syllables are decomposed into individual jamo, each of which is counted as a character in Python.
(E.g., "가" is decomposed into two characters while "각" is decomposed into three, whereas both should be counted as double-width.)
It enumerates all files in the given root_path and displays whether the file names are in NFC or NFD forms.
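A rough sketch of such a check (not the original script), assuming Python 3 and only the standard `os` and `unicodedata` modules. The helper names `normal_form` and `report` are hypothetical; `root_path` is the name used in the description above.

```python
import os
import unicodedata


def normal_form(name):
    """Classify a name as 'NFC', 'NFD', 'both' (e.g. plain ASCII) or 'neither'."""
    is_nfc = unicodedata.normalize('NFC', name) == name
    is_nfd = unicodedata.normalize('NFD', name) == name
    if is_nfc and is_nfd:
        return 'both'
    if is_nfc:
        return 'NFC'
    if is_nfd:
        return 'NFD'
    return 'neither'


def report(root_path):
    """Print the normal form of every file name found below root_path."""
    for dirpath, _dirnames, filenames in os.walk(root_path):
        for name in filenames:
            print(normal_form(name), os.path.join(dirpath, name))


# The decomposition described above: one precomposed syllable becomes
# several jamo code points under NFD, and NFC recombines them.
composed = '각'
decomposed = unicodedata.normalize('NFD', composed)
print(len(composed), len(decomposed))
print(unicodedata.normalize('NFC', decomposed) == composed)

# Hypothetical usage: report('.')
```

Normalizing to NFC before measuring width means each Hangul syllable is a single code point again, so a per-character width lookup sees it as one double-width character.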
Here's another option that allows you to keep the original-width Latin characters, as long as your destination (e.g., a terminal) interprets ANSI escapes and displays double-width characters at twice the width of single-width characters.
It works by using two ANSI escapes: first \x1b[nG to move the cursor horizontally to the absolute column n (e.g., \x1b[10G moves to column 10), then \x1b[K to clear from the cursor to the end of the line.
Here's a screenshot of the output in a terminal:
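The escape-sequence approach described above might be sketched like this, assuming Python 3 and a terminal that honors the two escapes. The function name `bracket_ansi` is made up for illustration. Terminal columns are 1-based, so with '[' in column 1 and a 10-column cell, the closing bracket belongs in column 12.

```python
def bracket_ansi(text, width):
    """Wrap text in brackets, letting the terminal do the clipping/padding."""
    # '[' occupies column 1 and the text starts in column 2, so the
    # closing bracket goes in column width + 2.
    close_col = width + 2
    # \x1b[<n>G moves the cursor to absolute column n; \x1b[K then erases
    # from the cursor to the end of the line, removing any overflow.
    return '[' + text + '\x1b[{0}G\x1b[K]'.format(close_col)


for s in ['hello, world', 'hello, 世界', 'hello, 世rld']:
    print(bracket_ansi(s, 10))
```

The script never measures character widths itself; the terminal decides where column 12 is. How a double-width character that straddles the cut-off column is drawn depends on the terminal, and the output is only meaningful on a destination that interprets the escapes.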