How to control padding of Unicode strings containing East Asian characters

Posted 2024-10-10 23:09:44

I have three UTF-8 strings:

hello, world
hello, 世界
hello, 世rld

I only want the first 10 ASCII character widths of each string, so that the closing brackets line up in one column:

[hello, wor]
[hello, 世 ]
[hello, 世r]

In console:

width('世界')==width('worl')
width('世 ')==width('wor')  # note the space after '世'

One Chinese character takes three bytes in UTF-8, but it is only 2 ASCII characters wide when displayed in the console:

>>> bytes("hello, 世界", encoding='utf-8')
b'hello, \xe4\xb8\x96\xe7\x95\x8c'

Python's format() doesn't help when multi-byte UTF-8 characters are mixed in:

>>> for s in ['[{0:<{1}.{1}}]'.format(s, 10) for s in ['hello, world', 'hello, 世界', 'hello, 世rld']]:
...    print(s)
...
[hello, wor]
[hello, 世界 ]
[hello, 世rl]

It's not pretty:

 -----------Songs-----------
|    1: 蝴蝶                  |
|    2: 心之城                 |
|    3: 支持你的爱人              |
|    4: 根生的种子               |
|    5: 鸽子歌(CUCURRUCUCU PALO|
|    6: 林地之间                |
|    7: 蓝光                  |
|    8: 在你眼里                |
|    9: 肖邦离别曲               |
|   10: 西行( 魔戒王者再临主题曲)(INTO |
| X 11: 深陷爱河                |
| X 12: 钟爱大地(THE MO RUN AIR |
| X 13: 时光流逝                |
| X 14: 卡农                  |
| X 15: 舒伯特小夜曲(SERENADE)    |
| X 16: 甜蜜的摇篮曲(Sweet Lullaby|
 ---------------------------

So, I wonder if there is a standard way to do this kind of padding with UTF-8 text?

浊酒尽余欢 2024-10-17 23:09:44

When trying to line up ASCII text with Chinese in a fixed-width font, you can use the full-width versions of the printable ASCII characters. Below I build a translation table from ASCII to the full-width versions:

# coding: utf8

# full width versions (SPACE is non-contiguous with ! through ~)
SPACE = '\N{IDEOGRAPHIC SPACE}'
EXCLA = '\N{FULLWIDTH EXCLAMATION MARK}'
TILDE = '\N{FULLWIDTH TILDE}'

# strings of ASCII and full-width characters (same order);
# the +1 makes each range include its last character ('~' and its full-width form)
west = ''.join(chr(i) for i in range(ord(' '), ord('~') + 1))
east = SPACE + ''.join(chr(i) for i in range(ord(EXCLA), ord(TILDE) + 1))

# build the translation table
full = str.maketrans(west,east)

data = '''\
蝴蝶(A song)
心之城(Another song)
支持你的爱人(Yet another song)
根生的种子
鸽子歌(Cucurrucucu palo whatever)
林地之间
蓝光
在你眼里
肖邦离别曲
西行（魔戒王者再临主题曲）(Into something)
深陷爱河
钟爱大地
时光流逝
卡农
舒伯特小夜曲(SERENADE)
甜蜜的摇篮曲(Sweet Lullaby)
'''

# Replace the ASCII characters with full width, and create a song list.
data = data.translate(full).rstrip().split('\n')

# translate each printable line.
print(' ----------Songs-----------'.translate(full))
for i,song in enumerate(data):
    line = '|{:4}: {:20.20}|'.format(i+1,song)
    print(line.translate(full))
print(' --------------------------'.translate(full))

Output

　－－－－－－－－－－Ｓｏｎｇｓ－－－－－－－－－－－
｜　　　１：　蝴蝶（Ａ　ｓｏｎｇ）　　　　　　　　　　｜
｜　　　２：　心之城（Ａｎｏｔｈｅｒ　ｓｏｎｇ）　　　｜
｜　　　３：　支持你的爱人（Ｙｅｔ　ａｎｏｔｈｅｒ　ｓ｜
｜　　　４：　根生的种子　　　　　　　　　　　　　　　｜
｜　　　５：　鸽子歌（Ｃｕｃｕｒｒｕｃｕｃｕ　ｐａｌｏ｜
｜　　　６：　林地之间　　　　　　　　　　　　　　　　｜
｜　　　７：　蓝光　　　　　　　　　　　　　　　　　　｜
｜　　　８：　在你眼里　　　　　　　　　　　　　　　　｜
｜　　　９：　肖邦离别曲　　　　　　　　　　　　　　　｜
｜　　１０：　西行（魔戒王者再临主题曲）（Ｉｎｔｏ　ｓ｜
｜　　１１：　深陷爱河　　　　　　　　　　　　　　　　｜
｜　　１２：　钟爱大地　　　　　　　　　　　　　　　　｜
｜　　１３：　时光流逝　　　　　　　　　　　　　　　　｜
｜　　１４：　卡农　　　　　　　　　　　　　　　　　　｜
｜　　１５：　舒伯特小夜曲（ＳＥＲＥＮＡＤＥ）　　　　｜
｜　　１６：　甜蜜的摇篮曲（Ｓｗｅｅｔ　Ｌｕｌｌａｂｙ｜
　－－－－－－－－－－－－－－－－－－－－－－－－－－

It's not overly pretty, but it lines up.

も星光 2024-10-17 23:09:44

There seems to be no official support for this, but a module in the standard library may help:

>>> import unicodedata
>>> print(unicodedata.east_asian_width('中'))
W

The returned value is the East Asian Width category of the code point. Specifically,

W - East Asian Wide
F - East Asian Full-width (of narrow)
Na - East Asian Narrow
H - East Asian Half-width (of wide)
A - East Asian Ambiguous
N - Not East Asian

This answer to a similar question provided a quick solution. Note, however, that the displayed result depends on the exact monospaced font used. The default fonts used by IPython and PyDev don't work well, while the Windows console is OK.
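
To make this concrete, here is a minimal sketch (not the code from the linked answer) of trimming and padding by display columns instead of by character count. It assumes a terminal that renders W and F characters two columns wide and everything else one column wide. The names char_columns and fit_to_columns are made up for illustration, and combining marks and Ambiguous characters get no special handling.

```python
import unicodedata


def char_columns(ch):
    """Assumed column count: 2 for Wide/Fullwidth characters, otherwise 1."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def fit_to_columns(text, columns):
    """Cut text to at most `columns` display columns, then pad with spaces."""
    kept = []
    used = 0
    for ch in text:
        w = char_columns(ch)
        if used + w > columns:
            # The next character would overflow, so stop before it.
            break
        kept.append(ch)
        used += w
    # Fill whatever is left (e.g. one column when a wide character did not fit).
    return "".join(kept) + " " * (columns - used)


for s in ["hello, world", "hello, 世界", "hello, 世rld"]:
    print("[" + fit_to_columns(s, 10) + "]")
```

With the three strings from the question this produces the bracketed lines the question asks for, with one space after '世' in the second line.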

混浊又暗下来 2024-10-17 23:09:44

Take a look at kitchen. I think it might have what you want.

伤痕我心 2024-10-17 23:09:44

Firstly, it looks like you're using Python 3, so I'll respond accordingly.

Maybe I'm not understanding your question, but it looks like the output you are getting is exactly what you want, except that Chinese characters are wider in your font.

So UTF-8 is a red herring, since we are not talking about bytes, we are talking about characters. You are in Python 3, so all strings are Unicode. The underlying byte representation (where each of those Chinese characters is represented by three bytes) is irrelevant.
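
A quick sketch of that difference, using one of the strings from the question (the variable names are just for illustration): len() counts characters (code points), while the UTF-8 encoding is longer because each of the two Chinese characters takes three bytes.

```python
s = "hello, 世界"

char_count = len(s)                  # number of characters (code points)
byte_count = len(s.encode("utf-8"))  # number of bytes in the UTF-8 encoding

print(char_count)  # 9
print(byte_count)  # 13
```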

You want to clip or pad each string to exactly 10 characters, and that is working correctly:

>>> len('hello, wor')
10
>>> len('hello, 世界 ')
10
>>> len('hello, 世rl')
10

The only problem is that you are looking at it with what appears to be a monospaced font, but which actually isn't. Most monospaced fonts have this problem. All the normal Latin characters have exactly the same width in this font, but the Chinese characters are slightly wider. Therefore, the three characters "世界 " take up more horizontal space than the three characters "wor". There isn't much you can do about this, aside from either a) getting a font which is truly monospaced, or b) calculating precisely how wide each character is in your font, and adding a number of spaces which approximately takes you to the same horizontal position (this will never be accurate).

我最亲爱的 2024-10-17 23:09:44

If you are working with English and Chinese characters, maybe this snippet can help you.

data = '''\
蝴蝶(A song)
心之城(Another song)
支持你的爱人(Yet another song)
根生的种子
鸽子歌(Cucurrucucu palo whatever)
林地之间
蓝光
在你眼里
肖邦离别曲
西行（魔戒王者再临主题曲）(Into something)
深陷爱河
钟爱大地
时光流逝
卡农
舒伯特小夜曲(SERENADE)
甜蜜的摇篮曲(Sweet Lullaby)'''

width = 80

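def get_aligned_string(string,width):
    # pad with spaces to `width` characters
    string = "{:{width}}".format(string,width=width)
    # each Chinese character here takes 3 bytes in UTF-8, so cutting at `width`
    # bytes drops 2 trailing spaces per Chinese character (if the padding is long enough)
    bts = bytes(string,'utf-8')
    # backslashreplace keeps the decode from failing if the cut splits a character
    string = str(bts[0:width],encoding='utf-8',errors='backslashreplace')
    # add back one space per Chinese character, since each occupies 2 columns
    new_width = len(string) + int((width - len(string))/2)
    if new_width!=0:
        string = '{:{width}}'.format(str(string),width=new_width)
    return string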

for i,line in enumerate(data.split('\n')):
    song = get_aligned_string(line,width)
    line = '|{:4}: {:}|'.format(i+1,song)
    print(line)

梦途 2024-10-17 23:09:44

Here is a script based on unicodedata that detects East Asian characters and normalizes strings to NFC form to ensure exact half-width/full-width matching.
Normalization is required for Korean on macOS because macOS uses NFD for file names, so Hangul syllables are decomposed into individual jamo, each of which Python counts as a separate character.
(For example, "가" is decomposed into two characters and "각" into three, while each should be counted as a single double-width character.)
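
A small sketch of the effect described above, using only the standard library (the variable names are illustrative): the same Hangul syllable has a different len() depending on the normalization form.

```python
import unicodedata

syllable = "각"
nfc = unicodedata.normalize("NFC", syllable)  # one precomposed syllable
nfd = unicodedata.normalize("NFD", syllable)  # decomposed into individual jamo

print(len(nfc))  # 1
print(len(nfd))  # 3

# Normalizing back to NFC restores the single character.
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
```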

The script enumerates all files in the given root_path and displays whether each file name is in NFC or NFD form.

#! /usr/bin/env python3
import unicodedata
from pathlib import Path


def len_ea(string: str) -> int:
    nfc_string = unicodedata.normalize('NFC', string)
    return sum((2 if unicodedata.east_asian_width(c) in 'WF' else 1) for c in nfc_string)


def align_string(string: str, width: int):
    nfc_string = unicodedata.normalize('NFC', string)
    num_wide_chars = sum(1 for c in nfc_string if unicodedata.east_asian_width(c) in 'WF')
    width = width - num_wide_chars
    return '{:{width}}'.format(nfc_string, width=width)


def show_filename_encodings(root_path: Path):
    outputs = []
    for p in root_path.glob("*"):
        nfc_name = unicodedata.normalize('NFC', p.name)
        nfd_name = unicodedata.normalize('NFD', p.name)
        if p.name == nfc_name:
            enc = "\033[94mNFC\033[0m"
        elif p.name == nfd_name:
            enc = "\033[91mNFD\033[0m"
        else:
            # neither fully composed nor fully decomposed
            enc = "mixed"
        outputs.append((p.name, nfc_name, nfd_name, enc))

    if not outputs:
        # empty directory: nothing to print, and max() would fail on no items
        return

    # Take the NFC string to check the maximum length
    colw = max(len_ea(o[1]) for o in outputs) + 2
    for name, nfc_name, nfd_name, enc in outputs:
        print(f"{align_string(nfc_name, colw)}: {enc}")

场罚期间 2024-10-17 23:09:44

Here's another option that lets you keep the Latin characters at their original width, as long as your destination (e.g., a terminal) interprets ANSI escapes and displays double-width characters at twice the width of single-width characters.

It works by using two ANSI escapes: first \x1b[nG to move the cursor horizontally to the absolute column n (e.g., \x1b[10G moves to column 10), then \x1b[K to clear from the cursor to the end of the line.

data = '''\
蝴蝶(A song)
心之城(Another song)
支持你的爱人(Yet another song)
根生的种子
鸽子歌(Cucurrucucu palo whatever)
林地之间
蓝光
在你眼里
肖邦离别曲
西行（魔戒王者再临主题曲）(Into something)
深陷爱河
钟爱大地
时光流逝
卡农
舒伯特小夜曲(SERENADE)
甜蜜的摇篮曲(Sweet Lullaby)
'''

width = 40
title = "Songs"

move_to_column = f"\x1b[{width+2}G"  # +2 for borders
clear_line = "\x1b[K"  # clears from cursor to end of line

print(f" {title:-^{width}}")
for i, line in enumerate(data.splitlines(), 1):
    print(f"|{i:>5}: {line}{move_to_column}{clear_line}|")
print(" " + "-" * width)
