Truncating unicode so it fits a maximum size when encoded for wire transfer

Given a Unicode string and these requirements:

  • The string is encoded into some byte-sequence format (e.g. UTF-8 or JSON unicode escape)
  • The encoded string has a maximum length

For example, the iPhone push service requires JSON encoding with a maximum total packet size of 256 bytes.

What is the best way to truncate the string so that it re-encodes to valid Unicode and displays reasonably correctly?

(Human language comprehension is not necessary—the truncated version can look odd e.g. for an orphaned combining character or a Thai vowel, just as long as the software doesn't crash when handling the data.)

已下线请稍等 2024-08-19 21:39:02
def unicode_truncate(s, length, encoding='utf-8'):
    # Slice the encoded bytes at the limit, then drop any trailing
    # partial character by decoding with errors='ignore'.
    encoded = s.encode(encoding)[:length]
    return encoded.decode(encoding, 'ignore')

Here is an example with a Unicode string where each character is represented by 2 bytes in UTF-8, and which would have crashed if the split Unicode code point had not been ignored:

>>> unicode_truncate(u'абвгд', 5)
u'\u0430\u0431'
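
The same idea carries over to Python 3, where the slice is taken on the encoded bytes. A minimal sketch, not part of the original answer:

def unicode_truncate(s, max_bytes, encoding='utf-8'):
    """Truncate s so that its encoded form fits within max_bytes."""
    encoded = s.encode(encoding)[:max_bytes]
    # errors='ignore' silently drops the trailing partial character.
    return encoded.decode(encoding, errors='ignore')

assert unicode_truncate('абвгд', 5) == 'аб'

Note that errors='ignore' swallows invalid bytes anywhere in the input, not just a split tail, so this assumes the original string encodes cleanly.
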
燃情 2024-08-19 21:39:02

One of UTF-8's properties is that it is easy to resync, that is, to find the Unicode character boundaries easily in the encoded byte stream. All you need to do is cut the encoded string at the maximum length, then walk backwards from the end removing any bytes that are > 127 -- those are part of, or the start of, a multibyte character.

As written, that is too simple -- it will erase back to the last ASCII character, possibly the whole string. What we need to do instead is check that no truncated two-byte (starting with 110yyyxx), three-byte (1110yyyy) or four-byte (11110zzz) sequence is left at the cut point.

Here is a Python 2.6 implementation in clear code. Optimization should not be an issue -- regardless of length, we only check the last 1-4 bytes.

# coding: UTF-8

def decodeok(bytestr):
    try:
        bytestr.decode("UTF-8")
    except UnicodeDecodeError:
        return False
    return True

def is_first_byte(byte):
    """Return whether the UTF-8 byte is not a 0b10xxxxxx continuation
    byte, i.e. whether it can be the first byte of an encoded character."""
    o = ord(byte)
    return (o & 0b11000000) != 0b10000000

def truncate_utf8(bytestr, maxlen):
    u"""

    >>> us = u"ウィキペディアにようこそ"
    >>> s = us.encode("UTF-8")

    >>> trunc20 = truncate_utf8(s, 20)
    >>> print trunc20.decode("UTF-8")
    ウィキペディ
    >>> len(trunc20)
    18

    >>> trunc21 = truncate_utf8(s, 21)
    >>> print trunc21.decode("UTF-8")
    ウィキペディア
    >>> len(trunc21)
    21
    """
    L = maxlen
    # Only the last 1-4 bytes can hold a truncated multibyte sequence;
    # find the start byte of such a sequence and cut before it.
    for x in xrange(1, 5):
        if is_first_byte(bytestr[L-x]) and not decodeok(bytestr[L-x:L]):
            return bytestr[:L-x]
    return bytestr[:L]

if __name__ == '__main__':
    # unicode doctest hack
    import sys
    reload(sys)
    sys.setdefaultencoding("UTF-8")
    import doctest
    doctest.testmod()
鹊巢 2024-08-19 21:39:02

This will do for UTF-8, if you like to do it with a regex.

import re

partial="\xc2\x80\xc2\x80\xc2"

re.sub("([\xf6-\xf7][\x80-\xbf]{0,2}|[\xe0-\xef][\x80-\xbf]{0,1}|[\xc0-\xdf])$","",partial)

"\xc2\x80\xc2\x80"

It covers UTF-8 strings from U+0080 (2 bytes) up to U+10FFFF (4 bytes).

It's really straightforward, just like the UTF-8 algorithm itself.

From U+0080 to U+07FF, 2 bytes are needed: 110yyyxx 10xxxxxx.
That means if you see only one byte like 110yyyxx (0b11000000 to 0b11011111) at the end,
i.e. [\xc0-\xdf], it is a partial one.

From U+0800 to U+FFFF, 3 bytes are needed: 1110yyyy 10yyyyxx 10xxxxxx.
If you see only 1 or 2 of those bytes at the end, it is a partial one.
It will match this pattern: [\xe0-\xef][\x80-\xbf]{0,1}

From U+10000 to U+10FFFF, 4 bytes are needed: 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx.
If you see only 1 to 3 of those bytes at the end, it is a partial one.
It will match this pattern: [\xf0-\xf7][\x80-\xbf]{0,2}

Update:

If you only need the Basic Multilingual Plane, you can drop the last pattern. This will do:

re.sub("([\xe0-\xef][\x80-\xbf]{0,1}|[\xc0-\xdf])$","",partial)

Let me know if there is any problem with that regex.
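
For reference, here is a small wrapper combining the initial byte slice with the cleanup above, with the four-byte lead range written as [\xf0-\xf7] to match the 11110zzz pattern. This is a sketch in Python 2, and truncate_utf8_partial is a hypothetical name, not from the original answer:

import re

# Matches a truncated 2-, 3- or 4-byte UTF-8 sequence at the end.
PARTIAL_UTF8 = re.compile(
    "([\xf0-\xf7][\x80-\xbf]{0,2}"
    "|[\xe0-\xef][\x80-\xbf]{0,1}"
    "|[\xc0-\xdf])$")

def truncate_utf8_partial(encoded, maxlen):
    """Slice a UTF-8 byte string, then strip any partial character."""
    return PARTIAL_UTF8.sub("", encoded[:maxlen])

assert truncate_utf8_partial("\xc2\x80\xc2\x80\xc2\x80", 5) == "\xc2\x80\xc2\x80"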

东北女汉子 2024-08-19 21:39:02

For JSON formatting (unicode escape, e.g. \uabcd), I am using the following algorithm to achieve this:

  • Encode the Unicode string into the backslash-escape format which it would eventually be in the JSON version
  • Truncate 3 bytes more than my final limit
  • Use a regular expression to detect and chop off a partial encoding of a Unicode value

So (in Python 2.5), with some_string and a requirement to cut to around 100 bytes:

# Given some_string is a long string with arbitrary Unicode data.
encoded_string = some_string.encode('unicode_escape')
partial_string = re.sub(r'([^\\])\\(u|$)[0-9a-f]{0,3}$', r'\1', encoded_string[:103])
final_string = partial_string.decode('unicode_escape')

Now final_string is back in Unicode but guaranteed to fit within the JSON packet later. I truncated to 103 because a purely-Unicode message would be 102 bytes encoded.

Disclaimer: Only tested on the Basic Multilingual Plane. Yeah yeah, I know.
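
A quick sanity check of that regex on a tail cut in the middle of an escape (a hypothetical example, not from the original answer):

import re

encoded = r'\u0430\u0431\u0432'[:16]   # ends with the partial escape '\u04'
cleaned = re.sub(r'([^\\])\\(u|$)[0-9a-f]{0,3}$', r'\1', encoded)
assert cleaned == r'\u0430\u0431'
assert cleaned.decode('unicode_escape') == u'\u0430\u0431'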

笑红尘 2024-08-19 21:39:02

Check the byte just past the cut point in the encoded string. If its top two
bits are 10, it is a continuation byte rather than the first byte of a UTF-8
character, so back up and try again until the cut lands on a character boundary.

encoded = toolong.encode("utf8")
mxlen = 255

# Back up while the byte at the cut point is a continuation byte
# (0b10xxxxxx), so the slice ends on a character boundary.
# ord() is needed because indexing a Python 2 str yields a 1-char str.
while mxlen < len(encoded) and ord(encoded[mxlen]) & 0xc0 == 0x80:
    mxlen -= 1

truncated_string = encoded[:mxlen].decode("utf8")
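
A quick check of the loop above with a cut that lands mid-character (a hypothetical example, using mxlen = 256 instead of 255):

toolong = u"\u30a6" * 90               # 90 chars, 3 UTF-8 bytes each = 270 bytes
encoded = toolong.encode("utf8")
mxlen = 256                            # one byte into the 86th character
while mxlen < len(encoded) and ord(encoded[mxlen]) & 0xc0 == 0x80:
    mxlen -= 1
truncated_string = encoded[:mxlen].decode("utf8")
assert mxlen == 255 and len(truncated_string) == 85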