Truncating unicode so that it fits a maximum size when encoded for wire transfer
Given a Unicode string and these requirements:
- The string is encoded into some byte-sequence format (e.g. UTF-8 or JSON unicode escape)
- The encoded string has a maximum length
For example, the iPhone push service requires JSON encoding with a maximum total packet size of 256 bytes.
What is the best way to truncate the string so that it re-encodes to valid Unicode and displays reasonably correctly?
(Human language comprehension is not necessary—the truncated version can look odd e.g. for an orphaned combining character or a Thai vowel, just as long as the software doesn't crash when handling the data.)
See Also:
- Related Java question: How do I truncate a java string to fit in a given number of bytes, once UTF-8 encoded?
- Related Javascript question: Using JavaScript to truncate text to a certain size
Here is an example of a Unicode string where each character is represented with 2 bytes in UTF-8, and that would have crashed if the split Unicode code point wasn't ignored.
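The sample string itself did not survive extraction; a minimal sketch of the same failure mode, assuming Cyrillic text (each letter is 2 bytes in UTF-8):

```python
# Each Cyrillic letter encodes to 2 bytes in UTF-8.
data = "абвгд".encode("utf-8")   # 10 bytes total
chunk = data[:5]                 # the cut lands inside the third letter

try:
    chunk.decode("utf-8")        # strict decoding raises UnicodeDecodeError
except UnicodeDecodeError as exc:
    print("crash avoided:", exc.reason)

# Ignoring the split code point recovers the intact prefix.
print(chunk.decode("utf-8", errors="ignore"))  # -> аб
```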
One of UTF-8's properties is that it is easy to resync, that is, to find the Unicode character boundaries in the encoded bytestream. All you need to do is cut the encoded string at max length, then walk backwards from the end removing any bytes that are > 127 -- those are part of, or the start of, a multibyte character.
As written now, this is too simple -- it will erase back to the last ASCII char, possibly the whole string. What we need to do instead is check only for a truncated sequence at the cut: a dangling two-byte lead (starting with 110yyyxx), three-byte lead (1110yyyy) or four-byte lead (11110zzz). Optimization should not be an issue -- regardless of length, we only check the last 1-4 bytes.
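The answer's original Python 2.6 code did not survive extraction; a Python 3 sketch of the check it describes, which backs up only over the truncated sequence at the cut point rather than over every high byte:

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut a UTF-8 byte string at max_bytes without splitting a code point."""
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # If the first dropped byte is a continuation byte (0b10xxxxxx), the cut
    # landed inside a multibyte sequence: back up past its continuation bytes
    # to the lead byte (110yyyxx, 1110yyyy or 11110zzz) and cut before it.
    # This loops at most 3 times, so we only ever look at the last 1-4 bytes.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]

print(truncate_utf8("абвгд".encode("utf-8"), 5).decode("utf-8"))  # -> аб
```

Complete trailing sequences are left alone: only when the byte just past the cut is a continuation byte does the function shorten the result.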
This will do for UTF-8, if you like to do it with a regex.
It covers from U+0080 (2 bytes) to U+10FFFF (4 bytes) in UTF-8 strings, and it is really straightforward, just like the UTF-8 algorithm itself.
From U+0080 to U+07FF, 2 bytes are needed: 110yyyxx 10xxxxxx. This means that if you see only one byte at the end like 110yyyxx (0b11000000 to 0b11011111), i.e. [\xc0-\xdf], it is a partial one.
From U+0800 to U+FFFF, 3 bytes are needed: 1110yyyy 10yyyyxx 10xxxxxx. If you see only 1 or 2 of those bytes at the end, it is a partial one. It will match this pattern: [\xe0-\xef][\x80-\xbf]{0,1}
From U+10000 to U+10FFFF, 4 bytes are needed: 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx. If you see only 1 to 3 of those bytes at the end, it is a partial one. It will match this pattern: [\xf0-\xf7][\x80-\xbf]{0,2} (note the lead-byte range is \xf0-\xf7, since 11110zzz runs from 0b11110000 to 0b11110111).
Update:
If you only need the Basic Multilingual Plane, you can drop the last pattern.
Let me know if there is any problem with that regex.
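A sketch of the three patterns applied in Python, stripping a partial trailing sequence after a byte-level cut (the helper name `regex_truncate` is mine, and the four-byte lead range is taken as \xf0-\xf7):

```python
import re

# Alternation of the three "partial sequence at end of string" patterns.
PARTIAL = re.compile(
    rb"[\xc0-\xdf]$"                   # lone lead byte of a 2-byte sequence
    rb"|[\xe0-\xef][\x80-\xbf]?$"      # 1-2 bytes of a 3-byte sequence
    rb"|[\xf0-\xf7][\x80-\xbf]{0,2}$"  # 1-3 bytes of a 4-byte sequence
)

def regex_truncate(data: bytes, max_bytes: int) -> bytes:
    cut = data[:max_bytes]
    return PARTIAL.sub(b"", cut)       # drop the partial sequence, if any

print(regex_truncate("абвгд".encode("utf-8"), 5).decode("utf-8"))  # -> аб
```

A complete multibyte character at the end never matches, because its lead byte sits one position too early for each pattern; only a truncated tail is removed.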
For JSON formatting (unicode escape, e.g. \uabcd), I am cutting the string before escaping, sized so that the escaped encoding fits. So (in Python 2.5), with some_string and a requirement to cut to around 100 bytes: after the cut, final_string is back in Unicode but guaranteed to fit within the JSON packet later. I truncated to 103 because a purely-Unicode message would be 102 bytes encoded (each \uXXXX escape takes 6 bytes, so 17 characters).
Disclaimer: Only tested on the Basic Multilingual Plane. Yeah yeah, I know.
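The Python 2.5 code blocks were lost in extraction; a minimal Python 3 sketch of the same idea. The names some_string and final_string follow the answer, cut_for_json is a hypothetical helper, and note that json.dumps with ensure_ascii=True counts the two surrounding quote characters as well as the \uXXXX escapes:

```python
import json

def cut_for_json(s: str, max_bytes: int) -> str:
    """Drop characters from the end until the escaped JSON form fits."""
    while len(json.dumps(s, ensure_ascii=True).encode("ascii")) > max_bytes:
        s = s[:-1]
    return s

some_string = "\u0e2a\u0e27" * 20             # 40 Thai characters
final_string = cut_for_json(some_string, 103)
# Each non-ASCII character escapes to 6 bytes, plus 2 bytes of quotes,
# so at most (103 - 2) // 6 = 16 characters survive.
print(len(final_string), len(json.dumps(final_string)))  # -> 16 98
```

Shrinking one character per iteration is quadratic in the worst case, but for packet-sized strings that hardly matters; a production version could jump straight to max_bytes // 6 characters and adjust from there.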
Check the last byte of the string. If it is a continuation byte (high bits 10), then
it is not the last byte of a UTF-8 character, so back up and try again
until you find one that is.