Splitting a UTF-8 encoded string at a given byte offset (Python 2.7)

Posted on 2024-12-05 00:00:06

I have a UTF-8 encoded string like this:

bar = "hello ｡◕‿‿◕｡"

and a byte offset that tells me at which byte to split the string:

bytes_offset = 9

How can I split the string bar into two parts, resulting in:

>>> first_part
'hello ｡'  <---- 9 bytes: 'hello \xef\xbd\xa1'
>>> second_part
'◕‿‿◕｡'

In a nutshell:
given a byte offset, how can I convert it into the actual character index position in a UTF-8 encoded string?

驱逐舰岛风号 2024-12-12 00:00:06

In Python 2.x, a UTF-8 encoded str is basically a byte string, so it can be sliced directly at a byte offset.

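# -*- coding: utf-8 -*-

# In Python 2, a plain string literal in a UTF-8 source file is a byte string (str).
bar = "hello ｡◕‿‿◕｡"
assert isinstance(bar, str)

# Slicing a str operates on bytes, so the byte offset can be used as the index.
bytes_offset = 9
first_part = bar[:bytes_offset]
second_part = bar[bytes_offset:]
print first_part
print second_part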

Yields:

hello ｡
◕‿‿◕｡

This is Python 2.6 on OS X, but I expect the same from 2.7. If I split at 10 or 11 instead of 9, I get ? characters in the output, which implies that the slice cut a multibyte character sequence in the middle; splitting at 12 moves the first "eyeball" into the first part of the string.

I have PYTHONIOENCODING set to utf8 in the terminal.
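
To illustrate the point about offsets 10 and 11 landing inside a multibyte character, here is a minimal sketch that moves an offset back to the nearest character boundary. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx. The helper name snap_to_char_boundary is hypothetical and not part of the original answer, and the input is assumed to be valid UTF-8.

```python
# -*- coding: utf-8 -*-

def snap_to_char_boundary(data, offset):
    """Move offset backwards until it no longer points at a UTF-8 continuation byte.

    Hypothetical helper for illustration; assumes data is valid UTF-8.
    """
    data = bytearray(data)  # indexing yields ints on both Python 2 and 3
    while 0 < offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset -= 1
    return offset

raw = u"hello ｡◕‿‿◕｡".encode("utf8")

# Offsets 10 and 11 fall inside the first "eyeball" (bytes 9..11),
# so both are moved back to 9; 9 and 12 are already on a boundary.
safe_offsets = [snap_to_char_boundary(raw, n) for n in (9, 10, 11, 12)]
first_part, second_part = raw[:safe_offsets[1]], raw[safe_offsets[1]:]
```

Snapping backwards is only one possible policy; depending on where the offset comes from, raising an error on a mid-character offset may be the safer choice.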

少跟Wǒ拽 2024-12-12 00:00:06

The character offset is the number of characters that precede the byte offset:

def byte_to_char_offset(b_string, b_offset, encoding='utf8'):
    return len(b_string[:b_offset].decode(encoding))
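
As a usage sketch, the function above can be applied to the string from the question. The variable names text and raw are assumptions made for this example; the string is written as a unicode literal and encoded explicitly so the same code behaves identically on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-

def byte_to_char_offset(b_string, b_offset, encoding='utf8'):
    # Decode only the bytes before the offset and count the resulting characters.
    return len(b_string[:b_offset].decode(encoding))

text = u"hello ｡◕‿‿◕｡"     # unicode string (characters)
raw = text.encode('utf8')   # the same string as UTF-8 bytes

char_offset = byte_to_char_offset(raw, 9)  # 9 bytes correspond to 7 characters

# Split the unicode string at the character index.
first_part = text[:char_offset]
second_part = text[char_offset:]
```

Note that decode raises UnicodeDecodeError if the byte offset falls in the middle of a multibyte character, for example at 10 or 11 for this string.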

About the author: 莫相离
