Ruby 1.9: how to get a byte-index-based slice of a string?
I'm working with UTF-8 strings. I need to get a slice using byte-based indexes, not char-based.
I found references on the web to String#subseq, which is supposed to be like String#[], but for bytes. Alas, it seems not to have made it to 1.9.1.
Now, why would I want to do that? There's a chance I'll end up with an invalid string should I slice in the middle of a multi-byte char. This sounds like a terrible idea.
Well, I'm working with StringScanner, and it turns out its internal pointers are byte-based. I accept other options here.
Here's what I'm working with right now, but it's rather verbose:
s.dup.force_encoding("ASCII-8BIT")[ix...pos].force_encoding("UTF-8")
Both ix and pos come from StringScanner, so are byte-based.
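To make the mismatch concrete, here is a small sketch showing that StringScanner#pos counts bytes while String#[] counts characters. The sample string and the variable names char_slice and byte_slice are illustrative assumptions, not part of the original question.

```ruby
require "strscan"

# Illustrative sample: "\u00E9" (e with acute accent) takes two bytes in UTF-8.
s = "h\u00E9llo w\u00F6rld"
scanner = StringScanner.new(s)

ix = scanner.pos       # byte offset before the match
scanner.scan(/\S+/)    # consumes the first word (5 characters, 6 bytes)
pos = scanner.pos      # byte offset after the match

# Character-based slicing with byte offsets takes one character too many.
char_slice = s[ix...pos]

# The workaround from the question: slice a binary-tagged copy, then
# tag the result as UTF-8 again.
byte_slice = s.dup.force_encoding("ASCII-8BIT")[ix...pos].force_encoding("UTF-8")
```

Here char_slice picks up the trailing space because the byte offset is larger than the character count, whereas byte_slice is exactly the scanned word.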
Comments (3)
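You can do this too: s.bytes.to_a[ix...pos].pack("C*").force_encoding("UTF-8"), but that looks even more esoteric to me. (Note that joining the byte array with join("") would concatenate the decimal byte values rather than rebuild the string, hence pack.) If you're calling the line several times, a nicer way to do it could be this: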
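The snippet that followed is not preserved in this copy. As a rough sketch of what such a reusable wrapper might look like, assuming a hypothetical helper named byte_slice (the name and signature are illustrative, not from the original answer):

```ruby
# Hypothetical helper; the name byte_slice is an assumption for illustration.
def byte_slice(str, range)
  # Work on a binary-tagged copy so that indexes count bytes, not characters.
  bytes = str.dup.force_encoding("ASCII-8BIT")[range]
  # Restore the source encoding. The result can be an invalid string if the
  # range cuts through a multi-byte character; nil is passed through as-is.
  bytes && bytes.force_encoding(str.encoding)
end
```

A call such as byte_slice(s, ix...pos) then replaces the verbose one-liner, and the caller can check valid_encoding? on the result if a slice might land mid-character.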
Doesn't String#bytes do what you want? It returns an enumerator to the bytes in a string (as numbers, since they might not be valid characters, as you pointed out)
Use this monkeypatch until String#byteslice() is added to Ruby 1.9.
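The patch itself is not preserved in this copy. A minimal sketch of such a monkeypatch, built on the force_encoding approach from the question, could look like the following. The guard is an added assumption: String#byteslice is built in from Ruby 1.9.3 onward, so the fallback is only defined on older interpreters.

```ruby
# Define the fallback only when the running Ruby lacks a native
# String#byteslice (it is built in from Ruby 1.9.3 onward).
unless String.method_defined?(:byteslice)
  class String
    # Accepts the same argument shapes as String#slice (index and length,
    # or a range), interpreted as byte offsets.
    def byteslice(*args)
      result = dup.force_encoding("ASCII-8BIT").slice(*args)
      # Re-tag with the receiver's encoding; nil is passed through.
      result && result.force_encoding(encoding)
    end
  end
end
```

With this in place, the expression from the question becomes s.byteslice(ix...pos).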