Ruby 1.9: How do I get a byte-index-based slice of a String?

Posted on 2024-08-14 20:05:00

I'm working with UTF-8 strings. I need to get a slice using byte-based indexes, not char-based.

I found references on the web to String#subseq, which is supposed to be like String#[], but for bytes. Alas, it seems not to have made it to 1.9.1.

Now, why would I want to do that? There's a chance I'll end up with an invalid string should I slice in the middle of a multi-byte char. This sounds like a terrible idea.

Well, I'm working with StringScanner, and it turns out its internal pointers are byte-based. I'm open to other options here.

Here's what I'm working with right now, but it's rather verbose:

s.dup.force_encoding("ASCII-8BIT")[ix...pos].force_encoding("UTF-8")

Both ix and pos come from StringScanner, so both are byte offsets.
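
To make the mismatch concrete, here is a minimal sketch of the situation described above: StringScanner#pos counts bytes, so feeding it to String#[] (which counts characters) picks the wrong text. The helper name slice_between_bytes and the sample string are illustrative assumptions, not part of the original post.

```ruby
require 'strscan'

# Hypothetical helper (not from the original post): returns the text between
# two byte offsets of str, tagged with str's own encoding.
# Assumes from is within the string; otherwise the inner slice is nil.
def slice_between_bytes(str, from, to)
  str.dup.force_encoding("ASCII-8BIT")[from...to].force_encoding(str.encoding)
end

s = "h\u00e9llo w\u00f6rld"   # "héllo wörld"; é and ö take two bytes each in UTF-8
scanner = StringScanner.new(s)

ix = scanner.pos              # 0
scanner.scan(/\S+/)           # consumes "héllo"
pos = scanner.pos             # 6: a byte offset, although only 5 characters were read

by_bytes = slice_between_bytes(s, ix, pos)  # "héllo"
by_chars = s[ix...pos]                      # "héllo " (one character too many)
```

The dup matters here: force_encoding changes the receiver in place, so without it the original string would be relabelled as binary.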

请持续率性 2024-08-21 20:05:00

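You can do this too: s.bytes.to_a[ix...pos].pack("C*").force_encoding("UTF-8"), but that looks even more esoteric to me. (The bytes come back as integers, so they have to be packed into a string again; a plain join("") would just concatenate the numbers.)

If you need this in several places, a nicer way to do it could be this:

class String
  # Byte-indexed counterpart of String#slice; returns nil when out of range.
  # Note: Ruby 1.9.3 and later ship a native String#byteslice, which this would override.
  def byteslice(*args)
    piece = dup.force_encoding("ASCII-8BIT").slice(*args)
    piece && piece.force_encoding(encoding)
  end
end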

s.byteslice(ix...pos)

要走干脆点 2024-08-21 20:05:00

Doesn't String#bytes do what you want? It returns an enumerator over the bytes of a string (as numbers, since they might not be valid characters, as you pointed out).

str.bytes.to_a.slice(...)
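
Since the bytes come back as integers, getting a string again takes one more step. The following is a rough sketch of that round trip, which also shows the invalid-string risk raised in the question; slice_via_bytes is a hypothetical name chosen for illustration.

```ruby
# Hypothetical helper: slices the byte array, then packs the integers back
# into a string carrying the original encoding.
def slice_via_bytes(str, range)
  picked = str.bytes.to_a[range]
  return nil if picked.nil?              # range starts past the end
  picked.pack("C*").force_encoding(str.encoding)
end

word  = "na\u00efve"                     # "naïve"; ï occupies bytes 2 and 3
whole = slice_via_bytes(word, 0...4)     # "naï"
split = slice_via_bytes(word, 0...3)     # stops halfway through ï

whole.valid_encoding?                    # true
split.valid_encoding?                    # false
```

Checking valid_encoding? on the result is a cheap way to detect a cut that landed inside a multi-byte character.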

女皇必胜 2024-08-21 20:05:00

Use this monkeypatch until String#byteslice() is added to Ruby 1.9.

class String
  unless method_defined? :byteslice
    ##
    # Does the same thing as String#slice but
    # operates on bytes instead of characters.
    #
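    def byteslice(*args)
      picked = unpack('C*').slice(*args)
      # Array() covers the single-index form, which yields one Integer;
      # nil (out of range) is passed through, as String#slice would.
      picked && Array(picked).pack('C*').force_encoding(encoding)
    end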
  end
end
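
For reference, String#byteslice is part of core Ruby from 1.9.3 onward, so on those versions the method_defined? guard above leaves the native method in place. Below is a small sketch of the native call combined with a validity check; checked_byteslice is a hypothetical wrapper, not a standard method.

```ruby
# Hypothetical wrapper: returns the byte slice only when it is well-formed
# in the string's encoding, and nil otherwise.
def checked_byteslice(str, *args)
  piece = str.byteslice(*args)
  piece if piece && piece.valid_encoding?
end

text = "caf\u00e9 au lait"           # "café au lait"; é occupies bytes 3 and 4

checked_byteslice(text, 0, 5)        # "café" (start, length form)
checked_byteslice(text, 0...5)       # "café" (range form)
checked_byteslice(text, 0...4)       # nil: the cut lands inside é
checked_byteslice(text, 99, 1)       # nil: offset past the end
```

Returning nil for a broken slice is only one possible policy; raising an error or widening the range to the next character boundary would be equally reasonable, depending on the caller.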
