Split a UTF-8 string regardless of Ruby version

Posted on 2024-12-10 11:24:21 | 148 characters | 1 view | 0 comments

str = "é-du-Marché"

I get the first character via

str.split(//).first

How can I get the rest of the string, regardless of my Ruby version?

Comments (2)

佼人 2024-12-17 11:24:21

String does not have a first method, so you also need a split. When you split in Unicode mode (UTF-8, to be exact), you have access to the first character (and to the others).

My solution:

puts RUBY_VERSION
str = "é-du-Marché"
p str.split(//u, 2)

Test with ruby 1.9.2:

1.9.2
["\u00E9", "-du-March\u00E9"]

Test with ruby 1.8.6:

1.8.6
["\303\251", "-du-March\303\251"]

With first and last you get your results:

  • str.split(//u, 2).first is the first character
  • str.split(//u, 2).last is the string after the first character.
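
The two bullets above can be combined into a single call. The following is a minimal sketch, assuming a Ruby version whose source files default to UTF-8; the variable names first and rest are illustrative and not part of the original answer.

```ruby
str = "é-du-Marché"

# Split on the empty pattern in UTF-8 mode, at most once,
# so the array holds the first character and everything after it.
first, rest = str.split(//u, 2)

puts first
puts rest
```

The limit of 2 keeps the remainder as one string instead of splitting every character.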
毅然前行 2024-12-17 11:24:21

str[1..-1] should normally return everything after the first character.

The first number is the starting index, which is set to 1 to skip the first character. The second is the end index of the range (not a length), which is set to -1 so that Ruby counts from the end of the string.

Note that this only handles multibyte characters correctly in Ruby 1.9. If you wish to mimic this behavior in older versions, you'll have to loop over the bytes yourself and figure out what needs to be removed from the data, because Ruby 1.8 indexes strings by byte.
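
As a rough illustration of the difference between character and byte indexing, here is a sketch for Ruby 1.9.3 or later. It uses byteslice only to imitate the byte-based indexing of Ruby 1.8; that comparison is an assumption made for illustration, not something the answer itself shows.

```ruby
str = "é-du-Marché"

# Range form: from index 1 through the last character (-1).
rest_by_range = str[1..-1]

# Start/length form: start at index 1 and take length - 1 characters.
rest_by_length = str[1, str.length - 1]

puts rest_by_range
puts rest_by_length

# In UTF-8 each "é" occupies two bytes, so the two counts differ.
puts str.length
puts str.bytesize

# Slicing by byte cuts the leading "é" in half and leaves
# a string that is no longer valid UTF-8.
broken = str.byteslice(1..-1)
puts broken.valid_encoding?
```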

UPDATE:

You could try this as well, but I can't guarantee that it will work for every multibyte char:

str = "é-du-Marché"
substring = str.mb_chars[1..-1]

mb_chars returns a proxy object (ActiveSupport::Multibyte::Chars) that directs the call to the appropriate implementation when dealing with UTF-8, UTF-32 or UTF-16 encoding of characters (e.g. multibyte chars).
More detailed info can be found here: http://api.rubyonrails.org/classes/ActiveSupport/Multibyte/Chars.html
I do not know whether this exists in older Rails versions.

UPDATE2:

Ruby 1.8 treats any string as just a bunch of bytes; calling size() on it returns the number of bytes used to store the data. To determine the characters regardless of the encoding, try this:

char_array = str.scan(/./m)
substring = char_array[1..-1].join

This should normally do the trick. Have a look at http://blog.grayproductions.net/articles/bytes_and_characters_in_ruby_18, which explains how to treat multibyte data in older Ruby versions.

EDIT3:

Playing around with the scan and join operations brought me closer to your problem and a solution. I honestly don't have the time to get the full solution working, but if you add the u option, as in scan(/./mu), the pattern treats the string as UTF-8, which is supported by all Ruby versions.
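
One way the scan and join idea might be wrapped up is sketched below. The helper name first_and_rest is hypothetical, and the guard for the empty string is an addition that the answer does not discuss.

```ruby
# Hypothetical helper (not from the answer): returns [first_char, rest].
def first_and_rest(str)
  # With the u option, each match is one UTF-8 character;
  # the m option lets "." match newlines as well.
  chars = str.scan(/./mu)

  # chars[1..-1] is nil for an empty array, so fall back to [].
  [chars.first, (chars[1..-1] || []).join]
end

first, rest = first_and_rest("é-du-Marché")
puts first
puts rest
```

Returning both parts from one scan avoids splitting the string twice.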
