Ruby 1.9:输入编码未知的正则表达式
Ruby 1.9 中是否有一种可接受的方法来处理输入编码未知的正则表达式?假设我的输入恰好是 UTF-16 编码的:
x = "foo<p>bar</p>baz"
y = x.encode('UTF-16LE')
re = /<p>(.*)<\/p>/
x.match(re)
=> #<MatchData "<p>bar</p>" 1:"bar">
y.match(re)
Encoding::CompatibilityError: incompatible encoding regexp match (US-ASCII regexp with UTF-16LE string)
我当前的方法是在内部使用 UTF-8,并在必要时重新编码(副本)输入:
if y.methods.include?(:encode) # Ruby 1.8 compatibility
if y.encoding.name != 'UTF-8'
y = y.encode('UTF-8')
end
end
y.match(/<p>(.*)<\/p>/u)
=> #<MatchData "<p>bar</p>" 1:"bar">
但是,这对我来说有点尴尬,我想询问是否有更好的方法。
Is there an accepted way to deal with regular expressions in Ruby 1.9 for which the encoding of the input is unknown? Let's say my input happens to be UTF-16 encoded:
x = "foo<p>bar</p>baz"
y = x.encode('UTF-16LE')
re = /<p>(.*)<\/p>/
x.match(re)
=> #<MatchData "<p>bar</p>" 1:"bar">
y.match(re)
Encoding::CompatibilityError: incompatible encoding regexp match (US-ASCII regexp with UTF-16LE string)
My current approach is to use UTF-8 internally and re-encode (a copy of) the input if necessary:
if y.methods.include?(:encode) # Ruby 1.8 compatibility
if y.encoding.name != 'UTF-8'
y = y.encode('UTF-8')
end
end
y.match(/<p>(.*)<\/p>/u)
=> #<MatchData "<p>bar</p>" 1:"bar">
However, this feels a little awkward to me, and I wanted to ask if there's a better way to do it.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
据我所知,没有更好的方法可以使用。但是,我可以建议稍微改变一下吗?
为什么不更改正则表达式的编码,而不是更改输入的编码呢?每次遇到新的编码时都翻译一个正则表达式字符串比翻译数百或数千行输入以匹配正则表达式的编码要少得多。
在病理情况下(输入编码每行都会改变),这会同样慢,因为您每次都通过循环重新编码正则表达式。但在 99.9% 的情况下,数百或数千行的整个文件的编码是恒定的,这将导致重新编码的大量减少。
As far as I am aware, there is no better method to use. However, might I suggest a slight alteration?
Rather than changing the encoding of the input, why not change the encoding of the regex? Translating one regex string every time you meet a new encoding is a lot less work than translating hundreds or thousands of lines of input to match the encoding of your regex.
In the pathological case (where the input encoding changes every line) this would be just as slow, since you're re-encoding the regex every single time through the loop. But in 99.9% of situations where the encoding is constant for an entire file of hundreds or thousands of lines, this will result in a vast reduction in re-encoding.
请遵循此页面的建议:http: //gnuu.org/2009/02/02/ruby-19-common-problems-pt-1-encoding/ 并添加
到 rb 文件的顶部。
Follow the advice of this page: http://gnuu.org/2009/02/02/ruby-19-common-problems-pt-1-encoding/ and add
to the top of your rb file.