ruby 1.9：UTF-8 中的无效字节序列

发布于 2024-09-04 14:34:50 字数 313 浏览 3 评论 0原文

我正在用 Ruby (1.9) 编写一个爬虫，它消耗来自许多随机站点的大量 HTML。
当尝试提取链接时，我决定只使用 .scan(/href="(.*?)"/i) 而不是 nokogiri/hpricot （主要加速）。问题是我现在收到很多“UTF-8 中的无效字节序列”错误。
据我了解，net/http 库没有任何编码特定选项，并且传入的内容基本上没有正确标记。
实际处理传入数据的最佳方式是什么？我尝试使用替换和无效选项设置 .encode ，但到目前为止没有成功......

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

大姐，你呐 2024-09-11 14:34:50

在 Ruby 1.9.3 中，可以使用 String.encode 来“忽略”无效的 UTF-8 序列。这是一个在 1.8 中都可以使用的代码片段 (iconv ) 和 1.9 (字符串#encode) ：

require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
  file_contents.encode!('UTF-8', 'UTF-8', :invalid => :replace)
else
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end

或者如果你的输入确实很麻烦，你可以进行从 UTF-8 到 UTF-16 再到 UTF-8 的双重转换：

require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
  file_contents.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
  file_contents.encode!('UTF-8', 'UTF-16')
else
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end

In Ruby 1.9.3 it is possible to use String.encode to "ignore" the invalid UTF-8 sequences. Here is a snippet that will work both in 1.8 (iconv) and 1.9 (String#encode) :

require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
  file_contents.encode!('UTF-8', 'UTF-8', :invalid => :replace)
else
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end

or if you have really troublesome input you can do a double conversion from UTF-8 to UTF-16 and back to UTF-8:

require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
  file_contents.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
  file_contents.encode!('UTF-8', 'UTF-16')
else
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end

回复收藏 0 原文

暮年慕年 2024-09-11 14:34:50

接受的答案或其他答案都对我有用。我发现这篇文章建议

string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

这为我解决了问题。

The accepted answer nor the other answer work for me. I found this post which suggested

string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

This fixed the problem for me.

回复收藏 0 原文

心欲静而疯不止 2024-09-11 14:34:50

我当前的解决方案是运行：

my_string.unpack("C*").pack("U*")

这至少会消除我的主要问题的异常

My current solution is to run:

my_string.unpack("C*").pack("U*")

This will at least get rid of the exceptions which was my main problem

回复收藏 0 原文

习惯成性 2024-09-11 14:34:50

试试这个：

def to_utf8(str)
  str = str.force_encoding('UTF-8')
  return str if str.valid_encoding?
  str.encode("UTF-8", 'binary', invalid: :replace, undef: :replace, replace: '')
end

Try this:

def to_utf8(str)
  str = str.force_encoding('UTF-8')
  return str if str.valid_encoding?
  str.encode("UTF-8", 'binary', invalid: :replace, undef: :replace, replace: '')
end

回复收藏 0 原文

不必你懂 2024-09-11 14:34:50

我建议您使用 HTML 解析器。只要找到最快的就可以了。

解析 HTML 并不像看起来那么容易。

浏览器解析无效的UTF-8序列，在UTF-8 HTML文档中，只需放置“�”符号即可。因此，一旦 HTML 中的无效 UTF-8 序列被解析，生成的文本就是有效的字符串。

即使在属性值内部，您也必须解码像 amp 这样的 HTML 实体。

这是一个很好的问题，总结了为什么您不能使用正则表达式可靠地解析 HTML：
RegEx 匹配开放标记（XHTML 自包含标记除外）

回复收藏 0 原文

国粹 2024-09-11 14:34:50

attachment = file.read

begin
   # Try it as UTF-8 directly
   cleaned = attachment.dup.force_encoding('UTF-8')
   unless cleaned.valid_encoding?
     # Some of it might be old Windows code page
     cleaned = attachment.encode( 'UTF-8', 'Windows-1252' )
   end
   attachment = cleaned
 rescue EncodingError
   # Force it to UTF-8, throwing out invalid bits
   attachment = attachment.force_encoding("ISO-8859-1").encode("utf-8", replace: nil)
 end

attachment = file.read

begin
   # Try it as UTF-8 directly
   cleaned = attachment.dup.force_encoding('UTF-8')
   unless cleaned.valid_encoding?
     # Some of it might be old Windows code page
     cleaned = attachment.encode( 'UTF-8', 'Windows-1252' )
   end
   attachment = cleaned
 rescue EncodingError
   # Force it to UTF-8, throwing out invalid bits
   attachment = attachment.force_encoding("ISO-8859-1").encode("utf-8", replace: nil)
 end

回复收藏 0 原文

影子是时光的心 2024-09-11 14:34:50

这似乎有效：

def sanitize_utf8(string)
  return nil if string.nil?
  return string if string.valid_encoding?
  string.chars.select { |c| c.valid_encoding? }.join
end

This seems to work:

def sanitize_utf8(string)
  return nil if string.nil?
  return string if string.valid_encoding?
  string.chars.select { |c| c.valid_encoding? }.join
end

回复收藏 0 原文

陪你搞怪i 2024-09-11 14:34:50

我遇到过字符串，它混合了英语、俄语和其他一些字母，这导致了异常。我只需要俄语和英语，目前这对我有用：

ec1 = Encoding::Converter.new "UTF-8","Windows-1251",:invalid=>:replace,:undef=>:replace,:replace=>""
ec2 = Encoding::Converter.new "Windows-1251","UTF-8",:invalid=>:replace,:undef=>:replace,:replace=>""
t = ec2.convert ec1.convert t

I've encountered string, which had mixings of English, Russian and some other alphabets, which caused exception. I need only Russian and English, and this currently works for me:

ec1 = Encoding::Converter.new "UTF-8","Windows-1251",:invalid=>:replace,:undef=>:replace,:replace=>""
ec2 = Encoding::Converter.new "Windows-1251","UTF-8",:invalid=>:replace,:undef=>:replace,:replace=>""
t = ec2.convert ec1.convert t

回复收藏 0 原文

姐不稀罕 2024-09-11 14:34:50

虽然 Nakilon 的解决方案有效，至少就克服错误而言，但就我而言，我有一个来自 Microsoft Excel 的奇怪的 fed up 字符转换为 CSV，该字符在 ruby 中注册为（得到这个）西里尔字母 K，其中ruby 是粗体 K。为了解决这个问题，我使用了“iso-8859-1”即。 CSV.parse(f, :encoding => "iso-8859-1")，它把我奇怪的西里尔字母 K 变成了更易于管理的 /\xCA/ ，然后我可以使用 string.gsub!(/\xCA/, '') 删除它

回复收藏 0 原文

も星光 2024-09-11 14:34:50

在使用 scan 之前，请确保请求页面的 Content-Type 标头为 text/html，因为可能存在指向图像等内容的链接不是以 UTF-8 编码的。如果您在元素中选择了 href，则该页面也可能是非 html。如何检查这一点取决于您使用的 HTTP 库。然后，确保结果仅是带有 String#ascii_only? 的 ascii（不是 UTF-8，因为 HTML 只应该使用 ascii，否则可以使用实体）。如果这两项测试都通过，则可以安全地使用 scan。