Ruby 1.9: invalid byte sequence in UTF-8

Posted on 2024-09-04 14:34:50

I'm writing a crawler in Ruby (1.9) that consumes lots of HTML from a lot of random sites.
When trying to extract links, I decided to just use .scan(/href="(.*?)"/i) instead of nokogiri/hpricot (major speedup). The problem is that I now receive a lot of "invalid byte sequence in UTF-8" errors.
From what I understood, the net/http library doesn't have any encoding specific options and the stuff that comes in is basically not properly tagged.
What would be the best way to actually work with that incoming data? I tried .encode with the replace and invalid options set, but no success so far...

Comments (12)

大姐,你呐 2024-09-11 14:34:50

In Ruby 1.9.3 it is possible to use String#encode to "ignore" invalid UTF-8 sequences. Here is a snippet that works in both 1.8 (Iconv) and 1.9 (String#encode):

require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
  file_contents.encode!('UTF-8', 'UTF-8', :invalid => :replace)
else
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end

Or, if you have really troublesome input, you can do a double conversion from UTF-8 to UTF-16 and back to UTF-8:

require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
  file_contents.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
  file_contents.encode!('UTF-8', 'UTF-16')
else
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end
暮年慕年 2024-09-11 14:34:50

Neither the accepted answer nor the other answer worked for me. I found this post, which suggested

string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

This fixed the problem for me.
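
One caveat, shown as a minimal sketch: declaring the source as 'binary' makes every byte above 0x7F undefined for the conversion, so valid multi-byte UTF-8 characters are dropped together with the genuinely invalid bytes. The helper name and the sample string are made up for illustration.

```ruby
# Hypothetical wrapper around the one-liner above.
def strip_via_binary(str)
  str.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
end

# Sample input: "café" as valid UTF-8 bytes, followed by a stray invalid byte 0xFF.
mixed = "caf\xC3\xA9 \xFF!".dup.force_encoding('UTF-8')

cleaned = strip_via_binary(mixed)
puts cleaned              # both the valid "é" and the invalid 0xFF are gone
puts cleaned.ascii_only?  # only ASCII survives this conversion
```

If non-ASCII text matters (as it usually does for a crawler), this is probably too aggressive; it is fine when only ASCII content such as plain URLs is needed.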

心欲静而疯不止 2024-09-11 14:34:50

My current solution is to run:

my_string.unpack("C*").pack("U*")

This will at least get rid of the exceptions which was my main problem
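
A small sketch of what this does to non-ASCII input, assuming a made-up sample string: each byte is reinterpreted as its own code point, so the result is always valid UTF-8, but characters that were already correctly encoded turn into mojibake.

```ruby
# Hypothetical wrapper around the one-liner above.
def repack_bytes(str)
  str.unpack("C*").pack("U*")
end

bytes = "caf\xC3\xA9".b          # "café" as raw UTF-8 bytes
repacked = repack_bytes(bytes)

puts repacked.valid_encoding?    # the exception is gone
puts repacked                    # but "é" (0xC3 0xA9) is now two characters, U+00C3 and U+00A9
```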

习惯成性 2024-09-11 14:34:50

Try this:

def to_utf8(str)
  str = str.force_encoding('UTF-8')
  return str if str.valid_encoding?
  str.encode("UTF-8", 'binary', invalid: :replace, undef: :replace, replace: '')
end
不必你懂 2024-09-11 14:34:50

I recommend using an HTML parser. Just find the fastest one.

Parsing HTML is not as easy as it may seem.

Browsers handle invalid UTF-8 sequences in UTF-8 HTML documents by simply substituting the "�" symbol. So once the HTML has been parsed, the resulting text is a valid string.

Even inside attribute values you have to decode HTML entities like &amp;.

Here is a great question that sums up why you can not reliably parse HTML with a regular expression:
RegEx match open tags except XHTML self-contained tags

国粹 2024-09-11 14:34:50
attachment = file.read

begin
  # Try it as UTF-8 directly
  cleaned = attachment.dup.force_encoding('UTF-8')
  unless cleaned.valid_encoding?
    # Some of it might be in the old Windows code page
    cleaned = attachment.encode('UTF-8', 'Windows-1252')
  end
  attachment = cleaned
rescue EncodingError
  # Last resort: reinterpret the bytes as ISO-8859-1, which maps every byte to a character
  attachment = attachment.force_encoding('ISO-8859-1').encode('UTF-8')
end
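
The same fallback chain can be sketched as a reusable function. The method name and the sample strings are assumptions made for illustration; the order (UTF-8, then Windows-1252, then ISO-8859-1) follows the snippet above.

```ruby
# Hypothetical helper: returns a valid UTF-8 string for arbitrary input bytes.
def decode_to_utf8(bytes)
  as_utf8 = bytes.dup.force_encoding('UTF-8')
  return as_utf8 if as_utf8.valid_encoding?

  begin
    # Interpret the bytes as Windows-1252; this raises if a byte has no mapping.
    bytes.encode('UTF-8', 'Windows-1252')
  rescue EncodingError
    # ISO-8859-1 assigns a character to every byte, so this conversion cannot fail.
    bytes.encode('UTF-8', 'ISO-8859-1')
  end
end

puts decode_to_utf8("caf\xC3\xA9".b)  # already valid UTF-8, returned unchanged
puts decode_to_utf8("caf\xE9".b)      # single byte 0xE9 is read as Windows-1252 "é"
```

Note that this never discards bytes. It guesses an encoding, so text that was really in some other code page will come out as valid but wrong characters.
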
影子是时光的心 2024-09-11 14:34:50

This seems to work:

def sanitize_utf8(string)
  return nil if string.nil?
  return string if string.valid_encoding?
  string.chars.select { |c| c.valid_encoding? }.join
end
陪你搞怪i 2024-09-11 14:34:50

I've encountered a string that mixed English, Russian, and some other alphabets, which caused an exception. I need only Russian and English, and this currently works for me:

ec1 = Encoding::Converter.new "UTF-8","Windows-1251",:invalid=>:replace,:undef=>:replace,:replace=>""
ec2 = Encoding::Converter.new "Windows-1251","UTF-8",:invalid=>:replace,:undef=>:replace,:replace=>""
t = ec2.convert ec1.convert t
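
As a rough illustration of the effect, assuming a made-up sample string and helper name: the round trip through Windows-1251 acts as a filter that keeps only characters that code page can represent.

```ruby
# Hypothetical helper wrapping the two converters from the snippet above.
def keep_cp1251_only(text)
  to_cp1251   = Encoding::Converter.new('UTF-8', 'Windows-1251',
                                        invalid: :replace, undef: :replace, replace: '')
  from_cp1251 = Encoding::Converter.new('Windows-1251', 'UTF-8',
                                        invalid: :replace, undef: :replace, replace: '')
  from_cp1251.convert(to_cp1251.convert(text))
end

# "ö" (U+00F6) has no Windows-1251 equivalent, so it is dropped;
# Cyrillic and ASCII characters survive the round trip.
puts keep_cp1251_only("Привет, w\u00F6rld")
```
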
姐不稀罕 2024-09-11 14:34:50

While Nakilon's solution works, at least as far as getting past the error, in my case I had a weird, messed-up character originating from a Microsoft Excel file converted to CSV. Ruby registered it as (get this) a Cyrillic K, which showed up as a bolded K. To fix this I used 'iso-8859-1', i.e. CSV.parse(f, :encoding => "iso-8859-1"), which turned my freaky-deaky Cyrillic K's into a much more manageable /\xCA/, which I could then remove with string.gsub!(/\xCA/, '').

も星光 2024-09-11 14:34:50

Before you use scan, make sure that the requested page's Content-Type header is text/html, since there can be links to things like images, which are not encoded in UTF-8. The page could also be non-HTML if you picked up an href in something like a <link> element. How to check this depends on which HTTP library you are using. Then, make sure the result is ASCII only with String#ascii_only? (not UTF-8, because HTML is only supposed to be using ASCII; entities can be used otherwise). If both of those tests pass, it is safe to use scan.
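
The two checks could be sketched roughly as follows. The helper names are hypothetical, and the Content-Type value is passed in as a plain string because how you obtain it depends on your HTTP library.

```ruby
# Hypothetical helper: returns the lower-cased media type from a Content-Type
# header value, e.g. "text/html" for "Text/HTML; charset=ISO-8859-1".
def media_type(content_type)
  content_type.to_s.split(';').first.to_s.strip.downcase
end

# Hypothetical helper: true only when the response is declared as HTML
# and the body consists of ASCII characters only.
def safe_to_scan?(content_type, body)
  media_type(content_type) == 'text/html' && body.ascii_only?
end

html = '<a href="/a">a</a> <a HREF="/b">b</a>'
if safe_to_scan?('text/html; charset=utf-8', html)
  puts html.scan(/href="(.*?)"/i).flatten.inspect
end
```

Keep in mind that this skips every page containing any non-ASCII byte, which in practice may exclude a large share of real-world HTML.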

南冥有猫 2024-09-11 14:34:50

There is also the scrub method (String#scrub, available since Ruby 2.1) to filter invalid bytes.

string.scrub('')
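
Applied to the original question, a minimal sketch might look like this. The helper name and sample page are made up, and the string is assumed to be tagged as UTF-8 already. Unlike a conversion through 'binary', scrub removes only the invalid bytes and keeps valid multi-byte characters.

```ruby
# Hypothetical helper: drop invalid bytes, then run the regex scan from the question.
# String#scrub requires Ruby 2.1 or later.
def extract_links(html)
  html.scrub('').scan(/href="(.*?)"/i).flatten
end

# Sample page: one link with a valid "é", a stray invalid byte 0xFF, then another link.
page = "<a href=\"/caf\xC3\xA9\">x</a>\xFF<a HREF=\"/b\">y</a>".dup.force_encoding('UTF-8')

puts page.valid_encoding?         # false, because of the stray 0xFF byte
puts extract_links(page).inspect  # both links are found, with "é" intact
```
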
我不咬妳我踢妳 2024-09-11 14:34:50

If you don't "care" about the data you can just do something like:

search_params = params[:search].valid_encoding? ? params[:search].gsub(/\W+/, '') : "nothing"

I just used valid_encoding? to get past it. Mine is a search field, and I kept running into the same weirdness over and over, so I used something like the above just to keep the system from breaking. Since I don't control the user experience to auto-validate before this info is sent (like automatic feedback to say "dummy up!"), I can just take it in, strip it out, and return blank results.
