ruby 1.9: invalid byte sequence in UTF-8
I'm writing a crawler in Ruby (1.9) that consumes lots of HTML from a lot of random sites. When trying to extract links, I decided to just use .scan(/href="(.*?)"/i) instead of nokogiri/hpricot (major speedup). The problem is that I now receive a lot of "invalid byte sequence in UTF-8" errors.

From what I understand, the net/http library doesn't have any encoding-specific options, and the stuff that comes in is basically not properly tagged.

What would be the best way to actually work with that incoming data? I tried .encode with the replace and invalid options set, but no success so far...
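For reference, a minimal sketch of the failing setup (the URL and variable names are illustrative):

```ruby
require 'net/http'
require 'uri'

# The body comes back with no reliable encoding information
body = Net::HTTP.get(URI('http://example.com/'))

# Can raise "invalid byte sequence in UTF-8" when the string ends up
# tagged as UTF-8 but contains bytes that are not valid UTF-8
links = body.scan(/href="(.*?)"/i)
```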
In Ruby 1.9.3 it is possible to use String.encode to "ignore" the invalid UTF-8 sequences. Here is a snippet that will work both in 1.8 (iconv) and 1.9 (String#encode):
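A sketch of such a snippet, assuming the fetched HTML is in a variable called file_contents:

```ruby
require 'iconv' unless String.method_defined?(:encode)

if String.method_defined?(:encode)
  # Ruby 1.9: replace invalid and undefined byte sequences
  file_contents.encode!('UTF-8', 'UTF-8', :invalid => :replace)
else
  # Ruby 1.8: Iconv's //IGNORE target silently drops bad bytes
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end
```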
or if you have really troublesome input you can do a double conversion from UTF-8 to UTF-16 and back to UTF-8:
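A sketch of that double conversion, under the same assumption:

```ruby
require 'iconv' unless String.method_defined?(:encode)

if String.method_defined?(:encode)
  # Transcoding to a different encoding forces the invalid-byte
  # replacement to actually happen, then we convert back
  file_contents.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
  file_contents.encode!('UTF-8', 'UTF-16')
else
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end
```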
Neither the accepted answer nor the other answer worked for me. I found a post which suggested the approach sketched below, and it fixed the problem for me.
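A sketch of that kind of fix, transcoding from the binary (ASCII-8BIT) encoding so that every undecodable byte can be replaced (str is an assumed variable name):

```ruby
# Reading the bytes as binary and encoding to UTF-8 lets
# :invalid/:undef replacement strip everything that won't decode
str.encode!('UTF-8', 'binary', :invalid => :replace, :undef => :replace, :replace => '')
```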
My current solution is to run the conversion sketched below. This will at least get rid of the exceptions, which was my main problem.
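The exact snippet is uncertain; one common one-liner of this shape reinterprets each raw byte as a codepoint, which always yields a valid UTF-8 string (at the cost of mangling multi-byte characters):

```ruby
# Each byte becomes one codepoint, so the result is always valid UTF-8;
# non-ASCII multi-byte sequences come out as individual Latin-1 characters
clean = dirty.unpack('C*').pack('U*')
```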
Try this:
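A sketch of a helper along those lines (the name to_utf8 and the fallback strategy are assumptions):

```ruby
def to_utf8(str)
  str = str.force_encoding('UTF-8')
  return str if str.valid_encoding?
  # Fall back to replacing whatever does not decode as UTF-8
  str.encode('UTF-8', 'binary', :invalid => :replace, :undef => :replace, :replace => '')
end
```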
I recommend you use an HTML parser. Just find the fastest one.

Parsing HTML is not as easy as it may seem.

Browsers handle invalid UTF-8 sequences in UTF-8 HTML documents by just putting in the "�" symbol, so once the invalid UTF-8 sequence in the HTML gets parsed, the resulting text is a valid string.

Even inside attribute values you have to decode HTML entities like &amp;.

Here is a great question that sums up why you cannot reliably parse HTML with a regular expression:

RegEx match open tags except XHTML self-contained tags
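As an illustration (Nokogiri is one option; this example is not from the original answer):

```ruby
require 'nokogiri'

doc = Nokogiri::HTML(html)                        # recovers from broken markup
links = doc.css('a[href]').map { |a| a['href'] }  # entities come back decoded
```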
This seems to work:
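An assumed reconstruction of that snippet, round-tripping through UTF-16BE so that invalid bytes are replaced on the way out:

```ruby
# Transcode away from UTF-8 (replacing bad bytes), then back again
fixed = html.encode('UTF-16BE', :invalid => :replace, :replace => '?').encode('UTF-8')
```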
I've encountered strings that mixed English, Russian, and some other alphabets, which caused the exception. I only need Russian and English, and this currently works for me:
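A sketch of that whitelist approach; the exact character class is an assumption:

```ruby
# First drop invalid bytes, then keep only Latin/Cyrillic letters,
# digits, punctuation, and whitespace
clean = raw.encode('UTF-8', 'binary', :invalid => :replace, :undef => :replace, :replace => '')
clean = clean.gsub(/[^a-zA-Zа-яА-ЯёЁ0-9[:punct:][:space:]]/, '')
```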
While Nakilon's solution works, at least as far as getting past the error, in my case I had a weird, f-ed up character originating from Microsoft Excel converted to CSV that was registering in Ruby as (get this) a Cyrillic K, which in Ruby was a bolded K. To fix this I used 'iso-8859-1', viz. CSV.parse(f, :encoding => "iso-8859-1"), which turned my freaky deaky Cyrillic K's into a much more manageable /\xCA/, which I could then remove with string.gsub!(/\xCA/, '').
Before you use scan, make sure that the requested page's Content-Type header is text/html, since there can be links to things like images which are not encoded in UTF-8. The page could also be non-HTML if you picked up a href in something like a <link> element. How to check this varies depending on which HTTP library you are using. Then, make sure the result is only ASCII with String#ascii_only? (not UTF-8, because HTML is only supposed to be using ASCII; entities can be used otherwise). If both of those tests pass, it is safe to use scan.
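A sketch of those two checks with net/http (URL and variable names are illustrative):

```ruby
require 'net/http'
require 'uri'

res = Net::HTTP.get_response(URI('http://example.com/'))

# Skip images and other non-HTML responses, and only scan
# bodies that are pure ASCII (entities cover everything else)
if res['Content-Type'].to_s.start_with?('text/html') && res.body.ascii_only?
  links = res.body.scan(/href="(.*?)"/i)
end
```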
There is also the scrub method to filter invalid bytes.
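For example (String#scrub is available from Ruby 2.1):

```ruby
"foo\x92bar".scrub       # => "foo�bar" (invalid bytes become U+FFFD)
"foo\x92bar".scrub('')   # => "foobar"  (invalid bytes dropped)
```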
If you don't "care" about the data you can just do something like:

search_params = params[:search].valid_encoding? ? params[:search].gsub(/\W+/, '') : "nothing"

I just used valid_encoding? to get past it. Mine is a search field, and so I was finding the same weirdness over and over, so I used something like the above just to have the system not break. Since I don't control the user experience to auto-validate prior to sending this info (like auto feedback to say "dummy up!"), I can just take it in, strip it out, and return blank results.