Ruby: how do I check whether a web page encoded as GB2312 contains certain keywords?
I am using Ruby to read a web page whose content is:
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=GB2312" />
</HEAD>
<BODY>
中文
</BODY>
</HTML>
From the meta tag, we can see that it uses the GB2312 encoding.
My code is:
res = Net::HTTP.post_form(URI.parse("http://xxx/check"),
{:query=>'xxx'})
Then I use:
res.include?("中文")
to check whether the content contains that word, but it returns false.
I don't know why it is false. What should I do? Which encoding does Ruby 1.8.7 use? If I need to convert the encoding, how do I do it?
Comments (1)
Ruby 1.8 doesn't use encodings; it uses plain byte strings. If you want the byte string in your program to match the byte string in the web page, you'd have to save the .rb file in the same encoding the web page uses (GB2312) so that Ruby will see the same bytes.
Probably better would be to write the byte string explicitly, which avoids any issues with the encoding of the .rb file:
res.include?("\xD6\xD0\xCE\xC4")
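To try the byte-level check without making a request, here is a minimal sketch. The helper name page_has_bytes? and the sample body are made up for illustration, and the .b calls assume Ruby 2.0 or later; on 1.8.7 strings are already plain bytes, so the .b calls can simply be dropped.

```ruby
# Hypothetical helper: byte-level search, which is all Ruby 1.8 strings offer.
# String#b (Ruby 2.0+) returns a binary copy; on 1.8.7 drop the .b calls.
def page_has_bytes?(raw_body, raw_keyword)
  raw_body.b.include?(raw_keyword.b)
end

# Stand-in for res.body: the bytes a GB2312 page would carry (assumed sample).
sample_body = "<BODY>\n\xD6\xD0\xCE\xC4\n</BODY>"

# "\xD6\xD0\xCE\xC4" is 中文 in GB2312, spelled out byte by byte.
puts page_has_bytes?(sample_body, "\xD6\xD0\xCE\xC4")   # prints true
```

Because the keyword is written as escaped bytes, the result does not depend on how the .rb file itself is saved.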
However, matching byte strings doesn't match characters reliably when multibyte encodings are in use (except for UTF-8, which is deliberately designed to allow it). If the web page had the string:
兄形男
in it, that would be encoded as "\xD0\xD6\xD0\xCE\xC4\xD0", which contains the byte sequence "\xD6\xD0\xCE\xC4", so the include? would be true even though the characters 中文 are not present.
If you need to handle non-ASCII characters fully reliably, you'd need a language with Unicode support.
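As for converting the encoding: one option is to decode the page into characters before searching, which avoids the false positive described above. The following is only a sketch, assuming Ruby 1.9 or later, where String#force_encoding and String#encode are available (on 1.8.7 the standard-library Iconv class, e.g. Iconv.conv, plays a similar role). The helper names byte_match? and char_match? are hypothetical.

```ruby
# Hypothetical helper: compare raw bytes, the way Ruby 1.8 strings behave.
def byte_match?(raw_body, raw_keyword)
  raw_body.b.include?(raw_keyword.b)
end

# Hypothetical helper: label the bytes as GB2312, convert to UTF-8,
# then search character by character.
def char_match?(raw_body, keyword_utf8)
  text = raw_body.dup.force_encoding("GB2312")
  text = text.encode("UTF-8", :invalid => :replace, :undef => :replace)
  text.include?(keyword_utf8)
end

tricky  = "\xD0\xD6\xD0\xCE\xC4\xD0"  # the six-byte GB2312 example from the answer
keyword = "\u4E2D\u6587"              # 中文 as a UTF-8 string

puts byte_match?(tricky, "\xD6\xD0\xCE\xC4")      # prints true (false positive)
puts char_match?(tricky, keyword)                  # prints false
puts char_match?("\xD6\xD0\xCE\xC4", keyword)      # prints true
```

The byte search matches across a character boundary, while the decoded search only matches whole characters.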