gsub :: ArgumentError (invalid byte sequence in UTF-8)

Posted 2025-01-08 12:42:58

This code uses the Hpricot gem to get HTML that contains UTF-8 characters.

# <div>This is a test<a href="">测试</a></div>
div[0].to_html.gsub(/test/, "")

When that is run, it spits out this error (pointing at gsub):

ArgumentError (invalid byte sequence in UTF-8)

How can we fix this issue?

全部不再 2025-01-15 12:42:58

Figured out the issue. Hpricot's to_html calls methods that trigger the error, so to get rid of it we need to make the whole Hpricot document UTF-8, not just that one string. We do that like this:

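require "open-uri"
require "iconv"  # standard library before Ruby 2.0; the iconv gem afterwards
ic = Iconv.new("UTF-8//IGNORE", "UTF-8")  # UTF-8 to UTF-8, ignoring invalid bytes
doc = open("http://example.com") {|f| Hpricot(ic.iconv(f.read)) }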

After that we can call other Hpricot methods as usual. The whole document is now valid UTF-8, so they no longer raise the error.
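
Iconv was removed from Ruby's standard library in Ruby 2.0, so on newer Rubies the snippet above only works with the separate iconv gem. As a rough sketch of the same idea (drop the invalid bytes before the text reaches the parser), String#scrub, available since Ruby 2.1, can stand in for Iconv. The helper name clean_utf8 and the sample input are made up for illustration and are not part of the original answer:

```ruby
# Hypothetical helper (not from the original answer): return a copy of +raw+
# tagged as UTF-8, with every byte that is not part of a valid UTF-8
# sequence removed.
def clean_utf8(raw)
  raw.dup.force_encoding(Encoding::UTF_8).scrub("")
end

# Illustrative input: the HTML from the question with one stray byte (0xFF)
# appended, held as a binary string the way a raw download might be.
html  = "<div>This is a test<a href=\"\">测试</a></div>"
dirty = html.b + "\xFF".b

clean = clean_utf8(dirty)
puts clean.valid_encoding?
puts clean.gsub(/test/, "")
```

Under that assumption, the cleaned string would be passed to Hpricot(...) in place of ic.iconv(f.read). Passing a replacement such as "?" to scrub instead of "" keeps a visible marker where bytes were dropped.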

高速公鹿 2025-01-15 12:42:58

It looks like to_html returns a non-UTF-8 string in this case.

I had the same problem with a file containing some non-UTF-8 characters. The fix I found isn't pretty, but it could also work for your case:

the_utf8_string = the_non_utf8_string.unpack('C*').pack('U*')

Be careful: I'm not sure that no data gets lost.
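
On the data-loss question: unpack('C*') reads the string as a list of byte values, and pack('U*') writes each value back out as the Unicode code point with that number. That amounts to treating the input as ISO-8859-1, so no byte is dropped, but any character that was already valid multibyte UTF-8 gets split into one character per byte. A small sketch of both cases, where the helper name bytes_to_utf8 and the sample strings are made up for illustration:

```ruby
# Hypothetical wrapper around the one-liner above: reinterpret every byte
# of +str+ as the Unicode code point with the same numeric value.
def bytes_to_utf8(str)
  str.unpack("C*").pack("U*")
end

# Case 1: input that really is ISO-8859-1 ("café", with é as the byte 0xE9).
latin1 = "caf\xE9".b
fixed  = bytes_to_utf8(latin1)
puts fixed.valid_encoding?
puts fixed == "caf\u00E9"

# Case 2: input that was already UTF-8. "é" is the two bytes C3 A9, and each
# byte becomes its own character (U+00C3 and U+00A9).
utf8    = "\u00E9"
mangled = bytes_to_utf8(utf8)
puts mangled.length
puts mangled == "\u00C3\u00A9"
```

So the trick is only safe when the source text really is single-byte Latin-1. For the HTML in the question, which contains Chinese characters, it would produce a valid but garbled string.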
