Ruby encoding of previously hex-encoded strings

Posted on 2025-01-16 00:10:38

I have a situation where results parsed with Nokogiri contain hex-encoded characters. The strings report their encoding as UTF-8, but they contain hex escape sequences such as %3A and %2F:

Best 100+ Fishing Pictures | Download Free Images on Unsplash
https%3A%2F%2Funsplash.com%2Fs%2Fphotos%2Ffishing&rut=d1dd8233a6ad628121fa36d8d5a51be0b6fb0eda75e234d5036bf7b49efcf25b
current encoding: UTF-8

Fish Images | Free Vectors, Stock Photos & PSD
https%3A%2F%2Fwww.freepik.com%2Ffree%2Dphotos%2Dvectors%2Ffish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68
current encoding: UTF-8

How to Use Fish vs. fishes Correctly
https%3A%2F%2Fgrammarist.com%2Fusage%2Ffish%2Dfishes%2F&rut=e0897e219c9b0b125a1442b59e36c49753417a1b7812ae9d3ab0bc3179ffe6b5
current encoding: UTF-8

The URLs are technically encoded as UTF-8 but contain hex sequences. I haven't found anything that recognizes them as hex and translates them to UTF-8, so I'm lost as to how to identify those character groupings for translation. Short of writing a complex method that might work, I thought I would ask whether there's a way to force recognition of the original string's encoding so it can then be translated using force_encoding or something of that sort.

Does anybody have advice on how to accomplish this? Any insight is appreciated. I'd rather avoid hand-coding these characters into a method.

Update:
CGI::unescapeHTML(<string>) isn't working:

irb(main):024:0> a
=> "https%3A%2F%2Fwww.freepik.com%2Ffree%2Dphotos%2Dvectors%2Ffish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68"
irb(main):025:0> CGI::unescapeHTML(a)
=> "https%3A%2F%2Fwww.freepik.com%2Ffree%2Dphotos%2Dvectors%2Ffish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68"
irb(main):026:0> CGI::unescapeHTML(a) == a
=> true

酒浓于脸红 2025-01-23 00:10:39

You didn't show the source of your "encoding of the result is UTF-8, but contains hex characters" claim in the original question, so I'm not sure I understand that part of the question.

In your update, you used the incorrect method. unescapeHTML is for resolving HTML entities:

irb(main):010:0> CGI.escapeHTML '<'
=> "&lt;"
irb(main):012:0> CGI.unescapeHTML '&lt;'
=> "<"
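As a minimal sketch of why the call in the update returned its input unchanged, assuming only Ruby's bundled cgi library: unescapeHTML rewrites HTML entities and leaves percent sequences alone. The sample strings below are made up for illustration and do not come from the question.

```ruby
require "cgi"

# Made-up sample containing both an HTML entity (&amp;) and a percent escape (%3A).
mixed = "Fish &amp; Chips %3A menu"

# Only the HTML entity is resolved; the percent escape is left as-is.
html_decoded = CGI.unescapeHTML(mixed)
puts html_decoded

# A string with percent escapes but no HTML entities comes back unchanged,
# which matches the `== a` check in the question's update.
url_only = "https%3A%2F%2Fexample.com%2Ffish"
puts CGI.unescapeHTML(url_only) == url_only
```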

The method you need is CGI.unescape, which decodes URL (percent-encoded) sequences:

irb(main):017:0> encoded_url = "https%3A%2F%2Fwww.freepik.com%2Ffree%2Dphotos%2Dvectors%2Ffish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68"
=> "https%3A%2F%2Fwww.freepik.com%2Ffree%2Dphotos%2Dvectors%2Ffish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68"
irb(main):018:0> CGI.unescape encoded_url
=> "https://www.freepik.com/free-photos-vectors/fish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68"
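Note that the decoded string still ends in "&rut=...". If that suffix is a separate parameter added by the search page rather than part of the target URL (an assumption; the question does not say), a small wrapper could drop it before decoding. The helper name decode_result_url is hypothetical, and this is only a sketch of one way to do it.

```ruby
require "cgi"

# Hypothetical helper, not from the original post: decode a percent-encoded
# link and drop everything from the first literal "&" onward.
def decode_result_url(raw)
  # Split before decoding so that an encoded "%26" inside the target URL
  # is not mistaken for the separator.
  encoded_target, _rest = raw.split("&", 2)
  CGI.unescape(encoded_target)
end

# Sample string taken from the question.
raw = "https%3A%2F%2Fgrammarist.com%2Fusage%2Ffish%2Dfishes%2F&rut=e0897e219c9b0b125a1442b59e36c49753417a1b7812ae9d3ab0bc3179ffe6b5"

decoded = decode_result_url(raw)
puts decoded
```

Splitting first and decoding second is a deliberate ordering: decoding first would turn any encoded ampersand in the target into a literal one and make the split ambiguous.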

If that doesn't solve your actual problem, I'm happy to revise this answer once the question includes source code that is easier to debug.
