Ruby encoding from previous hex encoding
I have a situation where a Nokogiri result contains hex-encoded text. The actual encoding of the result is UTF-8, but the strings contain hex characters:
Best 100+ Fishing Pictures | Download Free Images on Unsplash
https%3A%2F%2Funsplash.com%2Fs%2Fphotos%2Ffishing&rut=d1dd8233a6ad628121fa36d8d5a51be0b6fb0eda75e234d5036bf7b49efcf25b
current encoding: UTF-8
Fish Images | Free Vectors, Stock Photos & PSD
https%3A%2F%2Fwww.freepik.com%2Ffree%2Dphotos%2Dvectors%2Ffish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68
current encoding: UTF-8
How to Use Fish vs. fishes Correctly
https%3A%2F%2Fgrammarist.com%2Fusage%2Ffish%2Dfishes%2F&rut=e0897e219c9b0b125a1442b59e36c49753417a1b7812ae9d3ab0bc3179ffe6b5
current encoding: UTF-8
The URLs are technically encoded as UTF-8, but contain hex characters. I haven't found anything that treats them as hex and translates them to UTF-8, so I'm lost as to how to recognize those character groupings for translation. Short of writing a complex method that might work, I thought I would see if there's a way to force recognition of the original string so that it can then be translated using force_encode or something of that sort.
Does anybody have any advice on how to accomplish this? Any insight is appreciated. I'd rather avoid having to hand-code these characters into a method.
Update: CGI::unescapeHTML(<string>) isn't working:
irb(main):024:0> a
=> "https%3A%2F%2Fwww.freepik.com%2Ffree%2Dphotos%2Dvectors%2Ffish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68"
irb(main):025:0> CGI::unescapeHTML(a)
=> "https%3A%2F%2Fwww.freepik.com%2Ffree%2Dphotos%2Dvectors%2Ffish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68"
irb(main):026:0> CGI::unescapeHTML(a) == a
=> true
You didn't give the source for your "encoding of the result is UTF-8, but contains hex characters" in the original question. I don't think I understand that question.
In your update, you used the incorrect method. unescapeHTML is for resolving HTML entities (such as &amp;), not percent-encoded sequences like %3A or %2F. The method you need is one that decodes URL escape sequences.
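The code that originally accompanied this answer did not survive on this page, so the following is only a sketch of the distinction, not necessarily what the answer showed. It assumes the standard library's CGI.unescape (with URI.decode_www_form_component as an equivalent alternative) is the kind of URL-decoding method meant; the helper name decode_url is hypothetical and exists purely for illustration.

```ruby
require "cgi"
require "uri"

# The string from the question's irb session: valid UTF-8, but percent-encoded.
encoded = "https%3A%2F%2Fwww.freepik.com%2Ffree%2Dphotos%2Dvectors%2Ffish" \
          "&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68"

# Hypothetical helper (not from the original answer): decode %XX escape sequences.
# Note that CGI.unescape also turns "+" into a space, as form encoding requires.
def decode_url(str)
  CGI.unescape(str)
end

# unescapeHTML only touches HTML entities, so a percent-encoded URL is unchanged.
html_decoded = CGI.unescapeHTML("Stock Photos &amp; PSD")
untouched    = CGI.unescapeHTML(encoded)

# URL decoding replaces each %XX with the byte it represents.
decoded     = decode_url(encoded)
decoded_alt = URI.decode_www_form_component(encoded)

puts html_decoded
puts decoded
```

If only the destination URL is wanted, the trailing &rut=... parameter would still need to be split off after decoding; that step is left out here because the question does not say how the strings are extracted from the page.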
If that doesn't solve your actual problem, I'm happy to revise, given more debuggable source code in the question.