Completely unescape & percent-decode a URL
I am working on an RSS news parser. The URLs I find in the content vary widely: hrefs may be HTML-escaped or not, and URL-encoded or not:
URL-encoded:
https://www.lefigaro.fr/flash-eco/la-russie-a-gagne-93-0220613#:~:text=La%20Russie%20a%20engrang%C3%A9%2093,qui%20%C3%A9pingle%20particuli%C3%A8rement%20la%20France
Escaped:
http://mp.weixin.qq.com/s?__biz=MzI3MjE0NDA1MQ==&mid=2658568&idx=1&sn=b50084652c901&chksm=f0cb0fabcee7d4&scene=21#wechat_redirect
Not encoded & not escaped:
https://newsquawk.com/daily/article?id=2490-us-market-open-concerns&utm_source=tradingview&utm_medium=research&utm_campaign=partner-post
Additionally, RSS feeds may contain unencoded unsafe characters:
https://www.unsafe.com/a<b>c{d}e[f ]\g^
I need to make all the URLs formally "safe". It seems the only way to get a formally safe URL is to completely unescape and decode it first?
Can I somehow normalize all these different URLs? Is there a way to get a completely unescaped and decoded URL in Go?
func decode(rawURL string) (string, error) {
	// ??
}
Comments (1)
The URL-encoded example is good as-is; that is how you transmit data as part of a URL. If you need the decoded version, parse the URL and print its URL.Fragment field. As to the second, simply use html.UnescapeString().
You do not need to decode the link, as the encoded form is the valid one. You must use the encoded form; the receiving server is the one that needs to decode it.
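A minimal sketch of those two steps, using only the standard net/url and html packages. The helper name decodedFragment is hypothetical, and the &amp; entities in the second sample are an assumption about how the HTML-escaped link looked in the feed (shortened here for readability).

```go
package main

import (
	"fmt"
	"html"
	"net/url"
)

// decodedFragment is a hypothetical helper: it parses rawURL and returns
// the percent-decoded fragment (the part after '#').
func decodedFragment(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	// url.Parse stores the decoded form in Fragment.
	return u.Fragment, nil
}

func main() {
	// URL-encoded sample from the question.
	encoded := "https://www.lefigaro.fr/flash-eco/la-russie-a-gagne-93-0220613#:~:text=La%20Russie%20a%20engrang%C3%A9%2093,qui%20%C3%A9pingle%20particuli%C3%A8rement%20la%20France"
	frag, err := decodedFragment(encoded)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(frag)

	// HTML-escaped sample (assumed form): &amp; becomes a plain & again.
	escaped := "http://mp.weixin.qq.com/s?__biz=MzI3MjE0NDA1MQ==&amp;mid=2658568&amp;idx=1#wechat_redirect"
	fmt.Println(html.UnescapeString(escaped))
}
```

The decoded fragment is meant for display or inspection only; the link you store or request should stay in its encoded form.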
To detect whether a URL is HTML-escaped, you may check whether it contains the semicolon character ;, as it is reserved in URLs (see RFC 1738) and HTML escape sequences contain it. This is only a heuristic, since a semicolon can also appear legitimately in a URL. A decode() function can apply html.UnescapeString() only when that check succeeds. If you're afraid of malicious or invalid URLs, you may also parse and re-encode the URL.
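The semicolon check and the parse-and-re-encode step can be sketched together as follows. This is one possible shape for decode(), not a definitive implementation: the function body is an assumption built from the standard html, net/url and strings packages, and the &amp; entities in the second test input are assumed (and the link shortened).

```go
package main

import (
	"fmt"
	"html"
	"net/url"
	"strings"
)

// decode is a sketch: it undoes HTML escaping when the input looks
// HTML-escaped, then parses the URL and re-encodes it.
func decode(raw string) (string, error) {
	// Heuristic: HTML escape sequences such as &amp; contain ';'.
	if strings.Contains(raw, ";") {
		raw = html.UnescapeString(raw)
	}
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	// Re-encode the query string. Encode sorts parameters by key,
	// so their order may differ from the input.
	u.RawQuery = u.Query().Encode()
	// String re-escapes path and fragment characters that need it.
	return u.String(), nil
}

func main() {
	inputs := []string{
		"https://www.lefigaro.fr/flash-eco/la-russie-a-gagne-93-0220613#:~:text=La%20Russie%20a%20engrang%C3%A9%2093,qui%20%C3%A9pingle%20particuli%C3%A8rement%20la%20France",
		"http://mp.weixin.qq.com/s?__biz=MzI3MjE0NDA1MQ==&amp;mid=2658568&amp;idx=1#wechat_redirect",
		"https://newsquawk.com/daily/article?id=2490-us-market-open-concerns&utm_source=tradingview&utm_medium=research&utm_campaign=partner-post",
		`https://www.unsafe.com/a<b>c{d}e[f ]\g^`,
	}
	for _, in := range inputs {
		out, err := decode(in)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(out)
	}
}
```

Note that URL.Query() silently discards malformed key/value pairs, so if every parameter must survive, keep u.RawQuery as parsed instead of re-encoding it.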