Completely unescape & percent-decode a URL

Posted 2025-02-06 23:17:57 | 1106 words | 1 view | 0 comments

I am working on an RSS news parser. The hrefs I find in the content come in very different forms: HTML-escaped or not, URL-encoded or not:

URL-encoded:

https://www.lefigaro.fr/flash-eco/la-russie-a-gagne-93-0220613#:~:text=La%20Russie%20a%20engrang%C3%A9%2093,qui%20%C3%A9pingle%20particuli%C3%A8rement%20la%20France

Escaped:

http://mp.weixin.qq.com/s?__biz=MzI3MjE0NDA1MQ==&amp;mid=2658568&amp;idx=1&amp;sn=b50084652c901&amp;chksm=f0cb0fabcee7d4&amp;scene=21#wechat_redirect

Not encoded & not escaped:

https://newsquawk.com/daily/article?id=2490-us-market-open-concerns&utm_source=tradingview&utm_medium=research&utm_campaign=partner-post

Additionally, RSS feeds may contain unencoded unsafe characters to begin with:

https://www.unsafe.com/a<b>c{d}e[f ]\g^

I need to make all the URLs formally "safe". It seems the only way to get a formally safe URL is to completely unescape and decode it first?


Can I somehow normalize all these different URLs? Is there a way to get a completely unescaped and decoded URL in Go?

func decodeURL(rawURL string) (decoded string, err error) {
    // ??
}

Comments (1)

你与昨日 2025-02-13 23:17:57

The URL-encoded example is fine as-is; that is how data is transmitted as part of a URL. If you need the decoded version, parse the URL and print its URL.Fragment field.

As for the second, simply use html.UnescapeString().

For example:

// Requires the imports "fmt", "html" and "net/url".
s := "https://www.lefigaro.fr/flash-eco/la-russie-a-gagne-93-0220613#:~:text=La%20Russie%20a%20engrang%C3%A9%2093,qui%20%C3%A9pingle%20particuli%C3%A8rement%20la%20France"
u, err := url.Parse(s)
if err != nil {
    panic(err)
}

// url.Parse stores the percent-decoded fragment in u.Fragment.
fmt.Println(u.Fragment)

// The second URL is HTML-escaped: each "&amp;" is turned back into "&".
s2 := "http://mp.weixin.qq.com/s?__biz=MzI3MjE0NDA1MQ==&amp;mid=2658568&amp;idx=1&amp;sn=b50084652c901&amp;chksm=f0cb0fabcee7d4&amp;scene=21#wechat_redirect"
fmt.Println(html.UnescapeString(s2))

This will output (try it on the Go Playground):

:~:text=La Russie a engrangé 93,qui épingle particulièrement la France
http://mp.weixin.qq.com/s?__biz=MzI3MjE0NDA1MQ==&mid=2658568&idx=1&sn=b50084652c901&chksm=f0cb0fabcee7d4&scene=21#wechat_redirect

You do not need to decode the link, as the encoded form is the valid one. You must use the encoded form; the receiving server is the one that needs to decode it.
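To illustrate the point, here is a minimal sketch showing that parsing gives you the decoded fragment for display while u.String() hands back the encoded form to use in requests. The helper name roundTrip is hypothetical, not something from the question.

```go
package main

import (
	"fmt"
	"net/url"
)

// roundTrip (hypothetical name) parses rawURL and returns the decoded
// fragment together with the re-serialized, still percent-encoded URL.
func roundTrip(rawURL string) (fragment, encoded string, err error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", "", err
	}
	return u.Fragment, u.String(), nil
}

func main() {
	s := "https://www.lefigaro.fr/flash-eco/la-russie-a-gagne-93-0220613#:~:text=La%20Russie%20a%20engrang%C3%A9%2093,qui%20%C3%A9pingle%20particuli%C3%A8rement%20la%20France"
	frag, enc, err := roundTrip(s)
	if err != nil {
		panic(err)
	}
	fmt.Println(frag)     // human-readable fragment, for display only
	fmt.Println(enc == s) // reports whether the encoded form survived parsing
}
```

The decoded value is useful for logging or showing to a user; the encoded string is the one to store and request.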

To detect whether the URL is HTML-escaped, you may check whether it contains a semicolon: the ; character is reserved in URLs (see RFC 1738), and HTML escape sequences contain one. So decode() may look like this:

func decode(s string) string {
    if strings.IndexByte(s, ';') >= 0 {
        s = html.UnescapeString(s)
    }
    return s
}
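A small sketch of how this heuristic behaves on a few inputs. The example.com URLs and the name unescapeIfHTML are made up for illustration. As far as I can tell, html.UnescapeString also accepts some legacy entity names without a trailing semicolon, so the guard is what keeps a query parameter such as &copy=2 intact here; a URL that contains both a literal semicolon and such a parameter would still be altered.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// unescapeIfHTML (hypothetical name) applies the same heuristic as decode():
// html.UnescapeString is only called when the input contains a semicolon.
func unescapeIfHTML(s string) string {
	if strings.IndexByte(s, ';') >= 0 {
		return html.UnescapeString(s)
	}
	return s
}

func main() {
	inputs := []string{
		"http://example.com/s?a=1&amp;b=2", // HTML-escaped ampersand
		"http://example.com/s?a=1&b=2",     // plain URL, no semicolon
		"http://example.com/s?a=1&copy=2",  // parameter named like an entity
	}
	for _, in := range inputs {
		fmt.Printf("%q -> %q\n", in, unescapeIfHTML(in))
	}
}
```

Only the first input changes; the other two have no semicolon and are returned untouched.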

If you're worried about malicious or invalid URLs, you may parse and re-encode the URL:

func decode(s string) (string, bool) {
    if strings.IndexByte(s, ';') >= 0 {
        s = html.UnescapeString(s)
    }
    u, err := url.ParseRequestURI(s)
    if err != nil {
        return "", false
    }
    return u.String(), true
}

Testing it:

fmt.Println(decode(`http//foo.bar`))
fmt.Println(decode(`http://foo.bar/doc?query=abc#first`))
fmt.Println(decode(`https://www.unsafe.com/a<b>c{d}e[f ]\g^`))

This will output (try it on the Go Playground):

 false
http://foo.bar/doc?query=abc#first true
https://www.unsafe.com/a%3Cb%3Ec%7Bd%7De%5Bf%20%5D%5Cg%5E true
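One caveat: the documentation of url.ParseRequestURI says the input is assumed not to have a #fragment suffix, so for a URL with a fragment and no query string the # ends up in the path and is percent-encoded on output. The following variant is a sketch that uses url.Parse instead. The name normalizeURL and the rule that only absolute http/https URLs with a host are accepted are my assumptions about what an RSS href should look like, not part of the original answer.

```go
package main

import (
	"errors"
	"fmt"
	"html"
	"net/url"
	"strings"
)

// normalizeURL (hypothetical helper) undoes HTML escaping, parses the result
// with url.Parse, which splits off the #fragment, and re-encodes it.
// Assumption: only absolute http/https URLs with a host are acceptable.
func normalizeURL(raw string) (string, error) {
	s := strings.TrimSpace(raw)
	if strings.IndexByte(s, ';') >= 0 {
		s = html.UnescapeString(s)
	}
	u, err := url.Parse(s)
	if err != nil {
		return "", err
	}
	if (u.Scheme != "http" && u.Scheme != "https") || u.Host == "" {
		return "", errors.New("not an absolute http(s) URL")
	}
	return u.String(), nil
}

func main() {
	for _, in := range []string{
		`http//foo.bar`,
		`http://example.com/s?a=1&amp;b=2#frag`,
		`https://www.unsafe.com/a<b>c{d}e[f ]\g^`,
	} {
		out, err := normalizeURL(in)
		fmt.Printf("%q -> %q (err: %v)\n", in, out, err)
	}
}
```

Because an already encoded URL parses and serializes back to the same string, calling normalizeURL on its own output should be harmless.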