Sitemap encoding problem

Posted on 2024-09-06 01:21:37

I'm having real trouble understanding the specification and guidelines on how to properly escape and encode a URL for submission in a sitemap.

In the sitemaps.org (entity escaping) examples, they have an example URL:

http://www.example.com/ümlat.php&q=name

Which when UTF-8 encoded ends up as (according to them):

http://www.example.com/%C3%BCmlat.php&q=name

However, when I try this (rawurlencode) on PHP I end up with:

http%3A%2F%2Fwww.example.com%2F%C3%BCmlat.php%26q%3Dname

I've more or less worked around this with the following snippet found on PHP.net:

$entities = array('%21', '%2A', '%27', '%28', '%29', '%3B', '%3A', '%40', 
    '%26', '%3D', '%2B', '%24', '%2C', '%2F', '%3F', '%23', '%5B', '%5D');
    
$replacements = array('!', '*', "'", "(", ")", ";", ":", "@", "&", "=", "+",
    "$", ",", "/", "?", "#", "[", "]");

$string = str_replace($entities, $replacements, rawurlencode($string));

but according to someone I spoke to (Kohana BDFM), this interpretation is wrong. Honestly, I'm so confused I don't even know what's right.

What's the correct way to encode a URL for use in the sitemap?

Relevant RFC 3986

Comments (1)

帥小哥 2024-09-13 01:21:37

The problem is that http://www.example.com/ümlat.php&q=name is not a valid URL.

(Source: RFC 1738, which is obsolete but serves its purpose here. RFC 3986 does allow more characters, but no harm is done by escaping characters that don't need escaping.)

httpurl        = "http://" hostport [ "/" hpath [ "?" search ]]
hpath          = hsegment *[ "/" hsegment ]
hsegment       = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
uchar          = unreserved | escape
unreserved     = alpha | digit | safe | extra
safe           = "$" | "-" | "_" | "." | "+"
extra          = "!" | "*" | "'" | "(" | ")" | ","
escape         = "%" hex hex
search         = *[ uchar | ";" | ":" | "@" | "&" | "=" ]

So every character must be escaped unless it is one of ;:@&=$-_.+!*'(), or an alphanumeric (0-9, a-z, A-Z), or part of an existing escape sequence (e.g. %A0 or, equivalently, %a0). The ? character can appear at most once. The / character can appear in the path portion, but not in the query string. The convention for encoding the other characters is to compute their UTF-8 representation and escape each byte of that sequence.
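
To make the UTF-8 convention concrete, here is a small sketch for the ü from the example URL. It assumes PHP 7 or later (for the "\u{...}" escape); the variable names are only illustrative. rawurlencode() works on bytes, so a UTF-8 string is escaped one byte at a time.

```php
<?php
// U+00FC (ü) written as a code point escape, so the example does not
// depend on the encoding of this source file (PHP 7+).
$char = "\u{00FC}";

// In UTF-8 the character is the two bytes 0xC3 0xBC.
$hex = strtoupper(bin2hex($char));

// rawurlencode() turns each of those bytes into a %XX escape.
$escaped = rawurlencode($char);

echo $hex, "\n";      // C3BC
echo $escaped, "\n";  // %C3%BC
```

This is where the %C3%BC in the sitemaps.org example comes from.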

Your algorithm should (assuming the host part is not a problem...):

  • extract the path part
  • extract the query string part
  • for each of those, look for invalid characters
  • encode those characters in UTF-8
  • pass the result to rawurlencode
  • replace the character in the URL with the result of rawurlencode
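
The steps above could be sketched roughly as follows. This is one possible reading, not a definitive implementation: encode_url_part() and encode_sitemap_url() are hypothetical helper names (not PHP built-ins), the input is assumed to already be a UTF-8 string, the host is passed through unchanged, and any #fragment is dropped. The allowed character set is taken from the RFC 1738 grammar quoted above.

```php
<?php
// Hypothetical helper: percent-encode only the bytes that the RFC 1738
// grammar does not allow in a path or a query string.
function encode_url_part(string $part, bool $isPath): string
{
    // Allowed literally: alphanumerics, "safe", "extra", and ; : @ & =
    // "/" separates path segments, so it is kept only in the path.
    $allowed = 'A-Za-z0-9$\-_.+!*\'(),;:@&=' . ($isPath ? '/' : '');

    // Match an existing %XX escape first, otherwise one disallowed byte.
    // No /u flag: the pattern works byte by byte on the UTF-8 string.
    $pattern = '~%[0-9A-Fa-f]{2}|[^' . $allowed . ']~';

    return preg_replace_callback($pattern, function (array $m): string {
        // Leave existing escapes alone; escape any other single byte.
        return strlen($m[0]) === 3 ? $m[0] : rawurlencode($m[0]);
    }, $part);
}

// Hypothetical helper: split an absolute URL and encode path and query.
function encode_sitemap_url(string $url): string
{
    // Groups: 1 = scheme://host[:port], 2 = path, 3 = query (optional).
    $re = '~^([a-z][a-z0-9+.\-]*://[^/?#]*)([^?#]*)(?:\?([^#]*))?~i';
    if (!preg_match($re, $url, $m)) {
        throw new InvalidArgumentException('Not an absolute URL: ' . $url);
    }

    $result = $m[1] . encode_url_part($m[2], true);
    if (isset($m[3])) {
        $result .= '?' . encode_url_part($m[3], false);
    }
    return $result;
}

$encoded = encode_sitemap_url("http://www.example.com/\u{00FC}mlat.php&q=name");
echo $encoded, "\n";
// http://www.example.com/%C3%BCmlat.php&q=name

// The sitemap protocol additionally requires XML entity escaping of the
// URL when it is written into the <loc> element.
echo htmlspecialchars($encoded, ENT_QUOTES | ENT_XML1, 'UTF-8'), "\n";
// http://www.example.com/%C3%BCmlat.php&amp;q=name
```

Unlike calling rawurlencode() on the whole URL, this leaves the scheme, the slashes and the characters the grammar permits untouched, so no post-hoc str_replace() table is needed.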