Sitemap encoding problem

Posted on 2024-09-06 01:21:37

I'm having real trouble understanding the specification and guidelines on how to properly escape and encode a URL for submission in a sitemap.

In the sitemaps.org (entity escaping) examples, they have an example URL:

http://www.example.com/ümlat.php&q=name

Which when UTF-8 encoded ends up as (according to them):

http://www.example.com/%C3%BCmlat.php&q=name

However, when I run it through rawurlencode in PHP, I end up with:

http%3A%2F%2Fwww.example.com%2F%C3%BCmlat.php%26q%3Dname

I've sort of beaten this by using this snippet found on PHP.net:

$entities = array('%21', '%2A', '%27', '%28', '%29', '%3B', '%3A', '%40', 
    '%26', '%3D', '%2B', '%24', '%2C', '%2F', '%3F', '%23', '%5B', '%5D');
    
$replacements = array('!', '*', "'", "(", ")", ";", ":", "@", "&", "=", "+",
    "$", ",", "/", "?", "#", "[", "]");

$string = str_replace($entities, $replacements, rawurlencode($string));

but according to someone I spoke to (Kohana BDFM), this interpretation is wrong. Honestly, I'm so confused I don't even know what's right.

What's the correct way to encode a URL for use in the sitemap?

Relevant RFC 3986

帥小哥 2024-09-13 01:21:37

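The problem is that http://www.example.com/ümlat.php&q=name is not a valid URL.

(Source: RFC 1738, which is obsolete but serves its purpose here. RFC 3986 does allow more characters, but no harm is done by escaping characters that don't need escaping.)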

httpurl        = "http://" hostport [ "/" hpath [ "?" search ]]
hpath          = hsegment *[ "/" hsegment ]
hsegment       = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
uchar          = unreserved | escape
unreserved     = alpha | digit | safe | extra
safe           = "$" | "-" | "_" | "." | "+"
extra          = "!" | "*" | "'" | "(" | ")" | ","
escape         = "%" hex hex
search         = *[ uchar | ";" | ":" | "@" | "&" | "=" ]

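So any character must be escaped unless it is one of ;:@&=$-_.+!*'(), (the final comma belongs to the set), an alphanumeric character (0-9a-zA-Z), or part of an escape sequence (e.g. %A0 or, equivalently, %a0). The ? character can appear at most once. The / character can appear in the path portion, but not in the query string. The convention for encoding the other characters is to compute their UTF-8 representation and escape that sequence.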
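To make the character rules concrete, here is a minimal sketch that checks a single path segment against the hsegment rule quoted from RFC 1738. The function name isValidHsegment is made up for illustration, and the check works byte by byte, so it assumes the input string is UTF-8.

```php
<?php
// Hypothetical helper (not from the RFC or from PHP itself):
// returns true if $segment consists only of characters allowed by
// RFC 1738's hsegment rule, or of %XX escape sequences.
function isValidHsegment(string $segment): bool
{
    // Allowed: alphanumerics, safe ($-_.+), extra (!*'(),), plus ;:@&=
    // and any "%" followed by two hex digits. No /u flag, so the match
    // is byte-wise and raw UTF-8 bytes such as those of "ü" are rejected.
    $pattern = '/^(?:[A-Za-z0-9$\-_.+!*\'(),;:@&=]|%[0-9A-Fa-f]{2})*\z/';
    return preg_match($pattern, $segment) === 1;
}

var_dump(isValidHsegment('%C3%BCmlat.php&q=name'));        // bool(true)
var_dump(isValidHsegment("\xC3\xBCmlat.php&q=name"));      // bool(false), raw "ü"
```

The same pattern also applies to the search (query string) rule, since the grammar allows the same characters there.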

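Your algorithm should (assuming the host part is not a problem...):

Extract the path portion
Extract the query string portion
For each of them, look for invalid characters
Encode those characters in UTF-8
Pass the result to rawurlencode
Replace the character in the URL with the result of rawurlencode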
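The steps listed above could be sketched roughly as follows. This is only an illustration under some assumptions: the function names escapeInvalidBytes and encodeSitemapUrl are made up, the input is assumed to be an absolute URL that is already UTF-8, the host is left untouched, and fragments are not handled.

```php
<?php
// Hypothetical helper: percent-escape every byte of $part that the
// RFC 1738 grammar does not allow. Existing %XX escapes are kept as-is.
function escapeInvalidBytes(string $part, bool $allowSlash): string
{
    $allowed = 'A-Za-z0-9$\-_.+!*\'(),;:@&=' . ($allowSlash ? '\/' : '');
    // Match a "%" that does not start a %XX escape, or any disallowed byte.
    $pattern = '/%(?![0-9A-Fa-f]{2})|[^%' . $allowed . ']/';
    return preg_replace_callback(
        $pattern,
        function (array $m): string {
            return rawurlencode($m[0]);
        },
        $part
    );
}

// Hypothetical helper: split the URL into "scheme://host", path and
// query string, then escape the path and the query separately.
function encodeSitemapUrl(string $url): string
{
    $split = '~^([a-z][a-z0-9+.\-]*://[^/?]*)(/[^?]*)?(?:\?(.*))?$~is';
    if (!preg_match($split, $url, $m)) {
        throw new InvalidArgumentException("Not an absolute URL: $url");
    }
    $result = $m[1];                                   // host assumed to be fine
    $result .= escapeInvalidBytes($m[2] ?? '', true);  // "/" is allowed in the path
    if (isset($m[3])) {
        $result .= '?' . escapeInvalidBytes($m[3], false); // but not in the query
    }
    return $result;
}

$encoded = encodeSitemapUrl("http://www.example.com/\xC3\xBCmlat.php&q=name");
echo $encoded, "\n";
// http://www.example.com/%C3%BCmlat.php&q=name

// The sitemap protocol additionally asks for XML entity escaping:
echo htmlspecialchars($encoded, ENT_QUOTES | ENT_XML1, 'UTF-8'), "\n";
// http://www.example.com/%C3%BCmlat.php&amp;q=name
```

Unlike calling rawurlencode on the whole URL, this only touches the bytes that are not allowed, so the scheme, the slashes in the path and characters such as & and = survive unchanged.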
