Sitemap encoding problem
I'm having real trouble understanding the specification and guidelines on how to properly escape and encode a URL for submission in a sitemap.
In the sitemaps.org (entity escaping) examples, they have an example URL:
http://www.example.com/ümlat.php&q=name
Which when UTF-8 encoded ends up as (according to them):
http://www.example.com/%C3%BCmlat.php&q=name
However, when I try this in PHP (with rawurlencode), I end up with:
http%3A%2F%2Fwww.example.com%2F%C3%BCmlat.php%26q%3Dname
I've sort of beaten this by using this function found on PHP.net:
$entities = array('%21', '%2A', '%27', '%28', '%29', '%3B', '%3A', '%40',
'%26', '%3D', '%2B', '%24', '%2C', '%2F', '%3F', '%23', '%5B', '%5D');
$replacements = array('!', '*', "'", "(", ")", ";", ":", "@", "&", "=", "+",
"$", ",", "/", "?", "#", "[", "]");
$string = str_replace($entities, $replacements, rawurlencode($string));
But according to someone I spoke to (Kohana BDFM), this interpretation is wrong. Honestly, I'm so confused that I don't even know what's right.
What's the correct way to encode a URL for use in the sitemap?
Relevant RFC 3986
Comments (1)
The problem is that
http://www.example.com/ümlat.php&q=name
is not a valid URL. (Source: RFC 1738, which is obsolete but serves its purpose here. RFC 3986 does allow more characters, but no harm is done by escaping characters that don't need escaping.)
So any character other than one of
;:@&=$-_.+!*'(),
a 0-9a-zA-Z character, or an escape sequence (e.g. %A0 or, equivalently, %a0) must be escaped. The ? character can appear at most once. The / character can appear in the path portion, but not in the query string. The convention for encoding the other characters is to compute their UTF-8 representation and escape that sequence.
Your algorithm should (assuming the host part is not a problem...):
rawurlencode
rawurlencode
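The rules above could be applied in PHP roughly as follows. This is only a minimal sketch under stated assumptions: the function names escapeDisallowed and encodeSitemapUrl are hypothetical, the input is assumed to be an absolute http(s) URL in UTF-8 with an ASCII host, and the allowed set is the RFC 1738 one quoted above rather than the RFC 3986 one. Existing %XX sequences are left alone, and every other disallowed byte is passed through rawurlencode.

```php
<?php
// Hypothetical helper: percent-encode every byte that is not allowed unescaped.
// $allowSlash is true for the path portion and false for the query string.
function escapeDisallowed(string $part, bool $allowSlash): string
{
    $allowed = '0-9A-Za-z;:@&=$\-_.+!*\'(),' . ($allowSlash ? '/' : '');
    // First alternative keeps existing escape sequences such as %A0 untouched.
    // Second alternative matches one disallowed byte at a time (no /u flag),
    // so a multi-byte UTF-8 character is escaped byte by byte.
    $pattern = '~%[0-9A-Fa-f]{2}|[^' . $allowed . ']~';
    return preg_replace_callback(
        $pattern,
        function (array $m): string {
            return strlen($m[0]) === 3 ? $m[0] : rawurlencode($m[0]);
        },
        $part
    );
}

// Hypothetical helper: escape the path and query string of an absolute URL.
function encodeSitemapUrl(string $url): string
{
    // Only the first "?" separates the path from the query string.
    $parts = explode('?', $url, 2);
    // Assumption: scheme and host need no escaping and are copied as they are.
    if (!preg_match('~^(https?://[^/]+)(/.*)?$~', $parts[0], $m)) {
        throw new InvalidArgumentException('Expected an absolute http(s) URL');
    }
    $encoded = $m[1] . escapeDisallowed($m[2] ?? '', true);
    if (isset($parts[1])) {
        $encoded .= '?' . escapeDisallowed($parts[1], false);
    }
    return $encoded;
}

// Usage: URL-escape first, then entity-escape for the XML sitemap file.
$escaped = encodeSitemapUrl('http://www.example.com/ümlat.php&q=name');
$forXml  = htmlspecialchars($escaped, ENT_QUOTES | ENT_XML1, 'UTF-8');
```

The htmlspecialchars call is a separate step because a sitemap is an XML file, so characters such as & also need entity escaping after the URL itself has been percent-encoded.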