Which should I use: urlparse or urlsplit?

Posted 2024-10-27 01:54:45

Which URL parsing function pair should I be using and why?

Comments (3)

沙与沫 2024-11-03 01:54:45

Directly from the docs you linked yourself:

urllib.parse.urlsplit(urlstring, scheme='', allow_fragments=True)
This is similar to urlparse(), but does not split the params from the URL. This should generally be used instead of urlparse() if the more recent URL syntax allowing parameters to be applied to each segment of the path portion of the URL (see RFC 2396) is wanted.
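
To see the difference concretely, both functions can be run on the same URL. This is a minimal sketch: the URL is made up for illustration, and only the standard-library `urllib.parse` calls are real.

```python
from urllib.parse import urlparse, urlsplit

# Hypothetical URL with ";" parameters on the last path segment.
url = "http://example.com/dir/file;type=a?x=1#top"

parsed = urlparse(url)  # 6 fields: params is split out of the path
split = urlsplit(url)   # 5 fields: ";type=a" stays inside the path

print(parsed.path, parsed.params)  # /dir/file type=a
print(split.path)                  # /dir/file;type=a
```

If the code never looks at `params`, the two results carry the same information, and `urlsplit()` simply leaves the path untouched.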

铜锣湾横着走 2024-11-03 01:54:45

Since the documentation you linked doesn't include an example with a non-empty params, I was also confused until I found this:

>>> import urllib.parse
>>> urllib.parse.urlparse("http://example.com/pa/th;param1=foo;param2=bar?name=val#frag")
ParseResult(scheme='http', netloc='example.com', path='/pa/th', params='param1=foo;param2=bar', query='name=val', fragment='frag')

(Some history because I got nerd-sniped.)

I'd never heard of URL "parameters" other than URL path components (e.g. /user/213/settings) or query parameters (e.g. /user?id=213), and I think this params component is essentially obsolete.

In the beginning, RFC 1738 defined the HTTP URL to never allow ; in the path:

http://<host>:<port>/<path>?<searchpart>

Within the <path> and <searchpart> components, "/", ";", "?" are reserved.

; was reserved with special meaning in other schemes, like the ftp:// url-path:

<cwd1>/<cwd2>/.../<cwdN>/<name>;type=<typecode>

Apparently in 1995, RFC 1808 defined URL params as a top-level component between path and query:

<scheme>://<net_loc>/<path>;<params>?<query>#<fragment>

Then in 1998, RFC 2396 defined URIs as having adjacent top-level components path and query:

<scheme>://<authority><path>?<query>

where the path is defined as multiple path_segments that each could include param:

path          = [ abs_path | opaque_part ]
abs_path      = "/"  path_segments
path_segments = segment *( "/" segment )
segment       = *pchar *( ";" param )
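
To illustrate the per-segment grammar: `urlparse()` only separates parameters from the last path segment, so parameters on earlier segments remain in `path`. The helper `split_segment_params` below is hypothetical (it is not part of `urllib.parse`), and the URL is made up. It is a rough sketch of how an application could read RFC 2396 style parameters out of every segment.

```python
from urllib.parse import urlparse, urlsplit

def split_segment_params(path):
    """Hypothetical helper: split each path segment into (name, [params])."""
    result = []
    for segment in path.split("/"):
        name, *params = segment.split(";")
        result.append((name, params))
    return result

# Made-up URL with parameters on two different segments.
url = "http://example.com/a;x=1/b;y=2;z=3?q=1"

# urlparse() strips parameters from the last segment only.
last_only = urlparse(url)
print(last_only.path, last_only.params)  # /a;x=1/b y=2;z=3

# urlsplit() leaves the path intact, so every segment can be examined.
segments = split_segment_params(urlsplit(url).path)
print(segments)  # [('', []), ('a', ['x=1']), ('b', ['y=2', 'z=3'])]
```

The leading empty segment comes from the path starting with "/".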

Finally in 2005, RFC 3986 obsoleted RFC 1808 and 2396, defining URI similarly to RFC 2396:

URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ] 

hier-part   = "//" authority path-abempty
            / path-absolute
            / path-rootless
            / path-empty

And the special syntax of ;params is considered an opaque part of the URI syntax that may be specific to the HTTP(S) scheme or just some specific implementation:

Aside from dot-segments in hierarchical paths, a path segment is considered opaque by the generic syntax. URI producing applications often use the reserved characters allowed in a segment to delimit scheme-specific or dereference-handler-specific subcomponents. For example, the semicolon (";") and equals ("=") reserved characters are often used to delimit parameters and parameter values applicable to that segment. The comma (",") reserved character is often used for similar purposes. For example, one URI producer might use a segment such as "name;v=1.1" to indicate a reference to version 1.1 of "name", whereas another might use a segment such as "name,1.1" to indicate the same. Parameter types may be defined by scheme-specific semantics, but in most cases the syntax of a parameter is specific to the implementation of the URI's dereferencing algorithm.

别靠近我心 2024-11-03 01:54:45

As the documentation says:
urlparse.urlparse (urllib.parse.urlparse in Python 3) returns a 6-tuple (with an additional params element)
urlparse.urlsplit (urllib.parse.urlsplit in Python 3) returns a 5-tuple

Attribute | Index | Value                            | Value if not present
params    | 3     | Parameters for last path element | empty string
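
The tuple shapes can be checked directly. A small sketch using the Python 3 module name and an arbitrary example URL (an assumption, not taken from the docs):

```python
from urllib.parse import urlparse, urlsplit

url = "http://example.com/path;v=1.1?k=v"  # arbitrary example URL

six = urlparse(url)
five = urlsplit(url)

print(len(six), len(five))  # 6 5
print(six[3])               # v=1.1  (params sits at index 3 of the 6-tuple)
print(five[3])              # k=v    (index 3 of the 5-tuple is the query)

# Both results reassemble to the original URL.
print(six.geturl() == five.geturl() == url)  # True
```

Code that indexes the result positionally therefore has to know which function produced it; using attribute names such as `.query` avoids that.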

FYI: [RFC 2396](https://www.rfc-editor.org/rfc/rfc2396.html#appendix-C) says this about the _parameter_ component of the URL specification:
> Extensive testing of current client applications demonstrated that
the majority of deployed systems do not use the ";" character to
indicate trailing parameter information, and that the presence of a
semicolon in a path segment does not affect the relative parsing of
that segment. Therefore, parameters have been removed as a separate
component and may now appear in any path segment. Their influence
has been removed from the algorithm for resolving a relative URI
reference.
