Jsoup.parse() vs. Jsoup.parse() - or how does URL detection work in Jsoup?

Posted on 2024-11-30 23:47:19

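Jsoup has 2 html parse() methods:

parse(String html) - "As no base URI is specified, absolute URL detection relies on the HTML including a <base href> tag."
parse(String html, String baseUri) - "The URL where the HTML was retrieved from. Used to resolve relative URLs to absolute URLs, that occur before the HTML declares a <base href> tag."

I am having a difficult time understanding the meaning of the difference between the two:

In the 2nd parse() version, what does "resolve relative URLs to absolute URLs, that occur before the HTML declares a <base href> tag" mean? What if a <base href> tag never occurs in the page?
What is the purpose of absolute URL detection? Why does Jsoup need to find the absolute URL?
Lastly but most importantly: Is baseUri the full URL of the HTML page (as phrased in the original documentation), or is it the base URL of the HTML page?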

笛声青案梦长安 2024-12-07 23:47:19

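Lastly but most importantly: Is baseUri the full URL of the HTML page (as phrased in the original documentation), or is it the base URL of the HTML page?

The former. If it were the latter, the documentation would be lying. The baseUri is not to be confused with <base href>.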
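
To see why the full page URL is a perfectly good base, here is a minimal sketch of standard relative-reference resolution using java.net.URI from the JDK. It does not call jsoup, and jsoup's internal resolution may differ in detail; the class name, page URL and paths are made up for illustration.

```java
import java.net.URI;

class RelativeUrlDemo {

    // Resolves a (possibly relative) reference against the full URL of the
    // page it appeared on. The last path segment of the page URL ("page.html")
    // is dropped during resolution, so no separate "base URL" is required.
    static String resolve(String pageUrl, String reference) {
        return URI.create(pageUrl).resolve(reference).toString();
    }

    public static void main(String[] args) {
        // Hypothetical page URL, used only for this illustration.
        String pageUrl = "http://example.com/docs/page.html";

        System.out.println(resolve(pageUrl, "other.html"));                 // sibling document
        System.out.println(resolve(pageUrl, "/img/logo.png"));              // root-relative path
        System.out.println(resolve(pageUrl, "http://other.example.org/x")); // already absolute
    }
}
```

A reference such as other.html resolves next to the page, under /docs/, while a reference that is already absolute is returned unchanged.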

The base URI is used by, among others, Element#absUrl(), so that you can retrieve the (intended) absolute URL of an <a href>, <img src>, <link href>, <script src>, etc. For example:

for (Element link : document.select("a")) {
    System.out.println(link.absUrl("href"));
}

This is very useful if you want to download and/or parse the linked resources as well.
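
For a self-contained version of that loop, the sketch below parses a small HTML string with the two-argument parse() and collects the absolute URL of every link. It assumes jsoup is on the classpath; the class name, the HTML snippet and the example.com URLs are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class AbsUrlDemo {

    // Parses the HTML with an explicit base URI and returns the absolute URL
    // of every <a href> in document order.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document document = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : document.select("a")) {
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical markup: one relative, one root-relative, one absolute link.
        String html = "<a href='guide/intro.html'>Intro</a>"
                + "<a href='/about'>About</a>"
                + "<a href='http://other.example.org/x'>Elsewhere</a>";

        // Hypothetical URL the HTML is assumed to have been retrieved from.
        String baseUri = "http://example.com/docs/page.html";

        for (String url : absoluteLinks(html, baseUri)) {
            System.out.println(url);
        }
    }
}
```

Calling link.attr("href") instead would return the attribute exactly as written in the markup, relative or not.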

In the 2nd parse() version, what does "resolve relative URLs to absolute URLs, that occur before the HTML declares a <base href> tag" mean? What if a <base href> tag never occurs in the page?

Some (poor) websites may have declared a <link> or <script> with a relative URL before the <base> tag. Or, if there is no <base> tag at all, the given baseUri is used for resolving the relative URLs of the entire document.
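
The three situations can be compared side by side in a rough sketch, again assuming jsoup is on the classpath and using made-up markup and URLs. It deliberately places the link after the <base> tag; how an element that appears before <base> is resolved is not shown here, since that may depend on the jsoup version.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class BaseHrefDemo {

    // Absolute URL of the first link when NO base URI is passed to parse().
    // Assumes the HTML contains at least one <a> element.
    static String firstLinkWithoutBaseUri(String html) {
        Document document = Jsoup.parse(html);
        Element link = document.select("a").first();
        return link.absUrl("href");
    }

    // Absolute URL of the first link when a base URI IS passed to parse().
    // Assumes the HTML contains at least one <a> element.
    static String firstLinkWithBaseUri(String html, String baseUri) {
        Document document = Jsoup.parse(html, baseUri);
        Element link = document.select("a").first();
        return link.absUrl("href");
    }

    public static void main(String[] args) {
        // Hypothetical pages: one declares <base href>, the other does not.
        String withBaseTag = "<html><head><base href='http://example.com/base/'></head>"
                + "<body><a href='x.html'>x</a></body></html>";
        String withoutBaseTag = "<html><head></head>"
                + "<body><a href='x.html'>x</a></body></html>";

        // <base href> present, no baseUri: resolved against the <base href>.
        System.out.println(firstLinkWithoutBaseUri(withBaseTag));

        // No <base href>, no baseUri: absUrl() has nothing to resolve against
        // and returns an empty string.
        System.out.println("[" + firstLinkWithoutBaseUri(withoutBaseTag) + "]");

        // No <base href>, but a baseUri: resolved against the given baseUri.
        System.out.println(firstLinkWithBaseUri(withoutBaseTag, "http://example.com/docs/page.html"));
    }
}
```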

What is the purpose of absolute URL detection? Why does Jsoup need to find the absolute URL?

In order to return the right URL from Element#absUrl(). This is purely for the end user's convenience. Jsoup doesn't need it in order to successfully parse the HTML on its own.