Parsing HTML for web links: some links show :80 in the URL. Why is this?

Posted on 2024-12-25 11:51:46

I'm using jsoup to parse HTML pages and extract their links. I tested it on a few web pages, one of which was www.support.xbox.com. Some of the links show a :80 in the URL, e.g.

http://support.xbox.com:80/en-US/xbox-360/disc-drive/disc-replacement-program
http://support.xbox.com:80/en-US/xbox-live/marketplace-and-purchasing/play-xbox-live-content
http://support.xbox.com:80/en-US/games/call-of-duty/call-of-duty-elite-subscription

I presume the :80 indicates the port being used, but why does it appear in these links?

I'm not sure whether my code is needed, but here it is anyway:

package urltest;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class UrlTest {

    public static void main(String[] args) throws IOException {
        String url = "http://support.xbox.com";
        print("Fetching %s...", url);

        // timeout(0) disables the timeout, so the request may wait indefinitely
        Document doc = Jsoup.connect(url).timeout(0).get();
        // Select every anchor with an href inside the main content div
        Elements links = doc.select("div#BodyContent a[href]");

        print("\nLinks: (%d)", links.size());
        for (Element link : links) {
            // "abs:href" resolves the href against the document's base URI
            print("%s  (%s)", link.attr("abs:href"), trim(link.text(), 35));
        }
    }

    // Formats the message with the given arguments and prints it on its own line
    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

    // Shortens s to at most width characters, ending with "." when truncated
    private static String trim(String s, int width) {
        if (s.length() > width) {
            return s.substring(0, width - 1) + ".";
        }
        return s;
    }
}

Comments (2)

风蛊 2025-01-01 11:51:46

If you open a browser and enter

www.stackoverflow.com

or

www.stackoverflow.com:80

you will get the same result, because port 80 is the default port for HTTP: when a URL omits the port, the browser implicitly uses :80.

Why some websites include it and others don't depends on many factors. Either way, it won't affect your program.
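
If the links are later compared or stored as strings, the two forms look different even though they address the same resource. A minimal sketch of one way to normalize them, using only `java.net.URI` from the standard library; the class name `DefaultPortDemo` and the method `stripDefaultPort` are hypothetical names chosen for illustration, and the sketch assumes absolute http/https URLs:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class DefaultPortDemo {

    // Hypothetical helper: returns the URL without its port when that port is
    // the scheme's default (80 for http, 443 for https). Any other URL is
    // returned unchanged.
    public static String stripDefaultPort(String url) {
        try {
            URI uri = new URI(url);
            String scheme = uri.getScheme();
            int port = uri.getPort(); // -1 when the URL has no explicit port
            boolean isDefault = ("http".equalsIgnoreCase(scheme) && port == 80)
                    || ("https".equalsIgnoreCase(scheme) && port == 443);
            if (!isDefault) {
                return url;
            }
            // Rebuild the URI with port -1, which this constructor treats as "no port".
            // Components are passed decoded, so unusual escapes (such as an
            // encoded slash in the path) may not round-trip exactly.
            return new URI(scheme, uri.getUserInfo(), uri.getHost(), -1,
                    uri.getPath(), uri.getQuery(), uri.getFragment()).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(stripDefaultPort("http://support.xbox.com:80/en-US/"));
        System.out.println(stripDefaultPort("http://support.xbox.com/en-US/"));
        System.out.println(stripDefaultPort("http://support.xbox.com:8080/en-US/"));
    }
}
```

Applying such a helper to each `abs:href` value before printing or comparing would make the `:80` and port-less forms identical, while leaving non-default ports such as `:8080` untouched.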

尐籹人 2025-01-01 11:51:46

Look at the redirect response headers:

HTTP/1.1 302 Moved Temporarily
Cache-Control: private
Location: http://support.xbox.com:80/en-US/
Server: Microsoft-IIS/7.0
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Content-Length: 0
Date: Sun, 08 Jan 2012 00:19:43 GMT
Connection: keep-alive

Notice that the URL in the Location: field contains the port number. This is why you don't see the port number when you visit the page directly, but you do see it when you reach the page through the redirect.

As to why the site issued a redirect with an explicit :80 port number, which should be the default anyway, you'd have to ask the site owner.
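
To check this kind of redirect yourself, something along the following lines may help. It is only a sketch using the standard `java.net` classes: the class name `RedirectPortCheck` and its two methods are hypothetical, and the site may no longer respond the way it did when the headers above were captured. By default the program runs offline on the Location value quoted above; pass a URL as the first argument to query a live server instead.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class RedirectPortCheck {

    // Returns the port written explicitly in a Location header value,
    // or -1 when the value does not spell out a port.
    public static int explicitPort(String location) {
        return URI.create(location).getPort();
    }

    // Requests the URL without following redirects and returns the raw
    // Location header, or null when the response has no such header.
    public static String fetchLocation(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setInstanceFollowRedirects(false);
        try {
            return conn.getHeaderField("Location");
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Offline by default: uses the header value quoted in the answer above.
        String location = args.length > 0
                ? fetchLocation(args[0])
                : "http://support.xbox.com:80/en-US/";
        if (location == null) {
            System.out.println("No Location header in the response");
            return;
        }
        System.out.println(location + " -> explicit port " + explicitPort(location));
    }
}
```

A result of -1 from `explicitPort` means the server left the port out, so the client falls back to the scheme's default.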
