Parsing HTML for web links, some links show :80 in the URL. Why is this?
I'm using JTidy to parse HTML pages and extract their links. I tested it on a few web pages, one of which was www.support.xbox.com. Some of the links show a :80 in the URL, e.g.
http://support.xbox.com:80/en-US/xbox-360/disc-drive/disc-replacement-program
http://support.xbox.com:80/en-US/xbox-live/marketplace-and-purchasing/play-xbox-live-content
http://support.xbox.com:80/en-US/games/call-of-duty/call-of-duty-elite-subscription
I presume the :80 indicates the port being used, but why does it appear in these links?
I'm not sure whether my code is needed, but here it is anyway:
    package urltest;

    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class UrlTest {

        public static void main(String[] args) throws IOException {
            String url = "http://support.xbox.com";
            print("Fetching %s...", url);

            // timeout(0) disables the timeout, so the request waits indefinitely.
            Document doc = Jsoup.connect(url).timeout(0).get();
            Elements links = doc.select("div#BodyContent a[href]");

            print("\nLinks: (%d)", links.size());
            for (Element link : links) {
                // "abs:href" resolves the href against the document's base URI.
                print("%s (%s)", link.attr("abs:href"), trim(link.text(), 35));
            }
        }

        // Formats the message and prints it on its own line.
        private static void print(String msg, Object... args) {
            System.out.println(String.format(msg, args));
        }

        // Shortens s to at most width characters, ending with "." when cut.
        private static String trim(String s, int width) {
            if (s.length() > width) {
                return s.substring(0, width - 1) + ".";
            }
            return s;
        }
    }
Comments (2)
If you enter the address in a browser with or without an explicit :80, you will get the same result, because 80 is the default port for HTTP requests and the browser assumes it whenever no port is given.
Why some websites include it in their links and others do not depends on many factors. Either way, both forms point to the same resource, so it won't affect your program.
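Since both forms name the same resource, a program that compares or de-duplicates extracted links may want to drop the redundant port first. Here is a minimal sketch using only `java.net.URI`; the class name `DefaultPortDemo` and the helpers `defaultPortFor` and `stripDefaultPort` are made up for this illustration and are not part of jsoup or JTidy, and only the `http` and `https` schemes are handled.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class DefaultPortDemo {

    // Hypothetical helper: default port for the two schemes handled here, -1 otherwise.
    static int defaultPortFor(String scheme) {
        if ("http".equalsIgnoreCase(scheme)) return 80;
        if ("https".equalsIgnoreCase(scheme)) return 443;
        return -1;
    }

    // Returns the URL without its port when that port is the scheme's default;
    // any other URL is returned unchanged.
    static String stripDefaultPort(String url) {
        URI uri = URI.create(url);
        int port = uri.getPort(); // -1 when the URL names no port
        if (port == -1 || port != defaultPortFor(uri.getScheme())) {
            return url;
        }
        try {
            // Rebuild the URI with port -1 ("undefined"). The components are
            // decoded and re-quoted, so unusual percent-escapes may change.
            return new URI(uri.getScheme(), uri.getUserInfo(), uri.getHost(), -1,
                    uri.getPath(), uri.getQuery(), uri.getFragment()).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(stripDefaultPort(
                "http://support.xbox.com:80/en-US/xbox-360/disc-drive/disc-replacement-program"));
    }
}
```

In the question's loop, this could be applied to the value returned by `link.attr("abs:href")` before printing or storing it.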
If you look at the headers of the redirect response, notice that the URL in the Location: field contains the port number. This is why you don't get the port number when visiting the page directly, but do get it when you reach the page through the redirect. As to why the site issues a redirect with an explicit :80 port number, which should be the default anyway, you'd have to ask the site owner.
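To see how a port in the redirect target can end up in every extracted link, here is a rough sketch of relative-link resolution using `java.net.URI`. It assumes, as I understand jsoup's behaviour, that `abs:href` resolves each href against the URL the document was finally loaded from. The base URL `http://support.xbox.com:80/en-US/` is a hypothetical stand-in for the redirect's Location value, which is not shown here, and `RedirectBaseDemo` and `absolutize` are names invented for the example.

```java
import java.net.URI;

final class RedirectBaseDemo {

    // Resolves an href against a base URL, the way a relative link on a page
    // is turned into an absolute one.
    static String absolutize(String baseUrl, String href) {
        return URI.create(baseUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String href = "/en-US/xbox-360/disc-drive/disc-replacement-program";

        // Base as it might look after a redirect that names the port explicitly
        // (hypothetical value): the port carries over into the resolved link.
        System.out.println(absolutize("http://support.xbox.com:80/en-US/", href));

        // Base without an explicit port: the resolved link has no port either.
        System.out.println(absolutize("http://support.xbox.com/en-US/", href));
    }
}
```

Under these assumptions the relative links themselves never contain :80; they inherit it from the base URL they are resolved against.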