Error resolving Wikipedia URLs containing Unicode characters with Java URL
I'm having trouble fetching Wikipedia URLs that include Unicode!
Given a page title like: 1992\u201393_UE_Lleida_seasonnow
Just the plain URL ...
http://en.wikipedia.org/wiki/1992\u201393_UE_Lleida_seasonnow
Using URLEncoder (set to UTF-8) ...
http://en.wikipedia.org/wiki/1992%5Cu201393_UE_Lleida_seasonnow
When I try to resolve either URL, I get nothing. If I copy the URLs into my browser, I get nothing either; it's only if I actually copy the Unicode character in that I get the page.
Does Wikipedia have some strange way of encoding Unicode in URLs? Or am I just doing something dumb?
Here's the code I'm using:
URL url = new URL("http://en.wikipedia.org/wiki/" + x);
System.out.println("trying " + url);
// Attempt to open the wiki page
InputStream is;
try {
    is = url.openStream();
} catch (Exception e) {
    return null;
}
Comments (4)
The correct URI is
http://en.wikipedia.org/wiki/2009%E2%80%9310_UE_Lleida_season
Many browsers display literals instead of percent-encoded escape sequences. This is considered to be more user-friendly. However, correctly encoded URIs must use percent encoding for characters not permitted in the path part.
The URI class can help you with such sequences:
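The code that originally followed this sentence is missing from this copy. As a minimal sketch of what using java.net.URI for this could look like (the class name WikiUriExample and the helper toAsciiUri are hypothetical, not taken from the answer), the multi-argument constructor accepts the raw path and toASCIIString() percent-encodes the non-ASCII characters as UTF-8 bytes:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class WikiUriExample {
    // Hypothetical helper: builds an ASCII-only http URI from a host and a raw path.
    static String toAsciiUri(String host, String path) {
        try {
            // The multi-argument constructor quotes characters that are illegal in a path.
            URI uri = new URI("http", host, path, null);
            // toASCIIString() percent-encodes remaining non-ASCII characters as UTF-8.
            return uri.toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // \u2013 is the en dash; the digits "10" that follow are ordinary characters.
        System.out.println(toAsciiUri("en.wikipedia.org", "/wiki/2009\u201310_UE_Lleida_season"));
        // prints http://en.wikipedia.org/wiki/2009%E2%80%9310_UE_Lleida_season
    }
}
```

Note that this assumes the string holds the actual en dash character, not the six literal characters backslash, u, 2, 0, 1, 3.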
You can read more about URI encoding here.
It's not really strange; it's standard use of IRIs. The IRI:
http://en.wikipedia.org/wiki/2009–10_UE_Lleida_season
which includes a Unicode en-dash, is equivalent to the URI:
http://en.wikipedia.org/wiki/2009%E2%80%9310_UE_Lleida_season
You can include the IRI form in links and it will work in modern browsers. But many network libraries—including Java's, along with older browsers—require ASCII-only URIs. (Modern browsers will still show the pretty IRI version in the address bar, even if you linked to it with the encoded URI version.)
To convert an IRI to a URI in general, you use the IDN algorithm on the hostname, and URL-encode any other non-ASCII characters as UTF-8 bytes. In your case, it should be:
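The snippet that belonged here did not survive in this copy. Based on the surrounding text, which talks about URLEncoder and replacing + with %20, a sketch along these lines seems to be what was meant. The class name WikiPathEncoder and the method wikiUrl are made up for illustration, and the title is assumed to contain the real Unicode character. The hostname here is already ASCII, so no IDN conversion is needed:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class WikiPathEncoder {
    // Hypothetical helper: turns a page title into an ASCII-only Wikipedia URL.
    static String wikiUrl(String title) {
        try {
            // URLEncoder produces form encoding, where a space becomes "+".
            // In a path segment a space must be "%20", so swap it afterwards.
            // A literal "+" in the title is already encoded as %2B at this point.
            String path = URLEncoder.encode(title, "UTF-8").replace("+", "%20");
            return "http://en.wikipedia.org/wiki/" + path;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(wikiUrl("1992\u201393_UE_Lleida_season"));
        // prints http://en.wikipedia.org/wiki/1992%E2%80%9393_UE_Lleida_season
    }
}
```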
Note: replacing + with %20 is necessary to make values of x that contain spaces work. URLEncoder does application/x-www-form-urlencoded encoding, as used in query strings. But in a URL path segment like this, the "+ means space" rule does not apply. Spaces in paths must be encoded with normal URL encoding, as %20.
Then again, in the specific case of Wikipedia, spaces are replaced with underscores for readability, so you'd probably be better off replacing "+" with "_" directly. The %20 version will still work because Wikipedia redirects from there to the underscore version.
Here's a simple algorithm for encoding URLs that use Unicode so that you can use HttpURLConnection to retrieve them:
The algorithm was written using these answers on string splitting and detecting Unicode characters.
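The answer's code is not preserved in this copy, so the following is only a sketch of one way such an algorithm might work, not the original. It walks the string one code point at a time, copies ASCII through unchanged, and percent-encodes everything else as UTF-8 bytes. The names UnicodeUrlEncoder and encodeNonAscii are hypothetical:

```java
import java.nio.charset.StandardCharsets;

public class UnicodeUrlEncoder {
    // Hypothetical helper: percent-encodes only the non-ASCII characters of a URL.
    static String encodeNonAscii(String url) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < url.length()) {
            int cp = url.codePointAt(i);
            int len = Character.charCount(cp); // 2 for surrogate pairs, else 1
            if (cp < 128) {
                // ASCII is copied as-is, so existing escapes and delimiters survive.
                out.append((char) cp);
            } else {
                // Encode the code point as UTF-8 and emit one %XX per byte.
                byte[] bytes = url.substring(i, i + len).getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    out.append(String.format("%%%02X", b & 0xFF));
                }
            }
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeNonAscii("http://en.wikipedia.org/wiki/2009\u201310_UE_Lleida_season"));
        // prints http://en.wikipedia.org/wiki/2009%E2%80%9310_UE_Lleida_season
    }
}
```

The result can then be passed to new URL(...) and opened with HttpURLConnection. This sketch does not touch ASCII characters that are illegal in a URL, such as spaces, so those would still need separate handling.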
Here's a simpler way of encoding the URL in Chi's answer:
See this answer for clarification.
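The code for this answer is also missing here. One plausible shorter approach, offered as an assumption and not as the answerer's actual code, is to let java.net.URL split the string into its parts and hand those to the seven-argument java.net.URI constructor. The names UrlToAsciiUri and toAscii are hypothetical:

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class UrlToAsciiUri {
    // Hypothetical helper: converts a full URL string with Unicode into ASCII form.
    static String toAscii(String spec) {
        try {
            // URL only splits the string into components; it does not encode anything.
            URL url = new URL(spec);
            // The multi-argument URI constructor quotes illegal characters per component.
            URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(),
                    url.getPort(), url.getPath(), url.getQuery(), url.getRef());
            // toASCIIString() percent-encodes the remaining non-ASCII characters.
            return uri.toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toAscii("http://en.wikipedia.org/wiki/2009\u201310_UE_Lleida_season"));
        // prints http://en.wikipedia.org/wiki/2009%E2%80%9310_UE_Lleida_season
    }
}
```

One caveat: the multi-argument URI constructors always quote the percent character, so input that is already percent-encoded would be encoded a second time. The new URL(String) constructor is also deprecated in recent Java releases, although it still works.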