Error resolving a Wikipedia URL containing Unicode characters with Java's URL class

Posted on 2024-11-13 08:08:21

I'm having trouble getting wikipedia urls including unicode!

Given a page title like: 1992\u201393_UE_Lleida_seasonnow

Just plain url ...
http://en.wikipedia.org/wiki/1992\u201393_UE_Lleida_seasonnow

Using URLEncoder (set to UTF-8) ....
http://en.wikipedia.org/wiki/1992%5Cu201393_UE_Lleida_seasonnow

When I try to resolve either URL, I get nothing. If I copy the URLs into my browser, I get nothing either; the page only loads if I paste in the actual Unicode character.

Does Wikipedia have some strange way to encode Unicode in URLs? Or am I just doing something dumb?

Here's the code I'm using:

URL url = new URL("http://en.wikipedia.org/wiki/" + x);
System.out.println("trying " + url);

// Attempt to open the wiki page
InputStream is;
try {
    is = url.openStream();
} catch (Exception e) {
    return null;
}


Comments (4)

带刺的爱情 2024-11-20 08:08:21

The correct URI is http://en.wikipedia.org/wiki/2009%E2%80%9310_UE_Lleida_season.

Many browsers display literals instead of percent-encoded escape sequences. This is considered to be more user-friendly. However, correctly encoded URIs must use percent encoding for characters not permitted in the path part:

   path          = path-abempty    ; begins with "/" or is empty
                 / path-absolute   ; begins with "/" but not "//"
                 / path-noscheme   ; begins with a non-colon segment
                 / path-rootless   ; begins with a segment
                 / path-empty      ; zero characters
   path-abempty  = *( "/" segment )
   path-absolute = "/" [ segment-nz *( "/" segment ) ]
   path-noscheme = segment-nz-nc *( "/" segment )
   path-rootless = segment-nz *( "/" segment )
   path-empty    = 0<pchar>
   segment       = *pchar
   segment-nz    = 1*pchar
   segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                 ; non-zero-length segment without any colon ":"
   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
   pct-encoded   = "%" HEXDIG HEXDIG
   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

The URI class can help you with such sequences:

  • Characters in the other category are permitted wherever RFC 2396 permits escaped octets, that is, in the user-information, path, query, and fragment components, as well as in the authority component if the authority is registry-based. This allows URIs to contain Unicode characters beyond those in the US-ASCII character set.
String literal = "http://en.wikipedia.org/wiki/1992\u201393_UE_Lleida_seasonnow";
URI uri = new URI(literal);
System.out.println(uri.toASCIIString());

You can read more about URI encoding here.
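
One caveat about the snippet above: in Java source code the compiler turns \u2013 into a real en-dash, so the literal already holds the Unicode character. The %5Cu2013 in the question's URLEncoder output suggests that the title string held a literal backslash followed by u2013 at runtime instead. If that is the case (an assumption, since the question does not show where x comes from), the escape has to be decoded before the URI is built. A minimal sketch, where the class name and the unescape helper are made up for illustration:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedTitleDemo {
    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern UNICODE_ESCAPE =
            Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Hypothetical helper: replaces each textual escape with the character it names.
    static String unescape(String s) {
        Matcher m = UNICODE_ESCAPE.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // The doubled backslash keeps the escape as plain text, as it would
        // arrive from a file or a web API.
        String raw = "2009\\u201310_UE_Lleida_season";
        String title = unescape(raw);
        URI uri = new URI("http://en.wikipedia.org/wiki/" + title);
        // prints http://en.wikipedia.org/wiki/2009%E2%80%9310_UE_Lleida_season
        System.out.println(uri.toASCIIString());
    }
}
```

This only handles the four-digit form; any other escape syntax the data source uses would need its own handling.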

孤凫 2024-11-20 08:08:21

Does wikipedia have some strange way to encode unicode in urls?

It's not really strange, it's standard use of IRIs. The IRI:

http://en.wikipedia.org/wiki/2009–10_UE_Lleida_season

which includes a Unicode en-dash, is equivalent to the URI:

http://en.wikipedia.org/wiki/2009%E2%80%9310_UE_Lleida_season

You can include the IRI form in links and it will work in modern browsers. But many network libraries—including Java's, along with older browsers—require ASCII-only URIs. (Modern browsers will still show the pretty IRI version in the address bar, even if you linked to it with the encoded URI version.)

To convert an IRI to a URI in general, you use the IDN algorithm on the hostname, and URL-encode any other non-ASCII characters as UTF-8 bytes. In your case, it should be:

String urlencoded= URLEncoder.encode(x, "utf-8").replace("+", "%20");
URL url= new URL("http://en.wikipedia.org/wiki/"+urlencoded);

Note: replacing + with %20 is necessary for values of x that contain spaces. URLEncoder performs application/x-www-form-urlencoded encoding, as used in query strings. But in a path segment like this, the +-means-space rule does not apply. Spaces in paths must be percent-encoded as %20.

Then again... in the specific case of Wikipedia, for readability, they replace spaces with underlines instead, so you'd probably be better off replacing "+" with "_" directly. The %20 version will still work because they redirect from there to the underline version.
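
The general recipe described above (IDN on the hostname, UTF-8 percent-encoding for the rest) could be sketched roughly as follows. The class and helper names are hypothetical, the scheme is hard-coded to http, and only a single path segment is handled, so treat it as an illustration rather than a complete IRI converter:

```java
import java.io.UnsupportedEncodingException;
import java.net.IDN;
import java.net.URLEncoder;

class IriToUriSketch {
    // Hypothetical helper: builds an ASCII-only URL from a host, a fixed
    // ASCII path prefix, and one path segment that may contain Unicode.
    static String toAsciiUrl(String host, String prefix, String segment) {
        try {
            // Convert an internationalized host name to its ASCII form.
            String asciiHost = IDN.toASCII(host);
            // URLEncoder emits "+" for spaces; a path segment needs "%20".
            String encoded = URLEncoder.encode(segment, "UTF-8").replace("+", "%20");
            return "http://" + asciiHost + prefix + encoded;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints http://en.wikipedia.org/wiki/2009%E2%80%9310_UE_Lleida_season
        System.out.println(toAsciiUrl("en.wikipedia.org", "/wiki/",
                "2009\u201310_UE_Lleida_season"));
        // prints http://en.wikipedia.org/wiki/2009%E2%80%9310%20UE%20Lleida%20season
        System.out.println(toAsciiUrl("en.wikipedia.org", "/wiki/",
                "2009\u201310 UE Lleida season"));
    }
}
```

A literal + in the segment is already encoded as %2B by URLEncoder, so the replace call only ever touches the pluses that stand for spaces.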

不顾 2024-11-20 08:08:21

Here's a simple algorithm for encoding URLs that use Unicode so that you can use HttpURLConnection to retrieve them:

import static org.junit.Assert.*;

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

import org.apache.commons.lang.CharUtils;
import org.junit.Test;

public class InternationalURLEncoderTest {

    static String encodeUrl(String urlToEncode) throws UnsupportedEncodingException {
        // Split before and after every "/" so the slashes survive as their own tokens
        String[] pathSegments = urlToEncode.split("((?<=/)|(?=/))");
        StringBuilder encodedUrlBuilder = new StringBuilder();
        for (String pathSegment : pathSegments) {
            boolean needsEncoding = false;
            for (char ch : pathSegment.toCharArray()) {
                if (!CharUtils.isAscii(ch)) {
                    needsEncoding = true;
                    break;
                }
            }
            // Encode explicitly as UTF-8; the one-argument overload is deprecated
            // and uses the platform default charset
            String encodedSegment = needsEncoding
                    ? URLEncoder.encode(pathSegment, "UTF-8") : pathSegment;
            encodedUrlBuilder.append(encodedSegment);
        }
        return encodedUrlBuilder.toString();
    }

    @Test
    public void test() throws UnsupportedEncodingException {
        assertEquals(
                "http://www.chinatimes.com/realtimenews/%E5%8D%97%E6%8A%95%E4%B8%80%E8%90%AC%E5%A4%9A%E6%88%B6%E5%A4%A7%E5%81%9C%E9%9B%BB-%E4%B9%9D%E6%88%90%E4%BB%A5%E4%B8%8A%E6%81%A2%E5%BE%A9%E4%BE%9B%E9%9B%BB-20130603003259-260401",
                encodeUrl("http://www.chinatimes.com/realtimenews/南投一萬多戶大停電-九成以上恢復供電-20130603003259-260401"));
        assertEquals("http://www.ttv.com.tw/",
                encodeUrl("http://www.ttv.com.tw/"));
        assertEquals("http://www.ttv.com.tw",
                encodeUrl("http://www.ttv.com.tw"));
        assertEquals("http://www.rt-drive.com.tw/shopping/?st=16",
                encodeUrl("http://www.rt-drive.com.tw/shopping/?st=16"));
    }

}

The algorithm was written using these answers on string splitting and detecting Unicode characters
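
One thing to watch with the segment-based approach: URLEncoder encodes every reserved character in a segment it touches, so a segment that mixes non-ASCII text with characters such as ? or = would have those encoded as well. A possible variation, sketched here with made-up names and without any external libraries, is to percent-encode only the non-ASCII characters and pass everything else through untouched:

```java
import java.nio.charset.StandardCharsets;

class NonAsciiEscaper {
    // Hypothetical helper: percent-encodes each non-ASCII code point as its
    // UTF-8 bytes and copies ASCII characters unchanged.
    static String escapeNonAscii(String url) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < url.length()) {
            int codePoint = url.codePointAt(i);
            int width = Character.charCount(codePoint);
            if (codePoint < 0x80) {
                out.append((char) codePoint);
            } else {
                byte[] utf8 = url.substring(i, i + width).getBytes(StandardCharsets.UTF_8);
                for (byte b : utf8) {
                    out.append(String.format("%%%02X", b & 0xFF));
                }
            }
            i += width;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints http://www.rt-drive.com.tw/shopping/?st=%E5%8D%97
        System.out.println(escapeNonAscii("http://www.rt-drive.com.tw/shopping/?st=\u5357"));
    }
}
```

This sketch assumes the ASCII part of the input is already a valid URL; it does not escape spaces or other illegal ASCII characters, and it does not treat unpaired surrogates specially.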

謌踐踏愛綪 2024-11-20 08:08:21

Here's a simpler way of encoding the URL in Chi's answer:

static String encodeUrl(String urlToEncode) throws URISyntaxException {
    return new URI(urlToEncode).toASCIIString();
}

See this answer for clarification.
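
Note that the single-argument URI constructor rejects input containing characters that are illegal in a URI, such as spaces. If the title may contain them, the multi-argument constructor quotes illegal characters in the path first, and toASCIIString then encodes the remaining non-ASCII ones. A small sketch, with a hypothetical helper name and the Wikipedia host hard-coded as an assumption:

```java
import java.net.URI;
import java.net.URISyntaxException;

class WikiTitleUri {
    // Hypothetical helper: builds an ASCII-only Wikipedia URL from a raw page title.
    static String encodeTitle(String title) {
        try {
            // URI(scheme, host, path, fragment) quotes illegal path characters
            // such as spaces; toASCIIString() then percent-encodes non-ASCII
            // characters as UTF-8.
            return new URI("http", "en.wikipedia.org", "/wiki/" + title, null)
                    .toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // prints http://en.wikipedia.org/wiki/2009%E2%80%9310%20UE%20Lleida%20season
        System.out.println(encodeTitle("2009\u201310 UE Lleida season"));
    }
}
```

Because this constructor always quotes the percent character, it should be given raw titles, not titles that are already percent-encoded.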
