URL encoding of Latin characters in Java

Posted on 2024-08-25 19:53:44

I'm trying to read an image from a URL. As mentioned in the Java documentation, I tried converting the URL to a URI first:

String imageURL = "http://www.shefinds.com/files/Christian-Louboutin-Décolleté-100-pumps.jpg";
URL url = new URL(imageURL);
url = new URI(url.getProtocol(), url.getHost(), url.getFile(), null).toURL();  
URLConnection conn = url.openConnection();
InputStream is = conn.getInputStream();

I get a java.io.FileNotFoundException for the file
http://www.shefinds.com/files/Christian-Louboutin-Décolleté-100-pumps.jpg

What am I doing wrong and what is the right way to encode this URL?

Update:
I'm using ROME to read RSS feeds. Following BalusC's suggestions, I printed the raw input at different stages, and it seems that the ROME RSS parser is using ISO-8859-1 instead of UTF-8.

Comments (3)

情泪▽动烟 2024-09-01 19:53:44

It works fine here (it returns a 403, but at least not a 404):

URL url = new URL("http://www.shefinds.com/files/Christian-Louboutin-Décolleté-100-pumps.jpg");
URLConnection connection = url.openConnection();
InputStream input = connection.getInputStream();

When I fix it so that it doesn't return a 403, the picture is correctly retrieved:

URL url = new URL("http://www.shefinds.com/files/Christian-Louboutin-Décolleté-100-pumps.jpg");
URLConnection connection = url.openConnection();
// Sending a browser-like User-Agent is what avoids the 403 here.
connection.setRequestProperty("User-Agent", "Mozilla/4.0");
InputStream input = connection.getInputStream();
OutputStream output = new FileOutputStream("/pic.jpg");
// Copy the response body to the file one byte at a time.
for (int data = 0; (data = input.read()) != -1;) {
    output.write(data);
}
output.close();
input.close();

So your problem lies somewhere else. Converting is actually not needed. The initial URL is valid.

Maybe you're obtaining the actual URL from some binary source using the wrong character encoding? The change of é into Ã© suggests that the original source was UTF-8 encoded and that the code incorrectly read it using ISO-8859-1 instead of UTF-8.
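
The effect described above can be reproduced in isolation. The following is only a minimal sketch, not code from the question: the class name MojibakeDemo and its two helper methods are hypothetical, and it assumes the text was misread exactly once as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original question or answer.
public class MojibakeDemo {

    // Encode as UTF-8, then decode with the wrong charset (ISO-8859-1),
    // which is what a misconfigured reader would effectively do.
    public static String misread(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverse a single UTF-8-read-as-ISO-8859-1 mistake.
    // Assumption: the text was garbled exactly once, in exactly this way.
    public static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Escapes keep this source file pure ASCII, so its own encoding cannot interfere.
        String original = "D\u00e9collet\u00e9";
        String garbled = misread(original);
        // Each e-acute has become two characters, so the string is longer.
        System.out.println(original.length() + " -> " + garbled.length());
        System.out.println(repair(garbled).equals(original));
    }
}
```

Repairing the string afterwards is a workaround at best; the cleaner fix is to make the component that reads the bytes use UTF-8 in the first place.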

Update: or maybe you've actually hardcoded it in the Java source code and saved the source file itself using the wrong encoding. I've configured my editor (Eclipse) to save files as UTF-8, and -Dfile.encoding also defaults to UTF-8 here, which would explain why it works on my machine ;)

Update 2: as per the comments, in a nutshell, everything should work fine if the encoding used to save the source file matches the default -Dfile.encoding of the runtime platform (and the character encoding in question supports the é). To avoid such unforeseen clashes when you distribute the code, it's indeed better to replace hardcoded non-ASCII characters with Unicode escapes.
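
As a rough illustration of the Unicode-escape advice, the sketch below writes the file name with escapes so the source file contains only ASCII. The class name UnicodeEscapeDemo and its constants are hypothetical, and the spelling with é is assumed to be the URL the question intended.

```java
// Hypothetical demo class; the constant names are illustrative only.
public class UnicodeEscapeDemo {

    // The compiler turns the escape \u00e9 into e-acute, whatever encoding
    // the source file was saved in, because the file itself is pure ASCII.
    public static final String FILE_NAME = "D\u00e9collet\u00e9";

    // Assumed intended URL from the question, written with escapes.
    public static final String IMAGE_URL =
        "http://www.shefinds.com/files/Christian-Louboutin-" + FILE_NAME + "-100-pumps.jpg";

    public static void main(String[] args) {
        // 9 characters: the escapes each produce exactly one char.
        System.out.println(FILE_NAME.length());
        // Code point of the second character, printed in hex.
        System.out.println(Integer.toHexString(FILE_NAME.charAt(1)));
    }
}
```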

你在看孤独的风景 2024-09-01 19:53:44

I think the technical answer is "you can't." Non-ASCII characters can't be used in a URL according to the standard, and even some ASCII characters must be escaped with "%XX" syntax, where XX is the hexadecimal value of the character's byte.

If anything, you can escape 'é' with '%E9' but this relies on the server interpreting this as an encoding of the character according to ISO-8859-1. While this isn't technically allowed, I believe many servers will do it.
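
To make the difference between the escapes concrete, here is a small sketch under stated assumptions: the class PercentEncodingDemo and its methods are hypothetical, and URLEncoder is used only to show which bytes each charset produces (it is meant for form data, so it is not a general path encoder). The multi-argument java.net.URI constructor quotes illegal characters, and toASCIIString() then escapes non-ASCII characters as UTF-8 bytes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

// Hypothetical demo class; not part of the original answers.
public class PercentEncodingDemo {

    // Build a URI from its parts and return the all-ASCII form.
    // toASCIIString() percent-encodes non-ASCII characters as UTF-8.
    public static String encodeUrl(String scheme, String host, String path) {
        try {
            return new URI(scheme, host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Show the %XX escapes a given charset yields for a piece of text.
    public static String escapeAs(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9";
        System.out.println(escapeAs(eAcute, "ISO-8859-1")); // one byte
        System.out.println(escapeAs(eAcute, "UTF-8"));      // two bytes
        System.out.println(encodeUrl("http", "www.shefinds.com",
            "/files/Christian-Louboutin-D\u00e9collet\u00e9-100-pumps.jpg"));
    }
}
```

Which of the two escapes a given server accepts depends on how that server decodes request paths, so neither form is guaranteed to resolve.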

我做我的改变 2024-09-01 19:53:44

The encoding of your source file is to blame. Using your IDE, set it to UTF-8, and then repaste the URL.
