URL encoding for Latin characters in Java

Posted on 2024-08-25 19:53:44


I'm trying to read an image from a URL. As mentioned in the Java documentation, I tried converting the URL to a URI first:

String imageURL = "http://www.shefinds.com/files/Christian-Louboutin-Décolleté-100-pumps.jpg";
URL url = new URL(imageURL);
url = new URI(url.getProtocol(), url.getHost(), url.getFile(), null).toURL();  
URLConnection conn = url.openConnection();
InputStream is = conn.getInputStream();

I get a java.io.FileNotFoundException for the file
http://www.shefinds.com/files/Christian-Louboutin-DÃ©colletÃ©-100-pumps.jpg

What am I doing wrong and what is the right way to encode this URL?

Update:
I'm using Rome to read RSS feeds. Following BalusC's suggestions, I printed the raw input at different stages, and it looks like the ROME RSS parser is using ISO-8859-1 instead of UTF-8.

情泪▽动烟 2024-09-01 19:53:44

Works fine here (returns a 403, it's at least not a 404):

URL url = new URL("http://www.shefinds.com/files/Christian-Louboutin-Décolleté-100-pumps.jpg");
URLConnection connection = url.openConnection();
InputStream input = connection.getInputStream();

When I fix it so that it doesn't return a 403, the picture is correctly retrieved:

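import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;

URL url = new URL("http://www.shefinds.com/files/Christian-Louboutin-Décolleté-100-pumps.jpg");
URLConnection connection = url.openConnection();
// Send a browser-like User-Agent so the server does not answer with a 403.
connection.setRequestProperty("User-Agent", "Mozilla/4.0");
// try-with-resources closes both streams even if the copy fails.
try (InputStream input = connection.getInputStream();
     OutputStream output = new FileOutputStream("/pic.jpg")) {
    for (int data; (data = input.read()) != -1;) {
        output.write(data);
    }
}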

So your problem lies somewhere else. Converting is actually not needed. The initial URL is valid.

Maybe you're obtaining the actual URL from some binary source using the wrong character encoding? The change from é to Ã© suggests that the original source was UTF-8 encoded and that the code incorrectly read it as ISO-8859-1 instead of UTF-8.
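To make that diagnosis concrete, here is a minimal sketch of how é turns into Ã©. The class and method names (`MojibakeDemo`, `misread`, `repair`) are made up for illustration and are not from the question's code. It encodes the text as UTF-8 and then decodes those bytes as ISO-8859-1, which is the mistake suspected above.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encodes the text as UTF-8, then decodes those bytes as ISO-8859-1,
    // reproducing the suspected wrong read.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the damage. This only works while every garbled char is still
    // intact, because ISO-8859-1 maps each byte to exactly one char.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "D\u00e9collet\u00e9"; // "Décolleté" written with Unicode escapes
        String garbled = misread(original);
        // What the console shows depends on its own encoding, so compare by code point.
        System.out.println(garbled.equals("D\u00c3\u00a9collet\u00c3\u00a9"));
        System.out.println(repair(garbled).equals(original));
    }
}
```

Each é is two bytes in UTF-8 (0xC3 0xA9), and ISO-8859-1 decodes them as the two separate characters Ã and ©, which matches the file name in the exception message.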

Update: or maybe you've actually hardcoded it in the Java source code and saved the source file itself using the wrong encoding. I've configured my editor (Eclipse) to save files as UTF-8, and -Dfile.encoding also defaults to UTF-8 here, which would explain why it works on my machine ;)

Update 2: as per the comments, in a nutshell, everything should work fine if the encoding used to save the source file matches the default -Dfile.encoding of the runtime platform (and the character encoding in question supports the é). To avoid such unforeseen clashes when you distribute the code, it's indeed better to replace hardcoded non-ASCII characters with Unicode escapes.
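As a rough sketch of that advice, the literal below writes é as `\u00e9`, so the source file stays pure ASCII no matter how it is saved. The class name `EscapedUrlDemo` and the helper `toAsciiUrl` are hypothetical. The helper additionally shows one way to get an ASCII-only URL, in case one is wanted despite the original URL already working: `URI.toASCIIString()` percent-encodes non-ASCII characters using UTF-8.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class EscapedUrlDemo {
    // \u00e9 is é; the compiler resolves the escape regardless of file encoding.
    static final String PATH =
            "/files/Christian-Louboutin-D\u00e9collet\u00e9-100-pumps.jpg";

    // Builds a URI from its parts and returns it with non-ASCII characters
    // percent-encoded as UTF-8 bytes.
    static String toAsciiUrl(String scheme, String host, String path)
            throws URISyntaxException {
        return new URI(scheme, host, path, null).toASCIIString();
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(toAsciiUrl("http", "www.shefinds.com", PATH));
    }
}
```

With the path above, each é should come out as `%C3%A9`, which can be passed to `new URL(...)` without relying on any platform encoding.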
