Java: UTF-8 encoding not applied to URLConnection

Posted 2024-12-27 20:29:22

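I'm trying to retrieve data from http://api.freebase.com/api/trans/raw/m/0h47

As you can see, the text contains symbols like this: /ælˈdʒɪəriə/.

When I fetch the page source, I get text containing sequences like &#250; etc.

So far I've tried the following: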

urlConnection.setRequestProperty("Accept-Charset", "UTF-8");
urlConnection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded;charset=utf-8");

What am I doing wrong?

My entire code:

URL url = null;
URLConnection urlConn = null;
DataInputStream input = null;
try {
    url = new URL("http://api.freebase.com/api/trans/raw/m/0h47");
} catch (MalformedURLException e) { e.printStackTrace(); }

try {
    urlConn = url.openConnection(); 
} catch (IOException e) { e.printStackTrace(); }
urlConn.setRequestProperty("Accept-Charset", "UTF-8");
urlConn.setRequestProperty("Content-Type", "text/plain; charset=utf-8");

urlConn.setDoInput(true);
urlConn.setUseCaches(false);

StringBuffer strBseznam = new StringBuffer();
if (strBseznam.length() > 0)
    strBseznam.deleteCharAt(strBseznam.length() - 1);

try {
    input = new DataInputStream(urlConn.getInputStream()); 
} catch (IOException e) { e.printStackTrace(); }
String str = "";
StringBuffer strB = new StringBuffer();
strB.setLength(0);
try {
    while (null != ((str = input.readLine()))) 
    {
        strB.append(str); 
    }
    input.close();
} catch (IOException e) { e.printStackTrace(); }

Comments (3)

凉城 2025-01-03 20:29:23

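The HTML page is in UTF-8 and could contain Arabic characters and the like directly. However, the characters above Unicode code point 127 are still encoded as numeric entities such as &#250;. An Accept-Charset header will not help, and reading the response as UTF-8 is entirely correct.

You have to decode the entities yourself. Something like: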

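// Needs: import java.util.regex.Matcher; import java.util.regex.Pattern;

// Replaces decimal numeric entities such as &#250; with the characters they encode.
String decodeNumericEntities(String s) {
    // StringBuffer, because Matcher.appendReplacement accepts a StringBuilder only from Java 9 on.
    StringBuffer sb = new StringBuffer();
    Matcher m = Pattern.compile("&#(\\d+);").matcher(s);
    while (m.find()) {
        int uc = Integer.parseInt(m.group(1));
        m.appendReplacement(sb, "");  // copy the text before the entity, drop the entity itself
        sb.appendCodePoint(uc);       // append the decoded character
    }
    m.appendTail(sb);                 // copy whatever follows the last entity
    return sb.toString();
}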
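
For a quick check of the idea, here is a self-contained sketch of the same decoder. The class name EntityDecoder, the added support for hexadecimal entities such as &#xFA;, and the choice to leave out-of-range values untouched are assumptions made for illustration, not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
final class EntityDecoder {
    // Matches decimal (&#250;) and hexadecimal (&#xFA;) numeric entities.
    private static final Pattern NUMERIC_ENTITY =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|(\\d+));");

    static String decode(String s) {
        StringBuffer sb = new StringBuffer();
        Matcher m = NUMERIC_ENTITY.matcher(s);
        while (m.find()) {
            int codePoint;
            try {
                codePoint = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)  // hexadecimal form
                        : Integer.parseInt(m.group(2));     // decimal form
            } catch (NumberFormatException e) {
                codePoint = -1;  // number too large for an int
            }
            if (Character.isValidCodePoint(codePoint)) {
                m.appendReplacement(sb, "");
                sb.appendCodePoint(codePoint);
            } else {
                // Assumption: keep entities that do not name a code point as they are.
                m.appendReplacement(sb, Matcher.quoteReplacement(m.group()));
            }
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // The pronunciation string from the question, written as numeric entities.
        System.out.println(decode("/&#230;l&#712;d&#658;&#618;&#601;ri&#601;/"));
    }
}
```

Named entities such as &amp;amp; are not handled here; a full HTML parser or an existing unescaping utility is the safer choice when the input is arbitrary HTML.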

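By the way, these entities may stem from processed HTML form input, that is, from the editing side of the web application.


Update, after seeing the code in the question:

I have replaced the DataInputStream with a (buffered) Reader for text. InputStreams read binary data (bytes); Readers read text (characters, Strings). An InputStreamReader takes an InputStream and an encoding as parameters and yields a Reader.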

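// Needs: java.io.BufferedReader, java.io.InputStreamReader, java.io.IOException,
//        java.nio.charset.StandardCharsets
StringBuilder strB = new StringBuilder();
// try-with-resources closes the reader even if reading fails
try (BufferedReader input = new BufferedReader(
        new InputStreamReader(urlConn.getInputStream(), StandardCharsets.UTF_8))) {
    String str;
    while ((str = input.readLine()) != null) {
        strB.append(str).append("\r\n");  // readLine strips the line terminator
    }
} catch (IOException e) {
    e.printStackTrace();
}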
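
To see why the Reader matters independently of the entity issue, the following sketch reads the same UTF-8 bytes both ways. It uses an in-memory byte array in place of the HTTP response, and the class and method names are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, named for illustration only.
final class ReadLineComparison {
    // Reads one line the way the question's code does: each byte becomes one char.
    @SuppressWarnings("deprecation")
    static String viaDataInputStream(byte[] body) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(body))) {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads one line through a Reader that decodes the bytes as UTF-8.
    static String viaReader(byte[] body) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(body), StandardCharsets.UTF_8))) {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The pronunciation string from the question, written with Unicode escapes.
        String original = "/\u00E6l\u02C8d\u0292\u026A\u0259ri\u0259/";
        byte[] body = original.getBytes(StandardCharsets.UTF_8);
        System.out.println(original.equals(viaReader(body)));           // true
        System.out.println(original.equals(viaDataInputStream(body)));  // false
    }
}
```

DataInputStream.readLine is deprecated precisely because it does not convert bytes to characters properly, so every multi-byte UTF-8 sequence comes out as several unrelated chars.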

荒路情人 2025-01-03 20:29:23

Try also adding a User-Agent header to your URLConnection:

urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36");

This solved my decoding problem like a charm.

臻嫒无言 2025-01-03 20:29:23

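I think the problem lies in how you read from the stream. DataInputStream.readLine does not decode bytes into characters properly, and readUTF is not a fit either, since it expects the length-prefixed format written by DataOutputStream.writeUTF rather than an arbitrary UTF-8 body. What I would do is create an InputStreamReader with the encoding set, and then read line by line from a BufferedReader (this goes inside your existing try/catch):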

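// Needs: java.io.BufferedReader, java.io.InputStreamReader, java.nio.charset.Charset
Charset charset = Charset.forName("UTF-8");
// The reader decodes the response bytes with the given charset
InputStreamReader stream = new InputStreamReader(urlConn.getInputStream(), charset);
BufferedReader reader = new BufferedReader(stream);
StringBuilder responseBuffer = new StringBuilder();

String read;
while ((read = reader.readLine()) != null) {
    responseBuffer.append(read);  // note: line terminators are dropped
}
reader.close();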
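
Hard-coding UTF-8 works when the server is known to send UTF-8. As a hedged sketch of a more general approach, the charset can be taken from the response's Content-Type header when one is present. The class ResponseCharset and its method fromContentType are hypothetical names, and the UTF-8 fallback is an assumption, not something the answers above prescribe.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper class, named for illustration only.
final class ResponseCharset {
    private static final String KEY = "charset=";

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=ISO-8859-1"; falls back to UTF-8 (an assumption).
    static Charset fromContentType(String contentType) {
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                String p = param.trim();
                // Case-insensitive check for the "charset=" prefix
                if (p.regionMatches(true, 0, KEY, 0, KEY.length())) {
                    String name = p.substring(KEY.length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                        break;  // malformed or unknown name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=ISO-8859-1"));
        System.out.println(fromContentType("text/plain"));
    }
}
```

With the connection from the question, this could be used as new InputStreamReader(urlConn.getInputStream(), ResponseCharset.fromContentType(urlConn.getContentType())).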
