Java UTF-8 encoding not set on URLConnection
I'm trying to retrieve data from http://api.freebase.com/api/trans/raw/m/0h47
As you can see, the text contains characters like these: /ælˈdʒɪəriə/.
When I fetch the page source, I get text with sequences like &#250; instead.
So far I've tried the following:
urlConnection.setRequestProperty("Accept-Charset", "UTF-8");
urlConnection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded;charset=utf-8");
What am I doing wrong?
My entire code:
URL url = null;
URLConnection urlConn = null;
DataInputStream input = null;
try {
    url = new URL("http://api.freebase.com/api/trans/raw/m/0h47");
} catch (MalformedURLException e) { e.printStackTrace(); }
try {
    urlConn = url.openConnection();
} catch (IOException e) { e.printStackTrace(); }
urlConn.setRequestProperty("Accept-Charset", "UTF-8");
urlConn.setRequestProperty("Content-Type", "text/plain; charset=utf-8");
urlConn.setDoInput(true);
urlConn.setUseCaches(false);
StringBuffer strBseznam = new StringBuffer();
if (strBseznam.length() > 0)
    strBseznam.deleteCharAt(strBseznam.length() - 1);
try {
    input = new DataInputStream(urlConn.getInputStream());
} catch (IOException e) { e.printStackTrace(); }
String str = "";
StringBuffer strB = new StringBuffer();
strB.setLength(0);
try {
    while (null != ((str = input.readLine())))
    {
        strB.append(str);
    }
    input.close();
} catch (IOException e) { e.printStackTrace(); }
Comments (3)
The HTML page is in UTF-8 and could contain Arabic characters and the like. However, characters above Unicode code point 127 are still encoded as numeric entities such as &#250;. Request headers such as Accept-Charset or Accept-Encoding will not help, and reading the response as UTF-8 is entirely correct. You have to decode the entities yourself.
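Decoding the entities by hand could look roughly like the sketch below. It is only an illustration: the class name NumericEntityDecoder and the method decode are made up for this example, and it handles numeric references only (decimal such as &#250; and hexadecimal such as &#xFA;), not named entities like &amp;.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the original answer): replaces numeric
// character references with the characters they stand for.
final class NumericEntityDecoder {
    // Group 1 captures hexadecimal digits, group 2 captures decimal digits.
    private static final Pattern ENTITY =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decode(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one unchanged.
            sb.append(s, last, m.start());
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            if (Character.isValidCodePoint(codePoint)) {
                sb.appendCodePoint(codePoint);
            } else {
                // Out-of-range references are left exactly as they were.
                sb.append(m.group());
            }
            last = m.end();
        }
        sb.append(s, last, s.length());
        return sb.toString();
    }
}
```

You would call it on the text you have already read as UTF-8, for example NumericEntityDecoder.decode(strB.toString()).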
By the way, those entities could stem from processed HTML forms, that is, from the editing side of the web app.
Regarding the code in the question: I have replaced DataInputStream with a (Buffered)Reader for text. InputStreams read binary data (bytes); Readers read text (characters). An InputStreamReader takes an InputStream and an encoding as parameters and yields a Reader.
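To make the byte-versus-character distinction concrete, here is a small in-memory comparison. It is a sketch rather than the code originally posted: the class and method names are invented, and a ByteArrayInputStream stands in for the network stream. The deprecated DataInputStream.readLine turns every byte into one char, so a multi-byte UTF-8 character comes out as several wrong characters, while an InputStreamReader with an explicit charset decodes it properly.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: reads the same UTF-8 bytes in two ways.
final class StreamVersusReader {
    // Byte-oriented: each byte becomes one char, so UTF-8 sequences are mangled.
    @SuppressWarnings("deprecation")
    static String viaDataInputStream(byte[] utf8Bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(utf8Bytes))) {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Character-oriented: the InputStreamReader decodes the bytes as UTF-8.
    static String viaReader(byte[] utf8Bytes) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8))) {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For the two UTF-8 bytes of ú, the first method returns a two-character string and the second returns the single character ú.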
Try also setting a User-Agent request header on your URLConnection. This solved my decoding problem like a charm.
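Setting the header might look like the sketch below. The class name, the method name, and the User-Agent value used in the usage note are placeholders chosen for illustration, not values from the original answer; whether a given server responds differently to a particular User-Agent is something you would have to test.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper: opens a connection and sets the User-Agent header
// before any request is sent.
final class UserAgentExample {
    static URLConnection open(String address, String userAgent) {
        try {
            // openConnection() only creates the connection object;
            // the request itself goes out later, e.g. on getInputStream().
            URLConnection conn = URI.create(address).toURL().openConnection();
            conn.setRequestProperty("User-Agent", userAgent);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, UserAgentExample.open("http://api.freebase.com/api/trans/raw/m/0h47", "Mozilla/5.0") returns a connection whose input stream you can then read as before.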
Well, I think the problem lies in how you read from the stream. You should either call the readUTF method on the DataInputStream instead of calling readLine or, what I would do, create an InputStreamReader, set the encoding, and then read line by line from a BufferedReader (this would go inside your existing try/catch). Note that readUTF expects the length-prefixed, modified UTF-8 format written by DataOutputStream.writeUTF, so it is unlikely to work on a raw HTTP response; the InputStreamReader route is the safer of the two.
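The InputStreamReader approach could be sketched as follows. The helper name Utf8BodyReader.readAll is made up for this example, and it assumes the response really is UTF-8. Unlike the loop in the question, it keeps a newline after each line so that line boundaries are not lost.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads a whole stream as UTF-8 text, line by line.
final class Utf8BodyReader {
    static String readAll(InputStream in) {
        StringBuilder body = new StringBuilder();
        // try-with-resources closes the reader (and the wrapped stream).
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return body.toString();
    }
}
```

With the question's variables this would be called as Utf8BodyReader.readAll(urlConn.getInputStream()).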