JSoup character encoding issue
I am using JSoup to parse content from http://www.latijnengrieks.com/vertaling.php?id=5368 . This is a third-party website that does not specify a proper encoding. I am using the following code to load the data:
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Loader {
    public static void main(String[] args) {
        String url = "http://www.latijnengrieks.com/vertaling.php?id=5368";
        Document doc;
        try {
            // Fetch and parse the page; no charset is specified here.
            doc = Jsoup.connect(url).timeout(5000).get();
            Element content = doc.select("div.kader").first();
            Element contenttableElement = content.getElementsByClass("kopje").first().parent().parent();
            String contenttext = content.html();
            String tabletext = contenttableElement.html();
            contenttext = Jsoup.parse(contenttext).text();
            contenttext = contenttext.replace("br2n", "\n");
            // Mark <br> tags with a placeholder so line breaks survive text().
            tabletext = Jsoup.parse(tabletext.replaceAll("(?i)<br[^>]*>", "br2n")).text();
            tabletext = tabletext.replace("br2n", "\n");
            // Drop the leading table text so only the translation remains.
            String text = contenttext.substring(tabletext.length(), contenttext.length());
            System.out.println(text);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
This gives the following output:
Aeneas dwaalt rond in Troje en zoekt Cre?sa. Cre?sa is echter op de vlucht gestorven Plotseling verschijnt er een schim. Het is de schim van Cre?sa. De schim zegt:'De oorlog woedt!' Troje is ingenomen! Cre?sa is gestorven:'Vlucht!' Aeneas vlucht echter niet. Dan spreekt de schim:'Vlucht! Er staat jou een nieuw vaderland en een nieuw koninkrijk te wachten.' Dan pas gehoorzaamt Aeneas en vlucht.
Is there any way the ? marks can be the original (ü) again in the output?
Comments (4)
The charset attribute is missing in the HTTP response Content-Type header. Jsoup will resort to platform default charset when parsing the HTML. The Document.OutputSettings#charset() won't work as it's used for presentation only (on html() and text()), not for parsing the data (in other words, it's too late already).
You need to read the URL as InputStream and manually specify the charset in the Jsoup#parse() method.
This results here in
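The idea in this answer, reading the response as an InputStream and decoding it with an explicitly chosen charset, can be sketched with the standard library alone. This is a minimal sketch and not the answerer's original code: the class name ExplicitCharsetRead, the helper readAll, and the choice of ISO-8859-1 are assumptions for illustration, and a ByteArrayInputStream stands in for the stream opened from the URL. With Jsoup itself, the same stream and charset name could be handed to the Jsoup.parse(InputStream, String, String) overload instead of being decoded by hand.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of Jsoup.
final class ExplicitCharsetRead {

    // Reads the whole stream and decodes it with the caller-supplied charset,
    // instead of relying on any default.
    static String readAll(InputStream in, Charset charset) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The name from the question as an ISO-8859-1 server would send it:
        // the u-umlaut is the single byte 0xFC.
        byte[] body = {'C', 'r', 'e', (byte) 0xFC, 's', 'a'};
        InputStream in = new ByteArrayInputStream(body);

        String html = readAll(in, StandardCharsets.ISO_8859_1);

        // Compare against an escaped literal so the check does not depend
        // on the console or source-file encoding.
        System.out.println(html.equals("Cre\u00fcsa"));
    }
}
```

The charset has to be known or guessed in advance; ISO-8859-1 is only an assumption about this particular site.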
Well, I figured out another way to do that. In my case, I had a Jsoup Connection object and I wanted to retrieve the HTML response from a post() request to a website that was encoded with "ISO-8859". As the default encoding for Jsoup is UTF-8, the content of the response (the HTML) came back with � replacing some letters. I needed to somehow convert it to ISO-8859-15. To do that, I created the connection.
After that, I created a response Document that holds the result of the post. Since it was not clear how to set the encoding of the response in Jsoup, I opted to execute the post and then save the response as bytes, preserving the original encoding. I then created a new String from this byte array and the proper encoding to apply. The document is then created with the correct encoding.
So, here is the output of response.html() before and after the modification:
Before:
62.09-1-00 - Suporte t�cnico, manuten��o e outros servi�os em tecnologia da informa��o
After:
62.09-1-00 - Suporte técnico, manutenção e outros serviços em tecnologia da informação
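The byte-level step described in this answer, keeping the raw response bytes and building a String with the correct charset, can be sketched as follows. This is a minimal sketch under stated assumptions: the class name ReDecodeBytes and the helper decode are hypothetical, the sample word is taken from the output above, and ISO-8859-1 is used in place of ISO-8859-15 because both encode é as the same byte. In Jsoup, the raw bytes would presumably come from Connection.Response#bodyAsBytes(), and the resulting String could then be passed to Jsoup.parse(String).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of Jsoup.
final class ReDecodeBytes {

    // Decodes raw response bytes with the charset the server actually used.
    static String decode(byte[] rawBody, Charset actualCharset) {
        return new String(rawBody, actualCharset);
    }

    public static void main(String[] args) {
        // "tecnico" with an e-acute, encoded as ISO-8859-1: the accented
        // letter is the single byte 0xE9.
        byte[] rawBody = "t\u00e9cnico".getBytes(StandardCharsets.ISO_8859_1);

        // Decoding as UTF-8 cannot interpret 0xE9 here and substitutes the
        // replacement character U+FFFD.
        String wrong = decode(rawBody, StandardCharsets.UTF_8);

        // Decoding with the real charset restores the accented letter.
        String right = decode(rawBody, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 decode contains U+FFFD: " + (wrong.indexOf('\uFFFD') >= 0));
        System.out.println("ISO-8859-1 decode is correct: " + right.equals("t\u00e9cnico"));
    }
}
```

This only works if the raw bytes are kept; once the body has been decoded with the wrong charset, the original bytes behind each U+FFFD are lost.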
The Jsoup documentation states that Jsoup should automatically detect the correct charset when reading in the document, but for some reason, it's not working for me. I then tried to manually set the Document's charset using outputSettings().charset(...):
But that still didn't work, so perhaps I'm doing it wrong (I'm just learning Jsoup).
One workaround that did work, at least for me, was to read in the web page using a Scanner with its charset set explicitly:
But I'll be following this thread to see if anyone comes up with a better suggestion, one that doesn't need the use of another class to read in the HTML.
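The Scanner workaround mentioned in this answer might look roughly like the sketch below. It is not the answerer's original code: the class name ScannerCharsetRead, the helper readAll, and the ISO-8859-1 charset name are assumptions for illustration, and a ByteArrayInputStream stands in for the stream opened from the URL. The returned String could then be given to Jsoup.parse(String).

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Scanner;

// Hypothetical helper class for illustration; not part of Jsoup.
final class ScannerCharsetRead {

    // Reads the entire stream as one token, decoding with the named charset.
    static String readAll(InputStream in, String charsetName) {
        try (Scanner scanner = new Scanner(in, charsetName)) {
            // "\\A" matches only at the start of input, so the whole
            // stream is returned as a single token.
            scanner.useDelimiter("\\A");
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) {
        // A tiny ISO-8859-1 encoded page; 0xFC is the u-umlaut.
        byte[] page = {'<', 'p', '>', 'C', 'r', 'e', (byte) 0xFC, 's', 'a', '<', '/', 'p', '>'};

        String html = readAll(new ByteArrayInputStream(page), "ISO-8859-1");

        // Compare against an escaped literal to stay independent of the
        // console encoding.
        System.out.println(html.equals("<p>Cre\u00fcsa</p>"));
    }
}
```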
I used:
I also saved the class as UTF-8.