JSoup character encoding issue

Posted 2024-12-08 21:45:12

I am using JSoup to parse content from http://www.latijnengrieks.com/vertaling.php?id=5368 . This is a third-party website that does not specify a proper encoding. I am using the following code to load the data:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Loader {

    public static void main(String[] args) {
        String url = "http://www.latijnengrieks.com/vertaling.php?id=5368";

        try {
            // get() has to pick a charset itself; this site does not declare one
            Document doc = Jsoup.connect(url).timeout(5000).get();
            Element content = doc.select("div.kader").first();
            Element contenttableElement = content.getElementsByClass("kopje").first().parent().parent();

            String contenttext = content.html();
            String tabletext = contenttableElement.html();

            contenttext = Jsoup.parse(contenttext).text();
            contenttext = contenttext.replace("br2n", "\n");
            tabletext = Jsoup.parse(tabletext.replaceAll("(?i)<br[^>]*>", "br2n")).text();
            tabletext = tabletext.replace("br2n", "\n");

            // Drop the leading table text so only the translation remains
            String text = contenttext.substring(tabletext.length(), contenttext.length());
            System.out.println(text);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

This gives the following output:

Aeneas dwaalt rond in Troje en zoekt Cre?sa. Cre?sa is echter op de vlucht gestorven Plotseling verschijnt er een schim. Het is de schim van Cre?sa. De schim zegt:'De oorlog woedt!' Troje is ingenomen! Cre?sa is gestorven:'Vlucht!' Aeneas vlucht echter niet. Dan spreekt de schim:'Vlucht! Er staat jou een nieuw vaderland en een nieuw koninkrijk te wachten.' Dan pas gehoorzaamt Aeneas en vlucht.

Is there any way to get the original character (ü) back in the output instead of the ? marks?

故人爱我别走 2024-12-15 21:45:12

The charset attribute is missing from the HTTP response's Content-Type header, so Jsoup has to guess. Without a charset in the header or in a <meta> tag, it falls back to a default (UTF-8 in current versions) when parsing the HTML, which is wrong for this page. Document.OutputSettings#charset() won't help, because it is used for presentation only (on html() and text()), not for parsing the data; in other words, it is already too late.

You need to read the URL as an InputStream and specify the charset manually in the Jsoup#parse() method.

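import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

String url = "http://www.latijnengrieks.com/vertaling.php?id=5368";
// Decode the raw bytes as ISO-8859-1 instead of letting Jsoup guess
Document document = Jsoup.parse(new URL(url).openStream(), "ISO-8859-1", url);
Element paragraph = document.select("div.kader p").first();

// Print each text node (the text between <br> tags) on its own line
for (Node node : paragraph.childNodes()) {
    if (node instanceof TextNode) {
        System.out.println(((TextNode) node).text().trim());
    }
}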

Here, this prints:

Aeneas dwaalt rond in Troje en zoekt Creüsa.
Creüsa is echter op de vlucht gestorven
Plotseling verschijnt er een schim.
Het is de schim van Creüsa.
De schim zegt:'De oorlog woedt!'
Troje is ingenomen!
Creüsa is gestorven:'Vlucht!'
Aeneas vlucht echter niet.
Dan spreekt de schim:'Vlucht! Er staat jou een nieuw vaderland en een nieuw koninkrijk te wachten.'
Dan pas gehoorzaamt Aeneas en vlucht.
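
To illustrate why the charset has to be chosen when the bytes are decoded, and cannot be repaired afterwards, here is a minimal sketch that uses only the JDK and no Jsoup. The class name `DecodeDemo` and the sample bytes are made up for illustration; the assumption is that the server sends ü as the single ISO-8859-1 byte 0xFC.

```java
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // "Creüsa" as an ISO-8859-1 server would send it: ü is the single byte 0xFC.
    static final byte[] RAW = {'C', 'r', 'e', (byte) 0xFC, 's', 'a'};

    // Decoding with the charset the bytes were written in restores the ü.
    static String asLatin1() {
        return new String(RAW, StandardCharsets.ISO_8859_1);
    }

    // 0xFC is not a valid UTF-8 byte, so the decoder substitutes U+FFFD
    // and the original character is lost for good.
    static String asUtf8() {
        return new String(RAW, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Print comparisons rather than the strings themselves, so the result
        // does not depend on the console's own encoding.
        System.out.println(asLatin1().equals("Cre\u00FCsa"));
        System.out.println(asUtf8().equals("Cre\uFFFDsa"));
    }
}
```

Once the replacement character is in the String, no output setting can turn it back into ü, which is why the charset must be passed to the parse step.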

晒暮凉 2024-12-15 21:45:12

Well, I found another way to do this. In my case, I had a Jsoup Connection object and wanted to retrieve the HTML response of a POST request from a website encoded in ISO-8859-15. Because Jsoup's default encoding is UTF-8, the response HTML came back with � replacing some letters, so I needed to decode it as ISO-8859-15 instead. First, I created the connection:

Connection connectionTest = Jsoup.connect("URL")
.cookie("cookiereference", "cookievalue")
.method(Method.POST);

Next, I needed a Document holding the response to the POST. Since it was not clear to me how to set the response encoding in Jsoup, I executed the request and took the response body as raw bytes, which leaves the original encoding untouched. I then built a new String from that byte array with the correct charset and parsed it, so the Document is created with the right encoding:

Document response = Jsoup.parse(new String(
connectionTest.execute().bodyAsBytes(),"ISO-8859-15"));

Here is what response.html() returns before and after the change.

Before:

62.09-1-00 - Suporte t�cnico, manuten��o e outros servi�os em tecnologia da informa��o

After:

62.09-1-00 - Suporte técnico, manutenção e outros serviços em tecnologia da informação
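
The exact charset name matters in the `new String(bytes, charset)` step, because ISO-8859-1 and ISO-8859-15 disagree on a handful of byte values. The following JDK-only sketch shows one such difference. The class name `Latin9Demo` and the helper `decode` are hypothetical, and the sketch assumes the running JDK ships the ISO-8859-15 charset, as standard builds do.

```java
import java.nio.charset.Charset;

public class Latin9Demo {
    // Decodes raw response bytes with a named charset.
    // Charset.forName throws an unchecked exception if the name is not supported.
    static String decode(byte[] body, String charsetName) {
        return new String(body, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] eAcute = {(byte) 0xE9};   // é in both charsets
        byte[] currency = {(byte) 0xA4}; // ¤ in ISO-8859-1, € in ISO-8859-15

        // Accented Latin letters such as é decode identically.
        System.out.println(decode(eAcute, "ISO-8859-1").equals(decode(eAcute, "ISO-8859-15")));
        // The byte 0xA4 does not.
        System.out.println(decode(currency, "ISO-8859-1").equals("\u00A4"));
        System.out.println(decode(currency, "ISO-8859-15").equals("\u20AC"));
    }
}
```

For text like the Portuguese sample above, either charset would give the same result; the choice only shows up on the few code points that differ, such as the euro sign.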

硪扪都還晓 2024-12-15 21:45:12

The Jsoup documentation states that Jsoup should automatically detect the correct charset when reading the document, but for some reason that isn't working for me. I then tried to set the Document's charset manually using outputSettings().charset(...):

doc.outputSettings().charset("ISO-8859-1");

But that still didn't work, so perhaps I'm doing it wrong (I'm just learning Jsoup).

One workaround that did work, at least for me, was to read the web page with a Scanner that has its charset set:

     String charset = "ISO-8859-1";

     StringBuilder sb = new StringBuilder();
     // Decode the raw stream as ISO-8859-1 while reading it line by line
     try (Scanner urlScanner = new Scanner(new URL(url).openStream(), charset)) {
         while (urlScanner.hasNextLine()) {
             sb.append(urlScanner.nextLine()).append('\n');
         }
     }

     doc = Jsoup.parse(sb.toString());

I'll keep following this thread to see whether anyone comes up with a better suggestion, one that doesn't need another class to read the HTML.
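
As a variation on the Scanner workaround, the stream can also be read in one go and decoded with an explicit charset. This is only a sketch under stated assumptions: it needs Java 9 or later for `InputStream.readAllBytes()`, the class name `StreamDecodeDemo` and the helpers `readAll` and `sample` are made up, and a `ByteArrayInputStream` stands in for `new URL(url).openStream()` so the example runs without network access.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StreamDecodeDemo {
    // Reads the whole stream, closes it, and decodes the bytes with the given charset.
    static String readAll(InputStream in, Charset charset) {
        try (InputStream stream = in) {
            return new String(stream.readAllBytes(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Simulates a page served as ISO-8859-1 and decodes it the same way.
    static String sample() {
        byte[] page = "<p>Cre\u00FCsa</p>".getBytes(StandardCharsets.ISO_8859_1);
        return readAll(new ByteArrayInputStream(page), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(sample().equals("<p>Cre\u00FCsa</p>"));
    }
}
```

Unlike the line-by-line loop, this keeps the page's original line endings and does not append an extra trailing newline. The resulting String can be handed to Jsoup.parse(String) just like `sb.toString()` above.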

诗酒趁年少 2024-12-15 21:45:12

I used:

public static String charset = "UTF-8";
doc = Jsoup.parse(new URL(theURL).openStream(), charset, theURL);

I also saved the source file as UTF-8.
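
Encoding also matters on the way out. A plain ? in console output, as in the question, is what a PrintStream writes when its charset cannot represent a character. The sketch below reproduces that with the JDK alone; the class name `ConsoleEncodingDemo` and the helper `printed` are hypothetical, and a `ByteArrayOutputStream` stands in for the real console.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class ConsoleEncodingDemo {
    // Prints text through a PrintStream with the given encoding
    // and returns the raw bytes that were written.
    static byte[] printed(String text, String encoding) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, encoding)) {
            out.print(text);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String name = "Cre\u00FCsa";
        // US-ASCII has no ü, so the encoder writes '?' in its place.
        System.out.println(new String(printed(name, "US-ASCII"), StandardCharsets.US_ASCII)); // prints "Cre?sa"
        // UTF-8 encodes ü as two bytes, so six characters become seven bytes.
        System.out.println(printed(name, "UTF-8").length); // prints 7
    }
}
```

If the document is decoded correctly but the output still shows ?, wrapping the console as `new PrintStream(System.out, true, "UTF-8")` may help, assuming the terminal itself is set to UTF-8.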
