Strange encoding behavior in jsoup

Posted on 2024-12-08 19:40:30

I extract some information from the HTML source code of different pages with jsoup. Most of them are UTF-8 encoded. One of them is encoded with ISO-8859-1, which leads to a strange error (in my opinion).

The page that contains the error is:
http://www.gudi.ch/armbanduhr-metall-wasserdicht-1280x960-megapixels-p-560.html

I read the needed String with the following piece of code:

Document doc = Jsoup.connect("http://www.gudi.ch/armbanduhr-metall-wasserdicht-1280x960-megapixels-p-560.html").userAgent("Mozilla").get();
String title = doc.getElementsByClass("products_name").first().text();

The problem is the hyphen in the String "HD Armbanduhr aus Metall 4GB Wasserdicht 1280X960 – 5 Megapixels". Normal umlauts like öäü are read correctly. Only this single character, which is not output as "&#45;", causes the problem.

I tried to override the (correctly set) page encoding with out.outputSettings().charset("ISO-8859-1"), but that didn't help either.

Next, I tried to change the encoding of the string manually with the Charset class, converting between UTF-8 and ISO-8859-1 in both directions. Also no luck.

Does anyone have a tip on what I can try to get the correct character after parsing the HTML document with jsoup?

Thanks


Comments (1)

初懵 2024-12-15 19:40:30

This is a mistake of the website itself. There are actually three mistakes:

  1. The page is served without any charset in the HTTP Content-Type response header. There's ISO-8859-1 in the HTML meta tag, but that declaration is wrong for this page (see point 2), so it cannot be relied on. The average web browser will either try smart detection or treat the declared ISO-8859-1 as CP1252, which is also the platform default encoding on Windows machines, when decoding the webpage.

  2. The <meta> tag pretends that the content is ISO-8859-1 encoded, but the actual character (U+2013 EN DASH) is not covered by that charset at all. It is, however, covered by the CP1252 charset, as the single byte 0x96.

  3. According to the webpage source code, the product name uses the literal character "–" instead of an HTML entity for it (such as &ndash;), as spotted elsewhere on the same webpage.
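The second point can be checked without the website at all. The following is a minimal sketch using only the JDK; the class name EnDashDecoding and the helper decodeByte are made up for illustration. It decodes the byte 0x96 with both charsets and also asks the ISO-8859-1 encoder whether it can represent U+2013.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Jsoup or of the original answer.
public class EnDashDecoding {

    /** Decodes a single byte with the given charset and returns the resulting string. */
    public static String decodeByte(int b, Charset charset) {
        return new String(new byte[] {(byte) b}, charset);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");

        // In ISO-8859-1 every byte maps to the code point with the same value,
        // so 0x96 becomes the invisible control character U+0096.
        String asLatin1 = decodeByte(0x96, StandardCharsets.ISO_8859_1);

        // In CP1252 the same byte is the en dash, U+2013.
        String asCp1252 = decodeByte(0x96, cp1252);

        System.out.printf("ISO-8859-1:   U+%04X%n", (int) asLatin1.charAt(0));
        System.out.printf("windows-1252: U+%04X%n", (int) asCp1252.charAt(0));

        // ISO-8859-1 has no byte for the en dash at all.
        boolean latin1HasEnDash =
                StandardCharsets.ISO_8859_1.newEncoder().canEncode('\u2013');
        System.out.println("ISO-8859-1 can encode U+2013: " + latin1HasEnDash);
    }
}
```

This is consistent with the symptom in the question: umlauts such as öäü sit at the same byte values in both charsets, so only characters from the 0x80 to 0x9F range, like this dash, come out wrong.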

Jsoup can transparently fix many badly developed webpages, but this one really goes beyond what Jsoup can do. You need to read it in manually and then feed it to Jsoup as CP1252.

String url = "http://www.gudi.ch/armbanduhr-metall-wasserdicht-1280x960-megapixels-p-560.html";
// Open the stream ourselves and tell Jsoup to decode the raw bytes as CP1252 instead of trusting the page.
try (InputStream input = new URL(url).openStream()) {
    Document doc = Jsoup.parse(input, "CP1252", url);
    String title = doc.select(".products_name").first().text();
    // ...
}
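
If the text has already been decoded as ISO-8859-1 and re-fetching the page is not an option, the string itself can in principle be repaired, because ISO-8859-1 maps every byte to exactly one character and back. The sketch below is an assumption-laden illustration rather than part of the original answer: the class Latin1ToCp1252 and the method redecodeAsCp1252 are hypothetical names, it has not been run against the page in question, and it assumes the parser left the U+0096 control character intact in the extracted text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Jsoup.
public class Latin1ToCp1252 {

    private static final Charset CP1252 = Charset.forName("windows-1252");

    /**
     * Re-interprets text that was wrongly decoded as ISO-8859-1 as CP1252.
     * Characters above U+00FF cannot be mapped back to a byte and are
     * replaced by '?', so this is only safe on ISO-8859-1-decoded text.
     */
    public static String redecodeAsCp1252(String misdecoded) {
        // Step 1: recover the original bytes (lossless for U+0000..U+00FF).
        byte[] rawBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        // Step 2: decode those bytes with the charset the page really uses.
        return new String(rawBytes, CP1252);
    }

    public static void main(String[] args) {
        // U+0096 is what the byte 0x96 turns into under ISO-8859-1.
        String broken = "1280X960 \u0096 5 Megapixels";
        String fixed = redecodeAsCp1252(broken);
        System.out.println(fixed.equals("1280X960 \u2013 5 Megapixels"));
    }
}
```

Feeding the stream to Jsoup as CP1252 up front, as shown above, remains the simpler route; this round trip is only a fallback for strings that are already in hand.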