Strange encoding behavior in jsoup
I extract some information from the HTML source code of different pages with jsoup. Most of them are UTF-8 encoded. One of them is encoded with ISO-8859-1, which leads to a strange error (in my opinion).
The page that contains the error is:
http://www.gudi.ch/armbanduhr-metall-wasserdicht-1280x960-megapixels-p-560.html
I read the needed String with the following piece of code:
Document doc = Jsoup.connect("http://www.gudi.ch/armbanduhr-metall-wasserdicht-1280x960-megapixels-p-560.html").userAgent("Mozilla").get();
String title = doc.getElementsByClass("products_name").first().text();
The problem is the hyphen in the String "HD Armbanduhr aus Metall 4GB Wasserdicht 1280X960 – 5 Megapixels". Normal umlauts like öäü are read correctly. Only this single character, which is not output as "&#45;", causes the problem.
I tried to override the (correctly set) page encoding with out.outputSettings().charset("ISO-8859-1"), but that didn't help either.
Next, I tried to change the encoding of the string manually with the Charset class, converting between UTF-8 and ISO-8859-1 in both directions. Also no luck.
Does anyone have a tip on what I can try to get the correct character after parsing the HTML document with jsoup?
Thanks
Comments (1)
This is a mistake in the website itself. There are actually three mistakes:
1. The page is served without any charset in the HTTP Content-Type response header. There's ISO-8859-1 in the HTML meta tag, but this is ignored when the page is served over HTTP! The average web browser will either try smart detection or use the platform default encoding to decode the webpage, which is CP1252 on Windows machines.
2. The <meta> tag pretends that the content is ISO-8859-1 encoded, but the actual character – (U+2013 EN DASH) is not covered by that charset at all. It is, however, covered by the CP1252 charset as byte 0x96.
3. According to the webpage source code, the product name uses the literal character – instead of the corresponding HTML entity (for example &ndash;), as spotted elsewhere on the same webpage.
Jsoup can fix many badly developed webpages transparently, but this one really goes beyond Jsoup. You need to read it in manually and then feed it to Jsoup as CP1252.
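To make the charset mismatch concrete, here is a minimal sketch using only the Java standard library. The class name Cp1252Demo, the helper decode, and the sample bytes are made up for illustration; they are not taken from the page or from jsoup. It decodes the same byte 0x96 once as ISO-8859-1 and once as windows-1252 (Java's name for CP1252). With jsoup itself, the corresponding step would presumably be to open the URL's stream and pass it to the Jsoup.parse(InputStream, String, String) overload with "windows-1252" as the charset name; that part is left out so the sketch runs without the library.

```java
import java.nio.charset.Charset;

public class Cp1252Demo {
    // Hypothetical helper: decode raw bytes with the named charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "1280X960 " + byte 0x96 + " 5", roughly what a CP1252-encoded page sends.
        byte[] raw = {'1', '2', '8', '0', 'X', '9', '6', '0', ' ', (byte) 0x96, ' ', '5'};

        String asLatin1 = decode(raw, "ISO-8859-1");
        String asCp1252 = decode(raw, "windows-1252");

        // ISO-8859-1 maps 0x96 to the invisible C1 control character U+0096.
        System.out.println(Integer.toHexString(asLatin1.charAt(9)));
        // windows-1252 maps 0x96 to U+2013 EN DASH, the character the page intended.
        System.out.println(Integer.toHexString(asCp1252.charAt(9)));
    }
}
```

This also suggests why re-encoding the String after parsing does not help on its own: once the bytes have been decoded as ISO-8859-1, the String holds U+0096 rather than the dash, so the decoding has to be right at the point where the bytes are first read.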