Mixed encodings in HTML?

Posted 2024-12-06 20:31:07 | 670 words | 1 view | 0 comments

First I would like to say thank you for the help in advance.

I am currently writing a web crawler that parses HTML content, strips HTML tags, and then spell checks the text which is retrieved from the parsing.

Stripping HTML tags and spell checking have not caused any problems, using JSoup and the Google Spell Check API.

I am able to pull down content from a URL, read it into a byte[], and ultimately convert it to a String so that it can be stripped and spell checked. I am running into a problem with character encoding.

For example when parsing http://www.testwareinc.com/...

Original Text: We’ve expanded our Mobile Web and Mobile App testing services.

... the page uses ISO-8859-1 according to its meta tag...

ISO-8859-1 Parse: Weve expanded our Mobile Web and Mobile App testing services.

... then trying UTF-8...

UTF-8 Parse: We�ve expanded our Mobile Web and Mobile App testing services.

Question
Is it possible for the HTML of a web page to include a mix of encodings? If so, how can that be detected?

Comments (4)

余厌 2024-12-13 20:31:07

It looks like the apostrophe is encoded as the byte 0x92, which according to Wikipedia is an unassigned/private code point in ISO-8859-1.

At first glance, the browser seems to fall back by treating the byte as a raw one-byte Unicode code point: U+0092 (Private Use Two), which happens to be rendered as an apostrophe. On reflection, since it is a single byte, it is more probably cp1252, where 0x92 is the right single quotation mark (U+2019). Browsers must have a fallback strategy keyed on the advertised code page, such as ISO-8859-1 -> CP1252.

So there is no mix of encodings here but, as others said, a broken document, combined with a fallback heuristic that sometimes helps and sometimes does not.

If you're curious enough, you may want to dive into the Firefox or Chrome source code to see exactly what they do in such a case.
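To make the 0x92 discussion concrete, here is a minimal sketch using only the Java standard library. It decodes that single byte under three charsets and prints the resulting code point. The class and method names (`Byte92Demo`, `decode`, `codePoint`) are made up for illustration, and it assumes the JVM ships the `windows-1252` charset, which common JDKs do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows what the byte 0x92 becomes under each charset.
class Byte92Demo {
    // Decode one byte under the given charset.
    static String decode(byte b, Charset cs) {
        return new String(new byte[] { b }, cs);
    }

    // Format the first code point of s as U+XXXX.
    static String codePoint(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        byte b = (byte) 0x92;
        // ISO-8859-1 maps every byte to the code point with the same value.
        System.out.println(codePoint(decode(b, StandardCharsets.ISO_8859_1)));       // U+0092
        // windows-1252 (assumed available) maps 0x92 to the curly apostrophe.
        System.out.println(codePoint(decode(b, Charset.forName("windows-1252"))));   // U+2019
        // A lone 0x92 is malformed UTF-8, so it becomes the replacement character.
        System.out.println(codePoint(decode(b, StandardCharsets.UTF_8)));            // U+FFFD
    }
}
```

This matches the symptoms in the question: the ISO-8859-1 parse yields an invisible control character, the UTF-8 parse yields the replacement character, and only windows-1252 yields the apostrophe. The WHATWG Encoding Standard, which current browsers follow, treats the label "iso-8859-1" as windows-1252.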

止于盛夏 2024-12-13 20:31:07

A document with more than one encoding isn't a mixed document; it is a broken document.

Unfortunately, there are a lot of web pages that use an encoding that doesn't match the document's declaration, or that contain some data that is valid in the declared encoding and some that is not.

There is no good way to handle this. It is possible to try and guess the encoding of a document, but it is difficult and not 100% reliable. In cases like yours, the simplest solution is just to ignore parts of the document that can't be decoded.
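As a rough illustration of "ignore the parts that can't be decoded", the sketch below uses the standard `CharsetDecoder` with `CodingErrorAction.IGNORE`. The class name `LenientDecoder` and the method `decodeIgnoringErrors` are hypothetical, and this is only one possible way to do it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes bytes and drops anything that is not valid
// in the chosen charset instead of failing or inserting U+FFFD.
class LenientDecoder {
    static String decodeIgnoringErrors(byte[] bytes, Charset cs) {
        try {
            return cs.newDecoder()
                     .onMalformedInput(CodingErrorAction.IGNORE)
                     .onUnmappableCharacter(CodingErrorAction.IGNORE)
                     .decode(ByteBuffer.wrap(bytes))
                     .toString();
        } catch (CharacterCodingException e) {
            // Not expected with IGNORE, but decode() declares the exception.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "We" + stray byte 0x92 + "ve": ASCII text around one byte that is invalid UTF-8.
        byte[] raw = { 'W', 'e', (byte) 0x92, 'v', 'e' };
        System.out.println(decodeIgnoringErrors(raw, StandardCharsets.UTF_8)); // Weve
    }
}
```

Note that this only has an effect for charsets that can reject input, such as UTF-8. ISO-8859-1 accepts every byte, so nothing would be dropped there, and the dropped apostrophe may itself confuse a spell checker.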

溇涏 2024-12-13 20:31:07

Apache Tika has an encoding detector. There are also commercial alternatives if you need, say, something in C++ and are in a position to spend money.

I can pretty much guarantee that each web page is in one encoding, but it's easy to be mistaken about which one.
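For a sense of what guessing an encoding can look like, here is a deliberately simple heuristic built on the Java standard library. It is not Tika's detector and is far less capable: it accepts the bytes as UTF-8 if they decode strictly, and otherwise falls back to windows-1252. The class name `CharsetGuesser` and the choice of fallback charset are assumptions made for this sketch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical two-way guesser: strict UTF-8 first, windows-1252 otherwise.
class CharsetGuesser {
    // True if the whole byte array is well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Pick a charset for the page bytes; the fallback is an assumption.
    static Charset guess(byte[] bytes) {
        return isValidUtf8(bytes)
                ? StandardCharsets.UTF_8
                : Charset.forName("windows-1252");
    }

    static String decode(byte[] bytes) {
        return new String(bytes, guess(bytes));
    }

    public static void main(String[] args) {
        byte[] raw = { 'W', 'e', (byte) 0x92, 'v', 'e' };
        System.out.println(guess(raw));   // windows-1252
        System.out.println(decode(raw));  // We’ve (with U+2019)
    }
}
```

The heuristic works reasonably for Western European pages because text in a single-byte encoding with non-ASCII characters is rarely valid UTF-8 by accident. It cannot tell single-byte encodings apart from one another, which is where a statistical detector becomes useful.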

小猫一只 2024-12-13 20:31:07

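This seems to be an issue with special characters. Check whether StringEscapeUtils.escapeHtml helps, or any other method in that class.

Edit: added this snippet because the asker could not get the code working.

import org.apache.commons.lang.StringEscapeUtils; // Apache Commons Lang 2.x

public class EscapeDemo {
    public static void main(String[] args) {
        String asd = "’"; // U+2019 RIGHT SINGLE QUOTATION MARK
        // Lang 2.x escapes characters above 0x7F numerically; Lang 3 leaves them as-is
        System.out.println(StringEscapeUtils.escapeXml(asd));  // output in 2.x: &#8217;
        System.out.println(StringEscapeUtils.escapeHtml(asd)); // output: &rsquo;
    }
}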