HtmlCleaner does not decode non-English characters correctly on Android

Posted on 2024-10-13 04:02:18

I'm using HtmlCleaner to scrape an ISO-8859-1 encoded web site in Android.

I've implemented this in an external jar file that I import into my Android app.

When I run the unit tests in Eclipse, it handles Norwegian letters (æ, ø, å) correctly (I can verify that in the debugger), but in the Android app these characters look like inverted question marks.

If I attach the debugger to my Android app, I can see that these letters are wrong in exactly the places where they were correct when running the unit tests from Eclipse, so it's not a display/render/view issue in the Android app.

When I copy the text from the debuggers I get these results:

Java Process (Unit Test): «Blårek», «Benny»

Android Process (In emulator): «Bl�rek», «Benny»

I would expect these Strings to be equal, but notice how the "å" is replaced by an inverted question mark in Android.

I have tried calling htmlCleaner.getProperties().setRecognizeUnicodeChars(true) without any luck. Also, I found no way of forcing UTF-8 or ISO-8859-1 encoding in HtmlCleaner, but I'm not sure if that would have made a difference.

Here is the code I run:

HtmlCleaner htmlCleaner = new HtmlCleaner();

// connect to url and get root TagNode from HtmlCleaner
InputStream is = new URL( url ).openConnection().getInputStream();
TagNode rootNode = htmlCleaner.clean( is );

// navigate through some TagNodes, getting the ContentNode 
ContentNode cn = rootNode... 

// This String contains the incorrectly decoded characters on Android. 
// Good in Oracle JVM though..
String value = cn.toString().trim();

Does anyone know what could cause the decoding behavior to be different on Android? I guess the main difference between the two environments is that the Android app uses Android's java.io stack while my unit tests use Sun/Oracle's stack.

Thanks,
Geir


Answer by 谎言, 2024-10-20 04:02:18

HtmlCleaner can't tell what encoding to use; you are passing only the body of the response in the InputStream, but the encoding is in the "content-type" header.
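A small sketch of why the two environments can disagree, under an assumption the question does not confirm: the body bytes get decoded with whatever charset the platform treats as its default. Android's default charset is UTF-8, while a desktop JVM often defaults to a Latin-1 compatible charset. The class and method names below are made up for illustration. In ISO-8859-1 the letter "å" is the single byte 0xE5, which is not valid on its own in UTF-8, so a UTF-8 decoder substitutes the replacement character U+FFFD.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {

    // Hypothetical helper: decodes raw response bytes with an explicit charset.
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // "Blårek" as it would arrive from an ISO-8859-1 encoded page:
        // 'å' is the single byte 0xE5.
        byte[] body = {0x42, 0x6C, (byte) 0xE5, 0x72, 0x65, 0x6B};

        String asLatin1 = decode(body, StandardCharsets.ISO_8859_1);
        String asUtf8 = decode(body, StandardCharsets.UTF_8);

        // Decoding with the page's real charset recovers the letter.
        System.out.println(asLatin1.equals("Bl\u00E5rek"));
        // Decoding as UTF-8 turns the lone 0xE5 byte into U+FFFD.
        System.out.println(asUtf8.equals("Bl\uFFFDrek"));
    }
}
```

If this is what happens in your app, the fix is to stop relying on the default and supply the charset explicitly.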

You can set the character encoding on the properties of the HtmlCleaner to the correct encoding from the HTTP connection. But that would require you to parse the correct parameter from the content-type header. Alternatively, you can pass a URL instance to HtmlCleaner and let it manage the connection. Then, it will have access to all the information it needs to decode properly.
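For the first option, here is a minimal sketch of pulling the charset parameter out of a content-type header value using only the standard library. The class name, method name, and fallback parameter are hypothetical; the header value itself would come from URLConnection.getContentType(). How you then hand the charset to HtmlCleaner depends on your version; I believe there is a clean(InputStream, String) overload that takes the charset name, but check the Javadoc for the release you use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeCharset {

    /**
     * Returns the charset named in a Content-Type header value such as
     * "text/html; charset=ISO-8859-1", or the fallback when the header is
     * missing, has no charset parameter, or names an unknown charset.
     */
    static Charset charsetFrom(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                // Drop the "charset=" prefix and any surrounding quotes.
                String name = param.substring("charset=".length())
                        .trim()
                        .replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset cs = charsetFrom("text/html; charset=ISO-8859-1",
                StandardCharsets.UTF_8);
        System.out.println(cs.name());

        // Assumed usage with the code from the question (not verified here):
        //   URLConnection conn = new URL(url).openConnection();
        //   Charset cs = charsetFrom(conn.getContentType(), StandardCharsets.ISO_8859_1);
        //   TagNode rootNode = htmlCleaner.clean(conn.getInputStream(), cs.name());
    }
}
```

The fallback matters because many servers send "text/html" with no charset parameter; for this particular site ISO-8859-1 would be a reasonable choice.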
