HtmlCleaner on Android does not decode non-English characters correctly

Posted 2024-10-13 04:02:18

I'm using HtmlCleaner to scrape an ISO-8859-1 encoded website in Android.

I've implemented this in an external jar file that I import into my Android app.

When I run the unit tests in Eclipse, it handles Norwegian letters (æ, ø, å) correctly (I can verify that in the debugger), but in the Android app these characters look like inverted question marks.

If I attach the debugger to my Android app, I can see that these letters are wrong in exactly the places where they were correct when running the unit tests from Eclipse, so it's not a display/render/view issue in the Android app.

When I copy the text from the debuggers I get these results:

Java Process (Unit Test): «Blårek», «Benny»

Android Process (In emulator): «Bl�rek», «Benny»

I would expect these Strings to be equal, but notice how the "å" is replaced by an inverted question mark in Android.

I have tried running htmlCleaner.getProperties().setRecognizeUnicodeChars(true) without any luck. Also, I found no way of forcing UTF-8 or ISO-8859-1 encoding in HtmlCleaner, but I'm not sure if that would have made a difference.

Here is the code I run:

HtmlCleaner htmlCleaner = new HtmlCleaner();

// connect to url and get root TagNode from HtmlCleaner
InputStream is = new URL( url ).openConnection().getInputStream();
TagNode rootNode = htmlCleaner.clean( is );

// navigate through some TagNodes, getting the ContentNode 
ContentNode cn = rootNode... 

// This String contains the incorrectly decoded characters on Android. 
// Good in Oracle JVM though..
String value = cn.toString().trim();

Does anyone know what could cause the decoding behavior to be different on Android? I guess the main difference between the two environments is that the Android app uses Android's java.io stack while my unit tests use Sun/Oracle's stack.
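
To make the symptom concrete, here is a minimal sketch that uses only the JDK and no HtmlCleaner. It assumes, without having confirmed it, that the stream ends up being decoded with the platform default charset, which is UTF-8 on Android but can be a Latin-1 style charset on a desktop JVM. The class name CharsetDemo and the helper readAll are hypothetical names made up for this illustration, and the byte array stands in for the HTTP response body.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of HtmlCleaner or of the app in question.
final class CharsetDemo {

    // Reads the whole stream, decoding its bytes with the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the response body: "Blårek" encoded as ISO-8859-1.
        byte[] body = "Bl\u00e5rek".getBytes(StandardCharsets.ISO_8859_1);

        // Decoding with the charset the site really uses keeps the "å".
        String asLatin1 = readAll(new ByteArrayInputStream(body), StandardCharsets.ISO_8859_1);

        // Decoding the same bytes as UTF-8 turns the lone 0xE5 byte into
        // the replacement character U+FFFD.
        String asUtf8 = readAll(new ByteArrayInputStream(body), StandardCharsets.UTF_8);

        System.out.println(asLatin1.equals("Bl\u00e5rek"));
        System.out.println(asUtf8.equals("Bl\uFFFDrek"));
    }
}
```

If this assumption holds, decoding the connection's InputStream with an explicit ISO-8859-1 charset before the parser sees it should give the same result in both environments. Whether the HtmlCleaner version in use accepts a Reader or a charset name is something to check against its documentation; the sketch does not rely on it.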

Thanks,
Geir
