Detecting the encoding of an RTF document in Java
My Java program extracts text from RTF files using the RTFEditorKit. Some of the RTF files contain Cyrillic characters (Russian), and depending on the RTF version, the extracted text is either okay or contains gibberish. When it's gibberish, I can use this to get normal text:
String text = ... // extracted text
String decodedText = new String(text.getBytes("ISO-8859-1"), "cp1251");
Now the problem is that I couldn't find a way to automatically detect the encoding of the file, i.e. whether the extracted text must be decoded or not. Does anybody know how to do this? Thanks in advance!
EDIT: In the first lines of the RTF files I see something that looks like an encoding:
- Files where I get gibberish: {\rtf1\ansi\ansicpg1251\deff0\deflang1049
- Files with okay text: {\rtf1\ansi\ansicpg1251\deff0
Comments (4)
I don't believe the file itself has an encoding. From the Wikipedia page:
so I suspect you'll have to extract the text yourself and then parse further using the above rules.
RTF files begin with two control sequences. The first specifies the RTF version (not the version of the standard; it is almost always \rtf1), and the second specifies the character set, which is one of \ansi (usual), \mac, \pc, or \pca (almost never encountered). Immediately after this, it is possible to specify a code page with \ansicpg, which modifies the default interpretation of characters. There's not a whole lot of documentation I can find on this. Try looking at http://msdn.microsoft.com/en-us/library/aa140301(office.10).aspx. The nice folks on the AbiWord developer's mailing list have also spent a lot of time deciphering the various RTF specs.
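To make the header layout described above concrete, here is a minimal sketch that pulls the \ansicpg number out of the first line of an RTF file and tries to map it to a Java charset. The class and method names (RtfHeaderCodePage, declaredCodePage, charsetFor) are made up for illustration and are not part of RTFEditorKit; the "windows-" + number naming is an assumption that holds for the common 125x code pages. Note that in the question both the good and the gibberish files declare \ansicpg1251, so the declared code page tells you which charset to decode with, but probably not whether decoding is needed.

```java
import java.nio.charset.Charset;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of RTFEditorKit or any library.
final class RtfHeaderCodePage {

    // Matches the \ansicpgN control word, e.g. \ansicpg1251.
    private static final Pattern ANSICPG = Pattern.compile("\\\\ansicpg(\\d{1,5})");

    /** Returns the code page number declared in the RTF header, if there is one. */
    static Optional<Integer> declaredCodePage(String rtfHeader) {
        Matcher m = ANSICPG.matcher(rtfHeader);
        return m.find() ? Optional.of(Integer.parseInt(m.group(1))) : Optional.empty();
    }

    /** Maps a Windows code page number to a Java charset, if this JVM supports it. */
    static Optional<Charset> charsetFor(int codePage) {
        // Assumption: the Java charset name is "windows-" + code page number.
        String name = "windows-" + codePage;
        return Charset.isSupported(name) ? Optional.of(Charset.forName(name)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Header taken from the question.
        String header = "{\\rtf1\\ansi\\ansicpg1251\\deff0\\deflang1049";
        declaredCodePage(header)
                .flatMap(RtfHeaderCodePage::charsetFor)
                .ifPresent(cs -> System.out.println("Declared charset: " + cs.name()));
    }
}
```

In practice you would read only the first few hundred bytes of the file as ISO-8859-1 or US-ASCII and pass that string to declaredCodePage, since the header control words are plain ASCII.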
I don't believe Java has anything within the standard libraries to do this.
Check out the ICU component. It has a Java variant and you can use the CharsetDetector to get the document encoding.
Internet Explorer uses character frequency count to guess the language and the encoding used. It sort of works. Do something similar.
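A very rough version of that frequency idea, applied to the question's specific case, could look like the sketch below. It relies on the fact that Windows-1251 stores the Russian letters А to я in bytes 0xC0 to 0xFF, so Russian text wrongly decoded as ISO-8859-1 shows up almost entirely as characters in U+00C0 to U+00FF. The class name MojibakeHeuristic, its methods, and the "more suspicious characters than other letters" threshold are all assumptions for illustration, not a tested detection algorithm.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical heuristic for illustration; the threshold is an untuned assumption.
final class MojibakeHeuristic {

    /** Guesses whether text is Windows-1251 Cyrillic that was decoded as ISO-8859-1. */
    static boolean looksMisdecoded(String text) {
        int latin1High = 0;   // chars in U+00C0..U+00FF, where cp1251 letters land
        int cyrillic = 0;     // real Cyrillic letters, meaning the text is already fine
        int otherLetters = 0; // every other letter, e.g. plain ASCII
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x00C0 && c <= 0x00FF) {
                latin1High++;
            } else if (c >= 0x0400 && c <= 0x04FF) {
                cyrillic++;
            } else if (Character.isLetter(c)) {
                otherLetters++;
            }
        }
        // Assumed rule: no real Cyrillic, and suspicious chars outnumber other letters.
        return cyrillic == 0 && latin1High > otherLetters;
    }

    /** Applies the re-decoding trick from the question only when the heuristic fires. */
    static String fixIfNeeded(String text) {
        if (!looksMisdecoded(text)) {
            return text;
        }
        byte[] raw = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName("windows-1251"));
    }

    public static void main(String[] args) {
        System.out.println(looksMisdecoded("plain ASCII text"));
    }
}
```

Western European text with a few accented letters should not trigger the rule, because the accented letters are outnumbered by ordinary ones. Text that mixes a lot of Latin words with Russian may be missed, so the threshold would need tuning on real documents.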