Detecting the encoding of an RTF document in Java

Posted on 2024-08-16 15:16:18

My Java program extracts text from RTF files using RTFEditorKit. Some of the RTF files contain Cyrillic characters (Russian), and depending on the RTF version, the extracted text is either okay or gibberish. When it is gibberish, I can recover the normal text like this:

String text = ... // extracted text

String decodedText = new String(text.getBytes("ISO-8859-1"), "cp1251");

Now the problem is that I couldn't find a way to automatically detect the encoding of the file, i.e. whether the extracted text must be decoded or not. Does anybody know how to do this? Thanks in advance!

EDIT: In the first lines of the RTF files I see something that looks like an encoding:

  • Files where I get gibberish: {\rtf1\ansi\ansicpg1251\deff0\deflang1049
  • Files with okay text: {\rtf1\ansi\ansicpg1251\deff0

Comments (4)

一个人的旅程 2024-08-23 15:16:18

I don't believe the file itself has an encoding. From the Wikipedia page:

"RTF is an 8-bit format. That would limit it to ASCII, but RTF can encode characters beyond ASCII by escape sequences. The character escapes are of two types: code page escapes and Unicode escapes. In a code page escape, two hexadecimal digits following an apostrophe are used for denoting a character taken from a Windows code page. For example, if control codes specifying Windows-1256 are present, the sequence \'c8 will encode the Arabic letter beh (ب).

"If a Unicode escape is required, the control word \u is used, followed by a 16-bit signed decimal integer giving the Unicode codepoint number."

So I suspect you'll have to extract the text yourself and then parse it further using the rules above.
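The two escape rules quoted above can be sketched as a small decoder. This is only an illustration, not a full RTF parser. The class name `RtfEscapeDecoder` and its methods are hypothetical. It assumes a single-byte code page, assumes the default of one fallback character after each \u escape, and ignores groups, other control words, and escaped backslashes.

```java
import java.nio.charset.Charset;

final class RtfEscapeDecoder {
    /** Decodes \'hh and \uN escapes in a fragment of RTF body text; everything else is copied as-is. */
    static String decode(String rtf, Charset codePage) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < rtf.length()) {
            if (isHexEscape(rtf, i)) {
                // Code page escape: the two hex digits are one byte in the given code page.
                byte b = (byte) Integer.parseInt(rtf.substring(i + 2, i + 4), 16);
                out.append(new String(new byte[] {b}, codePage));
                i += 4;
                continue;
            }
            int end = unicodeEscapeEnd(rtf, i);
            if (end > 0) {
                // Unicode escape: signed 16-bit decimal; the cast wraps negative values to 0x8000..0xFFFF.
                int n = Integer.parseInt(rtf.substring(i + 2, end));
                out.append((char) n);
                i = skipFallback(rtf, end);
                continue;
            }
            out.append(rtf.charAt(i++));
        }
        return out.toString();
    }

    private static boolean isHexEscape(String s, int i) {
        return i + 4 <= s.length() && s.charAt(i) == '\\' && s.charAt(i + 1) == '\''
                && Character.digit(s.charAt(i + 2), 16) >= 0
                && Character.digit(s.charAt(i + 3), 16) >= 0;
    }

    /** Returns the index just past the digits of a \uN escape starting at i, or -1 if there is none. */
    private static int unicodeEscapeEnd(String s, int i) {
        if (i + 2 >= s.length() || s.charAt(i) != '\\' || s.charAt(i + 1) != 'u') return -1;
        int j = i + 2;
        if (s.charAt(j) == '-') j++;
        int digitsStart = j;
        while (j < s.length() && Character.isDigit(s.charAt(j))) j++;
        return j > digitsStart ? j : -1;
    }

    /** Skips the optional space delimiter and one fallback character (a plain char or a \'hh escape). */
    private static int skipFallback(String s, int j) {
        if (j < s.length() && s.charAt(j) == ' ') j++;
        if (isHexEscape(s, j)) return j + 4;
        if (j < s.length() && s.charAt(j) != '\\') return j + 1;
        return j;
    }
}
```

For Russian files, `codePage` would be `Charset.forName("windows-1251")`, assuming the JDK in use ships that charset.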

谎言 2024-08-23 15:16:18

RTF files begin with two control sequences. The first specifies the RTF version (not the revision of the standard; it is almost always the control sequence \rtf1). The second specifies the character set, which is one of \ansi (the usual case), \mac, \pc, or \pca (almost never encountered). Immediately after this, \ansicpg can specify a code page that modifies the default interpretation of characters.

I can't find much documentation on this. Try looking at http://msdn.microsoft.com/en-us/library/aa140301(office.10).aspx; the nice folks on the AbiWord developers' mailing list have also spent a lot of time deciphering the various RTF specs.
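As a rough sketch of reading that header, the code page number after \ansicpg can be pulled out with a regular expression. The class and method names are hypothetical. Mapping the number to a `windows-N` charset name is an assumption that holds for the common single-byte Windows code pages such as 1251, but not necessarily for every value.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RtfHeaderCharset {
    // Matches the control word \ansicpg followed by its decimal parameter.
    private static final Pattern ANSICPG = Pattern.compile("\\\\ansicpg(\\d+)");

    /** Returns the code page number declared by \ansicpgN, or -1 if the header has none. */
    static int declaredCodePage(String rtfHeader) {
        Matcher m = ANSICPG.matcher(rtfHeader);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** Maps a code page number to a Java charset, or null if this JVM has no "windows-N" charset for it. */
    static Charset toCharset(int codePage) {
        if (codePage <= 0) return null;
        String name = "windows-" + codePage;
        return Charset.isSupported(name) ? Charset.forName(name) : null;
    }
}
```

Both headers shown in the question declare \ansicpg1251, so this tells you which code page to use if re-decoding is needed. It does not tell you whether the extracted text is already correct.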

菊凝晚露 2024-08-23 15:16:18

I don't believe Java has anything within the standard libraries to do this.

Check out the ICU component. It has a Java variant and you can use the CharsetDetector to get the document encoding.

白芷 2024-08-23 15:16:18

Internet Explorer uses character frequency count to guess the language and the encoding used. It sort of works. Do something similar.
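A very crude version of that idea, specialised to the question's case, might look like the sketch below. It is not how Internet Explorer does it. The class name and the threshold are assumptions. Text that was wrongly decoded as ISO-8859-1 consists mostly of characters in the U+0080..U+00FF range, while correctly decoded Russian text falls in the Cyrillic block, so counting the two groups gives a usable hint.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CyrillicMojibakeFixer {
    /** Guesses whether the text is code page bytes that were misread as ISO-8859-1. */
    static boolean looksMisdecoded(String text) {
        int highLatin1 = 0;
        int cyrillic = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x80 && c <= 0xFF) {
                highLatin1++;
            } else if (Character.UnicodeBlock.of(c) == Character.UnicodeBlock.CYRILLIC) {
                cyrillic++;
            }
        }
        // Assumed threshold: more high Latin-1 characters than real Cyrillic ones.
        return highLatin1 > 0 && highLatin1 > cyrillic;
    }

    /** Applies the re-decoding from the question only when the heuristic fires. */
    static String fix(String text, Charset codePage) {
        if (!looksMisdecoded(text)) {
            return text;
        }
        return new String(text.getBytes(StandardCharsets.ISO_8859_1), codePage);
    }
}
```

A heuristic like this will misfire on text that legitimately contains many accented Western European characters, so it only makes sense when the documents are known to be Russian.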
