Getting data from a file with multiple encodings

Posted on 2025-01-08 10:23:35

I'm writing a parser for Thunderbird mail files.

Input:
I've got a file with a load of emails (the main part is written in ANSI, Windows-1250, but the content is in UTF-8 or ISO-8859-2, as declared in each mail's Content-Type header).

Output:
A collection of message contents (bodies).

So this is what I do:

  1. Read the whole file into a byte[] variable (still ANSI).
  2. Convert it to a String (UTF-16 internally, but decoded from the ANSI bytes). I need a String at this point because the next step splits the bunch of messages into single messages.
  3. Split the bunch of messages into separate messages and add each one to a Collection (UTF-16).
  4. Check the Content-Type of each message.
  5. Using the JavaMail API, I call mail.getContent() (UTF-16, I guess, but I'm not sure what encoding is inside).
  6. This is my problem: I have a String that is UTF-16, I guess, and its content is e.g. ISO-8859-2, so what should I do now?
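To make steps 1 to 4 concrete, here is a minimal sketch of one way they could be arranged so that no bytes are lost before the Content-Type is known. It is not what my code does today. It assumes every message starts with a "From - " line, as in the sample input, and that the body is stored as raw bytes in the declared charset (no quoted-printable or base64 transfer encoding). The class and method names (MboxCharsetSketch, splitMessages, decodeMessage) and the windows-1250 fallback are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MboxCharsetSketch {
    // Assumed separator: a line starting with "From - ".
    // (?d) limits line terminators to '\n', so a stray 0x85 byte is not treated as a line break.
    private static final Pattern SEPARATOR = Pattern.compile("(?dm)^From - ");
    private static final Pattern CHARSET = Pattern.compile("(?i)charset=\"?([\\w-]+)\"?");

    /** Splits the raw file into messages and returns each one as untouched bytes. */
    static List<byte[]> splitMessages(byte[] file) {
        // ISO-8859-1 maps each byte 0..255 to exactly one char, so this String
        // is only a searchable view of the bytes and can be turned back losslessly.
        String view = new String(file, StandardCharsets.ISO_8859_1);
        List<byte[]> messages = new ArrayList<>();
        Matcher m = SEPARATOR.matcher(view);
        int start = -1;
        while (m.find()) {
            if (start >= 0) {
                messages.add(view.substring(start, m.start()).getBytes(StandardCharsets.ISO_8859_1));
            }
            start = m.start();
        }
        if (start >= 0) {
            messages.add(view.substring(start).getBytes(StandardCharsets.ISO_8859_1));
        }
        return messages;
    }

    /** Decodes one message with the charset named in its first charset=... parameter. */
    static String decodeMessage(byte[] message) {
        String view = new String(message, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(view);
        // Assumption: fall back to windows-1250 when no charset is declared.
        Charset cs = m.find() ? Charset.forName(m.group(1)) : Charset.forName("windows-1250");
        return new String(message, cs);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java MboxCharsetSketch <mail-file>");
            return;
        }
        byte[] file = Files.readAllBytes(Paths.get(args[0]));
        for (byte[] message : splitMessages(file)) {
            System.out.println(decodeMessage(message));
        }
    }
}
```

The point of the sketch is that each message is decoded exactly once, with its own charset. For real mail, handing the per-message bytes to JavaMail instead of the hypothetical decodeMessage would also take care of transfer encodings.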

I was using Charset and new String(byte[], String charsetName), but none of my attempts worked.

My attempt:

  1. Convert the final String from UTF-16 to UTF-8 (because it's the same number of bytes as in ISO-8859-2).
  2. Get the bytes from UTF-8 and encode them as ANSI.
  3. Decode ANSI to UTF-8.
  4. Encode UTF-8 to ISO-8859-2 (or leave it as is if it was already UTF-8).
  5. Decode from ISO-8859-2.
    But this doesn't give me any good results.
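A small, hedged sketch of why a chain like the one above can fail. Once bytes have been decoded with a charset that cannot represent them, the original bytes are gone and no later re-encoding brings them back. Only a decode that maps every byte to a distinct character (ISO-8859-1 does) can be undone. The helper names survivesRoundTrip and repair are hypothetical. Note also that the byte counts do not match in general: ISO-8859-2 stores a letter like ę in one byte, while UTF-8 needs two.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class WrongDecodeSketch {
    /** True if decoding raw with cs and encoding back with cs yields the same bytes. */
    static boolean survivesRoundTrip(byte[] raw, Charset cs) {
        // new String(byte[], Charset) replaces malformed input with U+FFFD,
        // so a lossy decode shows up as different bytes after re-encoding.
        byte[] back = new String(raw, cs).getBytes(cs);
        return Arrays.equals(raw, back);
    }

    /** Undoes a lossless wrong decode: recover the bytes, then decode them correctly. */
    static String repair(String misdecoded, Charset usedByMistake, Charset actual) {
        return new String(misdecoded.getBytes(usedByMistake), actual);
    }

    public static void main(String[] args) {
        Charset latin2 = Charset.forName("ISO-8859-2");
        // "uwagę" written with a Unicode escape so the source file stays ASCII.
        byte[] raw = "uwag\u0119".getBytes(latin2);

        // Decoding ISO-8859-2 bytes as UTF-8 is lossy; as ISO-8859-1 it is not.
        System.out.println("UTF-8 round trip ok: " + survivesRoundTrip(raw, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1 round trip ok: " + survivesRoundTrip(raw, StandardCharsets.ISO_8859_1));

        // A wrong but lossless decode can still be repaired in one step.
        String wrong = new String(raw, StandardCharsets.ISO_8859_1);
        String fixed = repair(wrong, StandardCharsets.ISO_8859_1, latin2);
        System.out.println("repaired equals original: " + fixed.equals("uwag\u0119"));
    }
}
```

A Java String has no "content charset" of its own. The charset only matters at the moment bytes become chars or chars become bytes, so a single decode with the right charset should be enough.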

How can I deal with this? There are too many decodings for me, and I feel dizzy.

Input (this was stored as a cp1250 file, but I converted it to UTF-8):

  From - Thu Dec 08 15:06:14 2011
(some mail header stuff....)
Content-Type: text/html; charset="iso-8859-2"
<table border="0" cellspacing="0" width="600"><tbody><tr><th class="ffield2"><span class="cald-word">clichéd</span> </th><td class="field1"><br>
banal; <b>banalny<b>
<br>
She made a <span class="cald-word">clichéd remark about the importance of friendship.</span>
<br>
<b>Wygԯsiԡ jakѶ banalnѠuwagꡯ wadze przyjaݮi . <br>
<b>
<b> <b><br>
</td></tr></tbody></table>
From - Thu Dec 08 15:42:09 2011
Content-Type: text/html; charset=utf-8
(some mail header stuff....)
<table border="0" cellspacing="0" width="600"><tbody><tr><th class="ffield2">nosiness</th><td class="field1"><br>
<br>
interest in somebody else's business; <b>wścibstwo<b>
<br>
Nosiness is something I can't stand, so stop asking such questions.
<br>
<b>Nie znoszę wścibstwa, więc przestań zadawać takie pytania. <b><b> <br>
<b>
</td></tr></tbody></table>
