Getting data from a file with multiple encodings
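I'm writing a parser for Thunderbird mail files.
Input: I have a file containing a batch of emails (the main part is written in ANSI, i.e. Windows-1250, but each message's content is in UTF-8 or ISO-8859-2, as declared in that message's Content-Type header).
Output: A collection of message contents (bodies).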
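This is what I do:
- Read the whole file into a byte[] variable (still ANSI).
- Convert it to a String (UTF-16 internally, but built from the ANSI bytes). I need a String at this point to get to the next step (splitting the batch of messages into single messages).
- Split the batch into separate messages and add each message to a Collection (UTF-16).
- Check the Content-Type of each message.
- Using the JavaMail API, I call
mail.getContent
(UTF-16, I guess, but I'm not sure which encoding is inside). This is my problem: I have a String that is UTF-16, I guess, and its content is e.g. ISO-8859-2, so what should I do now?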
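For reference, the splitting and decoding steps above can be sketched without losing any bytes along the way. This is only a minimal sketch under several assumptions that are not stated in the question: the class name MboxBodies and the method extractBodies are hypothetical, messages are assumed to start at lines beginning with "From - ", headers are assumed to end at the first blank line, and bodies are assumed to be single-part with no quoted-printable or base64 transfer encoding (a real parser such as JavaMail handles those cases). The key idea is that ISO-8859-1 maps every byte to exactly one char, so it can serve as a lossless intermediate form for splitting, after which each body's original bytes are decoded once with the charset declared in its Content-Type.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MboxBodies {
    // Matches charset=utf-8 as well as charset="iso-8859-2" in a header block.
    private static final Pattern CHARSET =
        Pattern.compile("charset=\"?([A-Za-z0-9_\\-]+)\"?", Pattern.CASE_INSENSITIVE);

    /** Splits raw mbox bytes into messages and decodes each body with its declared charset. */
    public static List<String> extractBodies(byte[] raw) {
        // ISO-8859-1 is a 1:1 byte-to-char mapping, so this String still carries the raw bytes.
        String text = new String(raw, StandardCharsets.ISO_8859_1);
        List<String> bodies = new ArrayList<>();
        // Assumption: every message starts with a separator line beginning with "From - ".
        for (String message : text.split("(?m)^From - .*\\r?\\n")) {
            if (message.trim().isEmpty()) {
                continue; // skip the empty chunk before the first separator
            }
            // Assumption: headers and body are separated by the first blank line.
            String[] parts = message.split("\\r?\\n\\r?\\n", 2);
            if (parts.length < 2) {
                continue; // no body found
            }
            Matcher m = CHARSET.matcher(parts[0]);
            // Fall back to US-ASCII when no charset is declared (an arbitrary choice for this sketch).
            Charset declared = m.find() ? Charset.forName(m.group(1)) : StandardCharsets.US_ASCII;
            // Re-encoding with ISO-8859-1 recovers exactly the original body bytes.
            byte[] bodyBytes = parts[1].getBytes(StandardCharsets.ISO_8859_1);
            bodies.add(new String(bodyBytes, declared));
        }
        return bodies;
    }
}
```

Calling MboxBodies.extractBodies(Files.readAllBytes(path)) on the mail file would then return one decoded String per message, each decoded only once from its original bytes.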
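I tried using Charset and new String(byte[], String charsetName), but none of my attempts worked.
My attempt:
- Convert the final String from UTF-16 to UTF-8 (because I assumed it takes the same number of bytes as in 8859-2)
- Get the bytes from the UTF-8 data and encode them as ANSI
- Decode the ANSI bytes as UTF-8
- Encode the UTF-8 text as ISO-8859-2 (or leave it as is if it is already UTF-8)
- Decode from ISO-8859-2.
But this doesn't give me any good results.
How can I deal with this? There are too many decoding steps for me, and I feel dizzy.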
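One likely reason a chain like the one above cannot work is that decoding bytes with the wrong charset can destroy information, so later re-encoding steps have nothing left to recover. The sketch below illustrates this with plain java.nio.charset calls; the class name DecodeOnce and its two helper methods are hypothetical names chosen for illustration, and the sample word is simply Polish text encoded as ISO-8859-2.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class DecodeOnce {
    /** Decodes the original bytes a single time, using the charset the message declares. */
    public static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    /** Returns true if decoding with the given charset and re-encoding yields the same bytes. */
    public static boolean survivesRoundTrip(byte[] original, Charset viaCharset) {
        byte[] back = new String(original, viaCharset).getBytes(viaCharset);
        return Arrays.equals(original, back);
    }

    public static void main(String[] args) {
        byte[] latin2 = "Wyg\u0142osi\u0142a".getBytes(Charset.forName("ISO-8859-2"));
        // Decoding ISO-8859-2 bytes as UTF-8 replaces invalid sequences, so the bytes are lost.
        System.out.println(survivesRoundTrip(latin2, Charset.forName("UTF-8")));
        // ISO-8859-1 maps every byte to one char, so the original bytes survive.
        System.out.println(survivesRoundTrip(latin2, Charset.forName("ISO-8859-1")));
        // A single decode with the declared charset gives the intended text.
        System.out.println(decode(latin2, "ISO-8859-2"));
    }
}
```

The design point is that a Java String has no "content encoding" of its own; the charset only matters at the moment raw bytes are turned into a String, so that conversion should happen exactly once, on the untouched bytes.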
Input (this was stored as a cp1250 file, but I converted it to UTF-8):
From - Thu Dec 08 15:06:14 2011
(some mail header stuff....)
Content-Type: text/html; charset="iso-8859-2"
<table border="0" cellspacing="0" width="600"><tbody><tr><th class="ffield2"><span class="cald-word">clichéd</span> </th><td class="field1"><br>
banal; <b>banalny<b>
<br>
She made a <span class="cald-word">clichéd remark about the importance of friendship.</span>
<br>
<b>Wygԯsiԡ jakѶ banalnѠuwagꡯ wadze przyjaݮi . <br>
<b>
<b> <b><br>
</td></tr></tbody></table>
From - Thu Dec 08 15:42:09 2011
Content-Type: text/html; charset=utf-8
(some mail header stuff....)
<table border="0" cellspacing="0" width="600"><tbody><tr><th class="ffield2">nosiness</th><td class="field1"><br>
<br>
interest in somebody else's business; <b>wścibstwo<b>
<br>
Nosiness is something I can't stand, so stop asking such questions.
<br>
<b>Nie znoszę wścibstwa, więc przestań zadawać takie pytania. <b><b> <br>
<b>
</td></tr></tbody></table>