How to read or parse MHTML (.mht) files in Java
I need to mine the content of most known document file types, such as:
- html
- doc/docx etc.
For most of these file formats I am planning to use Tika, but as of now Tika does not support MHTML (*.mht) files ( http://en.wikipedia.org/wiki/MHTML ).
There are a few examples in C# ( http://www.codeproject.com/KB/files/MhtBuilder.aspx ), but I found none in Java.
I tried opening the *.mht file in 7Zip and it failed, although WinZip was able to decompress the file into images and text (CSS, HTML, script) as text and binary files.
As per the MSDN page ( http://msdn.microsoft.com/en-us/library/aa767785%28VS.85%29.aspx#compress_content ) and the Code Project page I mentioned earlier, mht files use GZip compression.
Attempting to decompress in Java results in the following exceptions.
With java.util.zip.GZIPInputStream:
java.io.IOException: Not in GZIP format
at java.util.zip.GZIPInputStream.readHeader(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at GZipTest.main(GZipTest.java:16)
And with java.util.zip.ZipFile:
java.util.zip.ZipException: error in opening zip file
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(Unknown Source)
at java.util.zip.ZipFile.<init>(Unknown Source)
at GZipTest.main(GZipTest.java:21)
Kindly suggest how to decompress it.
Thanks.
Frankly, I wasn't expecting a solution in the near future and was about to give up, but somehow I stumbled on these pages:
http://en.wikipedia.org/wiki/MIME#Multipart_messages
http://msdn.microsoft.com/en-us/library/ms527355%28EXCHG.10%29.aspx
They are not very catchy at first look, but if you read carefully you will get the clue. After reading them I fired up IE and started saving random pages as *.mht files. Let me go line by line. But let me explain beforehand that my ultimate goal was to separate/extract the html content and parse it. The solution is not complete in itself, as it depends on the character set or encoding I chose while saving. Even so, it will extract the individual files with minor hitches. I hope this will be useful for anyone who is trying to parse/decompress *.mht/MHTML files :)
======= Explanation ========
** Taken from a mht file **
- One header names the software used for saving the file.
- Then come the subject, date and MIME version, much like the mail format.
- The Content-Type header is the part which tells us that it is a multipart document. A multipart document has one or more different sets of data combined in a single body, and a multipart Content-Type field must appear in the entity's header. Here we can also see the type given as "text/html".
- The boundary is the most important part of all. It is the unique delimiter which divides the different parts (html, images, css, script etc).
Once you get hold of the delimiter, everything gets easy. Now I just have to iterate through the document, find the different sections, and save each one as per its Content-Transfer-Encoding (base64, quoted-printable etc).
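The approach described above (find the delimiter, walk the document part by part, decode each part according to its Content-Transfer-Encoding) can be sketched with the JDK alone. This is a minimal illustration and not the parser from this answer: the class name MhtSplitter and its methods are hypothetical, it assumes a single non-nested multipart document, and it assumes the file text was read with ISO-8859-1 so that every byte maps to exactly one char.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the original answer.
public class MhtSplitter {

    /** One MIME part: headers (names lower-cased) and the still-encoded body. */
    public static final class Part {
        public final Map<String, String> headers = new LinkedHashMap<>();
        public final StringBuilder body = new StringBuilder();
    }

    private static final Pattern BOUNDARY =
            Pattern.compile("boundary=\"?([^\";\\r\\n]+)\"?", Pattern.CASE_INSENSITIVE);

    /** Returns the first boundary parameter in the text (normally the top-level one). */
    public static String findBoundary(String mht) {
        Matcher m = BOUNDARY.matcher(mht);
        if (!m.find()) throw new IllegalArgumentException("no boundary parameter found");
        return m.group(1);
    }

    /** Splits the document into parts at each "--boundary" delimiter line. */
    public static List<Part> split(String mht) {
        String delimiter = "--" + findBoundary(mht);
        List<Part> parts = new ArrayList<>();
        Part current = null;
        boolean inHeaders = false;
        String lastHeader = null;
        for (String line : mht.split("\\r?\\n", -1)) {
            String trimmed = line.trim();
            if (trimmed.equals(delimiter + "--")) break;       // closing delimiter
            if (trimmed.equals(delimiter)) {                   // start of a new part
                current = new Part();
                parts.add(current);
                inHeaders = true;
                lastHeader = null;
            } else if (current == null) {
                continue;                                      // top-level headers and preamble
            } else if (!inHeaders) {
                current.body.append(line).append("\r\n");
            } else if (line.isEmpty()) {
                inHeaders = false;                             // blank line ends the part headers
            } else if ((line.startsWith(" ") || line.startsWith("\t")) && lastHeader != null) {
                current.headers.merge(lastHeader, trimmed, (a, b) -> a + " " + b); // folded header
            } else {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    lastHeader = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
                    current.headers.put(lastHeader, line.substring(colon + 1).trim());
                }
            }
        }
        return parts;
    }

    /** Decodes a part body according to its Content-Transfer-Encoding header. */
    public static byte[] decode(Part part) {
        String encoding = part.headers.getOrDefault("content-transfer-encoding", "7bit")
                .toLowerCase(Locale.ROOT);
        String body = part.body.toString();
        switch (encoding) {
            case "base64":
                return Base64.getMimeDecoder().decode(body);   // MIME decoder skips line breaks
            case "quoted-printable":
                return decodeQuotedPrintable(body);
            default:
                return body.getBytes(StandardCharsets.ISO_8859_1);
        }
    }

    static byte[] decodeQuotedPrintable(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '=') { out.write(c); continue; }
            if (s.startsWith("\r\n", i + 1)) { i += 2; continue; }  // soft line break
            if (s.startsWith("\n", i + 1)) { i += 1; continue; }
            int hi = i + 2 < s.length() ? Character.digit(s.charAt(i + 1), 16) : -1;
            int lo = i + 2 < s.length() ? Character.digit(s.charAt(i + 2), 16) : -1;
            if (hi >= 0 && lo >= 0) { out.write((hi << 4) | lo); i += 2; }
            else out.write('=');                                   // malformed escape: keep as-is
        }
        return out.toByteArray();
    }
}
```

A caller would read the whole .mht file into a String using ISO-8859-1, call MhtSplitter.split, and write MhtSplitter.decode(part) to a file chosen from the part's Content-Location or Content-Type header. Nested multiparts and charset conversion of the html text are deliberately left out of this sketch.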
SAMPLE
** JAVA CODE **
An interface for defining constants.
The main parser class...
Regards,
More compact code using the JavaMail API
You don't have to do it on your own.
With dependency
Roll your mht file
MessageTree
Then you can look into it.
;-)
Late to the party, but expanding on @wener's answer for anyone else stumbling across this.
The Apache Mime4J library seems to have the most readily accessible solution for EML or MHTML processing, much easier than rolling your own!
My prototype 'parseMhtToFile' function below rips html files and other artifacts out of a Cognos active report 'mht' file, but could be tailored to other purposes.
This is written in Groovy and requires the Apache Mime4J 'core' and 'dom' jars (currently 0.7.2).
Usage is simply:
Output is:
Thoughts on other improvements:
- For 'text' mime parts, you can access a Reader instead of a Stream, which might be more appropriate for text mining as the OP requested.
- For generated filename extensions, I'd use another library to look up the appropriate extension rather than assume the mime sub-type is adequate.
- Handle single-body (non-multipart) and recursive multipart mhtml files and other complexities. These may require a MimeStreamParser with a custom Content Handler implementation.
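The first two ideas in the list can be illustrated without any library. The sketch below is plain JDK Java, not Mime4J and not part of the Groovy prototype: the class MimePartHelpers, its methods, and the small extension table are all hypothetical, and the ISO-8859-1 fallback is an assumption for parts that declare no charset.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
public class MimePartHelpers {

    private static final Pattern CHARSET =
            Pattern.compile("charset=\"?([^\";\\s]+)\"?", Pattern.CASE_INSENSITIVE);

    // Illustrative table only; a dedicated library would cover far more types.
    private static final Map<String, String> EXTENSIONS = Map.of(
            "text/html", "html",
            "text/css", "css",
            "text/plain", "txt",
            "image/jpeg", "jpg",
            "image/png", "png",
            "image/gif", "gif",
            "application/javascript", "js");

    /** Returns the charset named in a Content-Type value, or the fallback if absent or unknown. */
    public static Charset charsetOf(String contentType, Charset fallback) {
        Matcher m = CHARSET.matcher(contentType);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // malformed or unsupported charset name: fall through to the fallback
            }
        }
        return fallback;
    }

    /** Wraps an already-decoded text part in a Reader that uses the declared charset. */
    public static Reader readerFor(byte[] decodedBody, String contentType) {
        Charset cs = charsetOf(contentType, StandardCharsets.ISO_8859_1);
        return new InputStreamReader(new ByteArrayInputStream(decodedBody), cs);
    }

    /** Maps a Content-Type value to a file extension, using "bin" for unknown types. */
    public static String extensionFor(String contentType) {
        String mimeType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return EXTENSIONS.getOrDefault(mimeType, "bin");
    }
}
```

The point of the lookup table is that a MIME sub-type such as "jpeg" or "javascript" is not always the extension you want on disk, so the mapping is kept separate from the parsing.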
You can try http://www.chilkatsoft.com/mht-features.asp ; it can pack and unpack MHT files, and you can then handle the results as normal files. The download link is: http://www.chilkatsoft.com/java.asp
I used http://jtidy.sourceforge.net to parse/read/index mht files (but as normal files, not compressed files).