Converting doc/docx to semantic HTML

Posted 2024-08-03 04:52:55 · 475 words · 12 views · 0 comments

I would like to convert doc/docx documents to semantic HTML.

Some wishes/requirements:

  1. Semantic HTML, such that headings in the document become <h1>, <h2>, etc., tables become <table>, and so forth.

  2. It should preferably handle headings, lists, tables and images. Graphs and math formulas would be a nice extra.

• It doesn't have to convert straight from doc/docx to HTML; it could use an intermediary format such as XML or DocBook.

• It should work programmatically, and with a large number of documents.

The closest thing to a solution I've found so far is http://holloway.co.nz/docvert/index.html, but unfortunately it has quite a few bugs and a small user base, and it can't handle a lot of documents. It is more of a proof of concept.
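
As a point of reference for the intermediary-XML route mentioned above: a .docx file is a ZIP archive whose main part is word/document.xml, and each paragraph there may carry a style id. The sketch below is only a minimal illustration, not a full converter. It assumes headings were authored with Word's built-in heading styles and that their style ids are "Heading1" to "Heading6" (localized or custom templates may use other ids). It ignores lists, tables, images and inline formatting, and the function names are made up for this example.

```python
import re
import zipfile
import xml.etree.ElementTree as ET
from html import escape

# WordprocessingML namespace used inside word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraphs_to_html(document_xml: bytes) -> str:
    """Map top-level paragraphs to <h1>..<h6> or <p> based on their style id."""
    root = ET.fromstring(document_xml)
    body = root.find("w:body", NS)
    out = []
    if body is None:
        return ""
    for p in body.findall("w:p", NS):
        style = p.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W}}}val", "") if style is not None else ""
        # Concatenate all text runs in the paragraph
        text = "".join(t.text or "" for t in p.iterfind(".//w:t", NS))
        if not text:
            continue  # skip empty paragraphs
        # Assumption: built-in heading styles use ids "Heading1".."Heading6"
        m = re.fullmatch(r"Heading([1-6])", style_id)
        tag = f"h{m.group(1)}" if m else "p"
        out.append(f"<{tag}>{escape(text)}</{tag}>")
    return "\n".join(out)


def docx_to_html(path) -> str:
    """Read word/document.xml out of a .docx (ZIP) file and convert it."""
    with zipfile.ZipFile(path) as z:
        return paragraphs_to_html(z.read("word/document.xml"))
```

Because it only needs the standard library, a function like docx_to_html could be called in a loop over a directory of files for batch processing. Older binary .doc files are a different format and are not covered by this sketch.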

Comments (5)

蓝眼泪 2024-08-10 04:52:55

"headers in the document are <h1>, <h2> etc."
I think this is impossible,
because MS Word only writes down the result, as <p> elements with different styles;
just like printed text on paper, the original information is not recorded.

Your other wishes can be met.
There are two commercial tools that can do this
(don't trust the free or online tools; they don't do the real work).

1 Word Cleaner by Zapadoo
www.zapadoo.com

2 HTML Cleaner for Word by wonder Studio
www.htmlcleaner.com

I prefer the second one, which was released just last year. You can try them both.

夏末 2024-08-10 04:52:55

There's a tool called upCast which is able to convert Word documents into XML.

水水月牙 2024-08-10 04:52:55

docx4j (for docx only, not doc) writes clean HTML output. You'd need to change things a bit if you wanted <h1> instead of <p class="h1">, but it's open source, so you can do that.
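
An alternative to changing docx4j itself would be to post-process its HTML output. The following is a rough sketch of that idea, assuming headings really are emitted as <p class="h1"> through <p class="h6"> as described above; the exact class names docx4j writes may differ, and the function name is hypothetical. A regular expression is enough for simple, flat output like this, though an HTML parser would be more robust for anything irregular.

```python
import re

# Assumption: headings appear as <p class="h1"> ... <p class="h6">.
# Adjust the pattern if the converter uses different class names.
HEADING_P = re.compile(
    r'<p\b[^>]*\bclass="h([1-6])"[^>]*>(.*?)</p>',
    re.IGNORECASE | re.DOTALL,
)


def promote_headings(html: str) -> str:
    """Rewrite <p class="hN">text</p> as <hN>text</hN>.

    Other attributes on the matched <p> are dropped; paragraphs
    without a heading class are left untouched.
    """
    return HEADING_P.sub(
        lambda m: f"<h{m.group(1)}>{m.group(2)}</h{m.group(1)}>",
        html,
    )
```

For example, promote_headings('<p class="h2">Intro</p>') would give '<h2>Intro</h2>'.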

奈何桥上唱咆哮 2024-08-10 04:52:55

I wrote a utility which implements the requirements you listed, excluding images, graphs and maths formulas. It's beta quality (i.e., it works on my machine). I published it at http://www.modeltext.com/word
