Batch converting docx to clean HTML
I'm starting to wonder if this is even possible. I've searched Google for solutions and found nothing that works exactly how I'd like it to.
I think it would help to explain what that entails. I work for the database group in my university's IT department. My main job is to take the specs of a report from a docx file, copy them into Dreamweaver, fix some formatting, and put the result on their website. My issue is that it's ridiculously tedious to do this over and over. I figured, hey, I haven't written anything in C# for some time now, so perhaps I could write an application to grab a docx file, convert it to HTML, fix the CSS, stick the header and footer from the webpage on there, and save the result. I originally planned to have it process files one by one, but it probably wouldn't be difficult to have it take a list of files and batch convert them.
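The batch step of that workflow can be sketched independently of how a single file is converted. This is a minimal illustration in Python rather than C#; the names `wrap_page` and `batch_convert` are made up for this example, and `convert` stands for whatever docx-to-HTML routine ends up being used.

```python
from pathlib import Path


def wrap_page(body_html, header_html, footer_html):
    """Place the converted fragment between the site's header and footer markup."""
    return f"{header_html}\n{body_html}\n{footer_html}\n"


def batch_convert(docx_paths, out_dir, convert, header_html="", footer_html=""):
    """Convert each .docx in docx_paths and write <name>.html into out_dir.

    convert is any callable that takes a path and returns an HTML fragment.
    Returns a (written, failed) pair so one bad file does not stop the batch.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written, failed = [], []
    for path in map(Path, docx_paths):
        try:
            fragment = convert(path)
        except Exception as exc:  # record the failure and keep going
            failed.append((path, exc))
            continue
        target = out_dir / (path.stem + ".html")
        target.write_text(
            wrap_page(fragment, header_html, footer_html), encoding="utf-8"
        )
        written.append(target)
    return written, failed
```

Passing the converter in as a parameter keeps the loop the same whether the conversion is done with XSLT, a library, or hand-written code.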
I've found these relevant topics on how to accomplish this, but they don't fit my needs well enough.
http://www.techrepublic.com/blog/howdoi/how-do-i-modify-word-documents-using-c/190
This is probably fine for a few documents, but since it's just automating an instance of Word, I feel like it'd be slow and memory intensive. I'd prefer to avoid opening and closing an instance of Word 50+ times.
http://openxmldeveloper.org/articles/333.aspx
This is what I started with. XSLT has the benefit of not needing Word to be installed or launched for each file. After some searching I got a proof of concept working. It takes in a docx file, decompresses it, grabs document.xml from it, and applies the DocX2Html.xsl file I scavenged from the OpenXML viewer. I believe that stylesheet was originally provided by MS for SharePoint servers so they could render Word documents in a browser, or something along those lines.
After adjusting that code to fit my needs and running into issues with the objXSLT.Load() method, I ended up using ILMerge to make the XSL into a DLL. I have no idea why I kept getting a compile error with the plain old XSL file, but the DLL worked fine, so I was satisfied. Here (http://pastebin.com/a5HBAakJ) is my current code. It converts docx to HTML just fine (other than random spaces between some words), but the resulting file has ridiculously ugly HTML syntax. An example of this monstrosity can be found here (http://pastebin.com/b8sPGmFE).
Does anyone know how I could remedy this? I'm thinking perhaps I need to write a new XSL file, since the one MS provided is what's responsible for sticking all those tags and extra code in there. My problem is that I don't know anything about writing XSL. Perhaps there's an alternative stylesheet already out there. All I need is one that preserves tables and text formatting. Images aren't needed.
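The decompress-and-grab step can be sketched with nothing but a standard library. The following is a minimal Python illustration, not the C# code linked in this post; `read_document_xml` and `paragraph_texts` are names invented for the example. It relies only on the facts that a docx is a ZIP archive and that the main body is stored in `word/document.xml`.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_document_xml(docx_source):
    """Return the parsed root element of word/document.xml.

    docx_source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_source) as archive:
        with archive.open("word/document.xml") as part:
            return ET.parse(part).getroot()


def paragraph_texts(root):
    """Collect the plain text of every paragraph (w:p) in document order."""
    texts = []
    for para in root.iter(f"{{{W_NS}}}p"):
        pieces = [t.text or "" for t in para.iter(f"{{{W_NS}}}t")]
        texts.append("".join(pieces))
    return texts
```

In C# the same two steps would map onto `System.IO.Compression.ZipArchive` and the `System.Xml` classes.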
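One alternative to rewriting the stylesheet is to walk document.xml directly and emit only the tags that matter. The sketch below is a hedged Python illustration of that idea, not an existing tool: all function names are invented for the example, and it deliberately handles only paragraphs, bold, italic, and tables. Styles, lists, hyperlink targets, tabs, line breaks, and images are ignored.

```python
import html
import io
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _is_on(rpr, name):
    """True if toggle property w:<name> (e.g. w:b) is present and not switched off."""
    if rpr is None:
        return False
    prop = rpr.find(W + name)
    if prop is None:
        return False
    return prop.get(W + "val", "true") not in ("0", "false")


def run_to_html(run):
    """Escape a run's text and wrap it in <em>/<strong> as its properties require."""
    out = html.escape("".join(t.text or "" for t in run.findall(W + "t")))
    if not out:
        return ""
    rpr = run.find(W + "rPr")
    if _is_on(rpr, "i"):
        out = f"<em>{out}</em>"
    if _is_on(rpr, "b"):
        out = f"<strong>{out}</strong>"
    return out


def paragraph_to_html(para):
    # iter() also reaches runs nested in wrappers such as w:hyperlink.
    return "<p>" + "".join(run_to_html(r) for r in para.iter(W + "r")) + "</p>"


def table_to_html(tbl):
    rows = []
    for tr in tbl.findall(W + "tr"):
        cells = ["<td>" + blocks_to_html(tc) + "</td>" for tc in tr.findall(W + "tc")]
        rows.append("<tr>" + "".join(cells) + "</tr>")
    return "<table>" + "".join(rows) + "</table>"


def blocks_to_html(container):
    """Convert the direct w:p and w:tbl children of a body or table cell."""
    parts = []
    for child in container:
        if child.tag == W + "p":
            parts.append(paragraph_to_html(child))
        elif child.tag == W + "tbl":
            parts.append(table_to_html(child))
    return "\n".join(parts)


def docx_to_html(docx_source):
    """Return a clean HTML fragment for the body of a docx path or file object."""
    with zipfile.ZipFile(docx_source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return blocks_to_html(root.find(W + "body"))
```

The output contains only `p`, `strong`, `em`, `table`, `tr`, and `td`, so whatever CSS the target site already has applies without further cleanup.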
Comments (3)
This looks like just what you need: http://msdn.microsoft.com/en-us/library/ff628051(v=office.14).aspx
The author, Eric White, blogged about his experiences developing that tool. You can see the list of posts on his blog here: http://blogs.msdn.com/b/ericwhite/archive/2008/10/20/eric-white-s-blog-s-table-of-contents.aspx#Open_XML_to_XHtml
Since I'm a big fan of Aspose.Words, a commercial library to create/process Word documents, I would do something like:
Since you wrote that you work at a university, I'm not sure whether commercial packages are an option, though.
Hi, I'm not sure what the rules are on promoting your own solutions, so do let me know if I am out of line.
I am a web developer who had the same issues, so I created my own tool:
http://www.convertwordtohtml.com
We are also working on a new version that will have even better conversion quality and one-click conversion, e.g. you can right-click on a Word file and it will be converted directly to HTML, with the code placed on the clipboard. The current version also supports command-line access, and the new version will have a server version too.
There is a free trial version downloadable from the site, and if you have any questions, do contact me any time.