Converting a Word document to HTML

Posted on 2024-08-22 07:43:31

I want to save a Word document as HTML using Word Viewer, without having Word installed on my machine. Is there any way to accomplish this in C#?

Comments (8)

挽你眉间 2024-08-29 07:43:31

To convert a .docx file to HTML, you can use OpenXmlPowerTools. Make sure to add references to OpenXmlPowerTools.dll and to the Open XML SDK (DocumentFormat.OpenXml.dll), which provides WordprocessingDocument.

using System.IO;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools;

// Work on an in-memory copy so the file on disk is never modified.
byte[] byteArray = File.ReadAllBytes(DocxFilePath);
using (MemoryStream memoryStream = new MemoryStream())
{
     memoryStream.Write(byteArray, 0, byteArray.Length);
     // Opened as editable (true) because the converter may change the copy while converting.
     using (WordprocessingDocument doc = WordprocessingDocument.Open(memoryStream, true))
     {
          HtmlConverterSettings settings = new HtmlConverterSettings()
          {
               PageTitle = "My Page Title"
          };
          XElement html = HtmlConverter.ConvertToHtml(doc, settings);

          // Write the resulting XHTML tree to the target file.
          File.WriteAllText(HTMLFilePath, html.ToStringNewLineOnAttributes());
     }
}

七婞 2024-08-29 07:43:31

You can try Microsoft.Office.Interop.Word. Note that this approach requires Word to be installed on the machine.

    using System;
    using Word = Microsoft.Office.Interop.Word;

    public static void ConvertDocToHtml(object sourcePath, object targetPath)
    {
        object missing = Type.Missing;
        object format = Word.WdSaveFormat.wdFormatHTML;
        object doNotSave = Word.WdSaveOptions.wdDoNotSaveChanges;

        // Starts a Word instance through COM.
        Word._Application app = new Word.Application();
        try
        {
            Word.Document doc = app.Documents.Open(ref sourcePath, ref missing,
                                     ref missing, ref missing, ref missing,
                                     ref missing, ref missing, ref missing,
                                     ref missing, ref missing, ref missing,
                                     ref missing, ref missing, ref missing, ref missing);

            // Save the opened document itself rather than relying on ActiveDocument.
            doc.SaveAs(ref targetPath, ref format,
                        ref missing, ref missing, ref missing,
                        ref missing, ref missing, ref missing,
                        ref missing, ref missing, ref missing,
                        ref missing, ref missing, ref missing,
                        ref missing, ref missing);

            doc.Close(ref doNotSave, ref missing, ref missing);
        }
        finally
        {
            // Quit Word so that no WINWORD.EXE process is left running.
            app.Quit(ref doNotSave, ref missing, ref missing);
        }
    }

心清如水 2024-08-29 07:43:31

I wrote Mammoth for .NET, a library that converts docx files to HTML. It is available on NuGet (https://www.nuget.org/packages/Mammoth).

Mammoth tries to produce clean HTML by looking at semantic information: for instance, it maps paragraph styles in Word (such as Heading 1) to appropriate tags and styles in HTML/CSS (such as <h1>). If you want something that produces an exact visual copy, then Mammoth probably isn't for you. If you have something that's already well structured and want to convert it to tidy HTML, Mammoth might do the trick.
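
As a rough illustration of the approach described above, here is a minimal sketch that assumes the Mammoth NuGet package is referenced. The file names are placeholders, and the style-map line uses a made-up Word style name ("Section Title") purely to show the mapping syntax.

```csharp
using System;
using System.IO;
using Mammoth;

public static class MammothExample
{
    public static void Main()
    {
        // Placeholder paths for illustration.
        string docxPath = "document.docx";
        string htmlPath = "document.html";

        var converter = new DocumentConverter()
            // Optional: map a custom Word paragraph style to an HTML element.
            // "Section Title" is a hypothetical style name.
            .AddStyleMap("p[style-name='Section Title'] => h1:fresh");

        var result = converter.ConvertToHtml(docxPath);

        // result.Value holds the generated HTML as a fragment, not a full page.
        File.WriteAllText(htmlPath, result.Value);

        // Warnings describe anything Mammoth could not map cleanly.
        foreach (string warning in result.Warnings)
        {
            Console.WriteLine(warning);
        }
    }
}
```

Because the output is a fragment, it would typically be wrapped in your own page template and stylesheet.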

三生一梦 2024-08-29 07:43:31

According to this Stack Overflow question, it isn't possible with Word Viewer. You will need Word itself in order to use COM Interop to interact with it.

忘羡 2024-08-29 07:43:31

I think this will depend on the version of the Word document. If you have them in docx format, I believe they are stored within the file as XML data (but it is so long since I looked at the specification that I am perfectly happy to be corrected on that).
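
To make the point about XML concrete: a .docx file is a ZIP package, and the main body is normally stored in a part named word/document.xml. The following is a minimal sketch using only System.IO.Compression from the base class library. The class and method names (DocxInspector, ReadMainDocumentXml) are made up for this illustration, and the code assumes the conventional part name instead of resolving it through the package relationships.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of any library.
public static class DocxInspector
{
    // Reads the main document part of a .docx package as XML.
    // Assumes the conventional part name "word/document.xml".
    public static XDocument ReadMainDocumentXml(Stream docxStream)
    {
        using (var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
            {
                throw new InvalidDataException("word/document.xml not found; is this a .docx file?");
            }

            using (Stream partStream = entry.Open())
            {
                return XDocument.Load(partStream);
            }
        }
    }
}
```

For a file on disk, the stream could come from File.OpenRead(path). In the returned XML, paragraphs appear as w:p elements in the WordprocessingML namespace. Legacy .doc files use a binary format and cannot be read this way.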

轻拂→两袖风尘 2024-08-29 07:43:31

If you're open to not using C#, you could print to file using PrimoPDF (which would turn the .doc into a .pdf) and then use a PDF-to-HTML converter to go the rest of the way. After that, you can edit the HTML however you like.

冷血 2024-08-29 07:43:31

GemBox works pretty well. It even converts images in the Word document to base64-encoded strings in img tags.
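
A minimal sketch of what such a conversion might look like, assuming the GemBox.Document NuGet package is referenced and run in its free mode, which limits document size. The file names are placeholders, and the option names should be checked against the GemBox version you use.

```csharp
using GemBox.Document;

public static class GemBoxExample
{
    public static void Main()
    {
        // Free-mode key; a purchased key would replace it.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Placeholder paths for illustration.
        DocumentModel document = DocumentModel.Load("document.docx");

        // EmbedImages asks the HTML writer to inline images as base64 data
        // instead of writing them to separate files.
        var options = new HtmlSaveOptions { EmbedImages = true };
        document.Save("document.html", options);
    }
}
```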

凡间太子 2024-08-29 07:43:31

Using the document conversion tools available in OpenOffice.org is probably the only possible option. The .doc format is only designed to be opened via Microsoft products, so any library dealing with it will need to have reverse-engineered the entire format.
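
One way to drive such a conversion from C# is to shell out to the office suite's command-line converter. This is a minimal sketch, assuming a LibreOffice install (the OpenOffice.org descendant) whose soffice binary supports the --headless and --convert-to switches; OpenOffice.org builds may not accept them. The class name and the path to soffice are placeholders, not part of any library.

```csharp
using System;
using System.Diagnostics;

// Hypothetical wrapper for illustration; not part of any library.
public static class OfficeCliConverter
{
    // Builds the soffice command line for a headless HTML conversion.
    public static string BuildArguments(string inputPath, string outputDir)
    {
        return string.Format(
            "--headless --convert-to html --outdir \"{0}\" \"{1}\"",
            outputDir, inputPath);
    }

    // Runs soffice and returns true when it exits with code 0.
    // sofficePath is an assumption: point it at your own install.
    public static bool ConvertToHtml(string sofficePath, string inputPath, string outputDir)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = sofficePath,
            Arguments = BuildArguments(inputPath, outputDir),
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            process.WaitForExit();
            return process.ExitCode == 0;
        }
    }
}
```

This keeps Word out of the picture entirely, at the cost of requiring the office suite on the machine and starting a separate process per conversion.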
