Error when converting HTML content to a Word document

Posted on 2024-12-08 10:17:28

Hello everyone, I am using Html Agility Pack and Open XML to convert my HTML content into Word document content.

<div>
<div id="container">
<div>
<div>
<!--content starts here//-->
<form name="questions" method="post">
<img src="../../content/0/Static UPload/Divya_3LevelLeftMenu_Operating System v8.0 English/unit9/lesson27/../../images/less_title_27.jpg" width="750" height="75">
<div id="title">Exercise
<table border="0" cellspacing="20" cellpadding="0">
  <tr>
    <td><b> Student's Name: </b><br>
      <input type="text" name="b1" size="45"></td>
    <td><b>Class:</b><br>
      <input type="text" name="b2" size="45"></td>
  </tr>
</table>
<td width="176" align="left"> </td>
    <tr><td width="779" align="left"> </td>
    </tr>
       <ol>
      <li>Describe the purpose of Windows Update. 
      <p align="left"><textarea name="a1" rows="10" wrap="VIRTUAL" cols="55"></textarea></p>
      </li>
    </ol>

    <ol start="2">
      <li>Explain why using Windows Update is critical to maintaining an operating system.
        <p align="left"><textarea name="a2" rows="10" wrap="VIRTUAL" cols="55"></textarea></p>
      </li>
    </ol>
    <ol start="3">
      <li>Summarize the process used to access and install Windows Updates.  
        <p align="left"><textarea name="a3" rows="10" wrap="VIRTUAL" cols="55"></textarea></p>
      </li>
    </ol>
    <ol start="4">
      <li>Compare and contrast using Windows Update and using a Windows Service Pack. 
        <p align="left"><textarea name="a4" rows="10" wrap="VIRTUAL" cols="55"></textarea></p>
      </li>
    </ol>
    <center><p><b>Note: You must print your completed exercise
    to submit to your instructor.</b><br>
    <b class="style1"><u>Do Not</u></b> close this window without printing your exercise or your answers will be lost.<br><br>
            <input onclick="reLoadMe(document.questions) " type="button" value="Print Preview">
      </p>
    </center>
</form>
    <div align="center"><a href="#top"><img src="../../content/0/Static UPload/Divya_3LevelLeftMenu_Operating System v8.0 English/unit9/lesson27/../../images/back_to_top.jpg" alt="" width="40" height="21" border="0"></a>

</div></div></div></div></div></div>

This is the HTML content I am trying to convert, but I get the following error while parsing it.

   at NotesFor.HtmlToOpenXml.TableContext.get_CurrentTable()
   at NotesFor.HtmlToOpenXml.HtmlConverter.ProcessTableColumn(HtmlEnumerator en)
   at NotesFor.HtmlToOpenXml.HtmlConverter.ProcessHtmlChunks(HtmlEnumerator en, String endTag)
   at NotesFor.HtmlToOpenXml.HtmlConverter.Parse(String html)
   at WebApplication3.WebForm3.Button1_Click(Object sender, EventArgs e) in C:\Users\USER\Documents\Visual Studio 2008\Projects\Piyush_training\WebApplication3\WebForm3.aspx.cs:line 102

My code is as follows.

    using System;
    using System.IO;
    using System.Linq;
    using System.Text;
    using DocumentFormat.OpenXml;
    using DocumentFormat.OpenXml.Drawing;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Wordprocessing;
    using wp = DocumentFormat.OpenXml.Drawing.Wordprocessing;
    using HtmlAgilityPack;
    using NotesFor.HtmlToOpenXml;

    protected void Button1_Click(object sender, EventArgs e)
    {
        const string filename = "C:/Temp/test.docx";
        Response.ContentEncoding = System.Text.Encoding.UTF7;
        System.IO.StringWriter SW = new System.IO.StringWriter();

        // Placeholder: in the real code this holds the HTML content shown above.
        string pagecontent = "";

        // Load the HTML with Html Agility Pack and write it back out as a string.
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(pagecontent);
        doc.OptionCheckSyntax = true;
        doc.OptionAutoCloseOnEnd = true;
        doc.OptionFixNestedTags = true;
        int errorCount = doc.ParseErrors.Count();

        doc.Save(SW);
        System.Web.UI.HtmlTextWriter htmlTW = new System.Web.UI.HtmlTextWriter(SW);
        string strBody = "<html>" + "<body>" + "<div><b>" + htmlTW.InnerWriter.ToString() + "</b></div>" + "</body>" + "</html>";

        string html = strBody;

        try
        {
            using (MemoryStream generatedDocument = new MemoryStream())
            {
                using (WordprocessingDocument package = WordprocessingDocument.Create(generatedDocument, WordprocessingDocumentType.Document))
                {
                    // Make sure the package has a main document part with an empty body.
                    MainDocumentPart mainPart = package.MainDocumentPart;
                    if (mainPart == null)
                    {
                        mainPart = package.AddMainDocumentPart();
                        new Document(new Body()).Save(mainPart);
                    }

                    HtmlConverter converter = new HtmlConverter(mainPart);
                    converter.ExcludeLinkAnchor = true;
                    converter.RefreshStyles();
                    converter.ImageProcessing = ImageProcessing.AutomaticDownload;
                    Body body = mainPart.Document.Body;
                    converter.ConsiderDivAsParagraph = false;

                    // The exception in the stack trace above is thrown by this call.
                    var paragraphs = converter.Parse(html);
                    for (int i = 0; i < paragraphs.Count; i++)
                    {
                        body.Append(paragraphs[i]);
                    }

                    mainPart.Document.Save();
                }

                File.WriteAllBytes(filename, generatedDocument.ToArray());
            }

            System.Diagnostics.Process.Start(filename);
        }
        catch (Exception ex)
        {
            Response.Write(ex.ToString());
        }
    }
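
A possible reading of the stack trace, offered as a guess rather than a confirmed diagnosis: the failure occurs in ProcessTableColumn while looking up the current table, and the HTML above contains `<td>` and `<tr>` elements placed after the closing `</table>`. The sketch below uses Html Agility Pack to unwrap such orphan cells before the HTML is handed to the converter. The class and method names are hypothetical, and whether this removes the error has not been verified against HtmlToOpenXml.

```csharp
using HtmlAgilityPack;

static class HtmlTableCleanup
{
    // Unwraps <td>, <th> and <tr> elements that have no <table> ancestor,
    // keeping their child nodes in place. (Hypothetical helper, not part of
    // Html Agility Pack or HtmlToOpenXml.)
    public static string UnwrapOrphanTableCells(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        const string orphanXPath =
            "//*[(self::td or self::th or self::tr) and not(ancestor::table)]";

        // Remove one orphan per iteration; children that move up and are
        // themselves orphan cells are picked up by a later iteration.
        HtmlNode orphan;
        while ((orphan = doc.DocumentNode.SelectSingleNode(orphanXPath)) != null)
        {
            // Second argument true = keep the grandchildren (move them to the parent).
            orphan.ParentNode.RemoveChild(orphan, true);
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

It would be called on the HTML string just before `converter.Parse(...)`, for example `html = HtmlTableCleanup.UnwrapOrphanTableCells(html);`.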

欢烬 2024-12-15 10:17:28

You might want to try a different approach to assembling your Word document from HTML. Depending on your requirements, you can take one of two approaches:

Assemble the document using the Open XML SDK as you have done, or
Use the altChunk method

altChunk is a special feature of Open XML word processing
markup that enables you to embed an entire Open XML document or an
HTML page at a specific location in a document.

Eric White has a number of blog posts describing this process. Below is an extract from his article. This particular snippet imports a second .docx file; embedding HTML follows the same pattern with an HTML import part instead of a WordprocessingML one.

Using V2 of the Open XML SDK:

using (WordprocessingDocument myDoc = WordprocessingDocument.Open("Test1.docx", true))
{
    string altChunkId = "AltChunkId1";
    MainDocumentPart mainPart = myDoc.MainDocumentPart;
    AlternativeFormatImportPart chunk = mainPart.AddAlternativeFormatImportPart(
        AlternativeFormatImportPartType.WordprocessingML, altChunkId);

    using (FileStream fileStream = File.Open("TestInsertedContent.docx", FileMode.Open))
        chunk.FeedData(fileStream);

    AltChunk altChunk = new AltChunk();
    altChunk.Id = altChunkId;
    mainPart.Document
        .Body
        .InsertAfter(altChunk, mainPart.Document.Body.Elements<Paragraph>().Last());
    mainPart.Document.Save();
}

The whole article along with sample code (at the bottom): How to Use altChunk for Document Assembly

始终不够爱げ你 2024-12-15 10:17:28

Use this to get content with images working.

To use the altChunk method you need an existing file. Create the file dynamically with some content first, because altChunk doesn't accept a blank file (the code below inserts the chunk after the last paragraph, so at least one paragraph must exist).

Create a .docx file with a small amount of content.
Append the HTML content.

// 'filename' (output path) and 'html' (the HTML to embed) are defined as in the question.
// GetXDocument and SaveXDocument are helper methods that are not shown in this answer.
// Requires: using System.Linq; using System.Xml.Linq;
try
{
    // new Uri(string) requires an absolute URI, so the placeholder includes the scheme.
    var domainNameURL = "http://yoursite.com/";
    var strBody = "<html>" + "<body>" + "<div> Word File </div>" + "</body>" + "</html>";

    // Step 1: create a .docx file that contains a small amount of content.
    using (MemoryStream generatedDocument = new MemoryStream())
    {
        using (WordprocessingDocument package = WordprocessingDocument.Create(generatedDocument, WordprocessingDocumentType.Document))
        {
            MainDocumentPart mainPart = package.MainDocumentPart;
            if (mainPart == null)
            {
                mainPart = package.AddMainDocumentPart();
                new Document(new Body()).Save(mainPart);
            }

            HtmlConverter converter = new HtmlConverter(mainPart);
            converter.ExcludeLinkAnchor = true;
            converter.RefreshStyles();
            converter.ImageProcessing = ImageProcessing.AutomaticDownload;
            converter.BaseImageUrl = new Uri(domainNameURL + "Images/");

            Body body = mainPart.Document.Body;
            converter.ConsiderDivAsParagraph = false;

            var paragraphs = converter.Parse(strBody);
            for (int i = 0; i < paragraphs.Count; i++)
            {
                body.Append(paragraphs[i]);
            }

            mainPart.Document.Save();
        }

        File.WriteAllBytes(filename, generatedDocument.ToArray());
    }

    // Step 2: reopen the file and append the HTML as an altChunk.
    using (WordprocessingDocument myDoc = WordprocessingDocument.Open(filename, true))
    {
        XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
        XNamespace r = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";
        string altChunkId = "AltChunkId1";
        MainDocumentPart mainPart = myDoc.MainDocumentPart;
        AlternativeFormatImportPart chunk = mainPart.AddAlternativeFormatImportPart("application/xhtml+xml", altChunkId);

        // Write the HTML into the import part.
        using (Stream chunkStream = chunk.GetStream(FileMode.Create, FileAccess.Write))
        using (StreamWriter stringStream = new StreamWriter(chunkStream))
            stringStream.Write(html);

        // Reference the import part from the body, after the last paragraph.
        XElement altChunk = new XElement(w + "altChunk",
            new XAttribute(r + "id", altChunkId));
        XDocument mainDocumentXDoc = GetXDocument(myDoc);
        mainDocumentXDoc.Root
            .Element(w + "body")
            .Elements(w + "p")
            .Last()
            .AddAfterSelf(altChunk);
        SaveXDocument(myDoc, mainDocumentXDoc);
    }
    System.Diagnostics.Process.Start(filename);
}
catch (Exception ex)
{
    Response.Write(ex.ToString());
}
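
The code above calls GetXDocument and SaveXDocument without defining them. The following is a minimal sketch of what such helpers could look like, assuming they simply read and rewrite the main document part as LINQ to XML; the class name MainPartXml is hypothetical, and the bodies are an illustration rather than the helpers the answer actually used.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;

static class MainPartXml
{
    // Loads the XML of the main document part into an XDocument.
    public static XDocument GetXDocument(WordprocessingDocument doc)
    {
        using (Stream stream = doc.MainDocumentPart.GetStream())
        using (XmlReader reader = XmlReader.Create(stream))
        {
            return XDocument.Load(reader);
        }
    }

    // Overwrites the main document part with the given XDocument.
    public static void SaveXDocument(WordprocessingDocument doc, XDocument xdoc)
    {
        // FileMode.Create truncates the part so no stale bytes remain.
        using (Stream stream = doc.MainDocumentPart.GetStream(FileMode.Create, FileAccess.Write))
        using (XmlWriter writer = XmlWriter.Create(stream))
        {
            xdoc.Save(writer);
        }
    }
}
```

To call them unqualified as in the answer, place the two methods in the same class as the calling code, or qualify the calls as `MainPartXml.GetXDocument(myDoc)` and `MainPartXml.SaveXDocument(myDoc, mainDocumentXDoc)`.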

鞋纸虽美，但不合脚ㄋ〞 2024-12-15 10:17:28

After reading the previous answers and the one at https://stackoverflow.com/a/18152334/1863970, I used this function to convert a large HTML document (with inline images) to Word:

public static byte[] HtmlToWord(string html)
{
    using (var generatedDocument = new MemoryStream())
    {
        using (var package = WordprocessingDocument.Create(generatedDocument, WordprocessingDocumentType.Document))
        {
            MainDocumentPart mainPart = package.MainDocumentPart;
            if (mainPart == null)
            {
                mainPart = package.AddMainDocumentPart();
                new Document(new Body()).Save(mainPart);
            }

            HtmlConverter converter = new HtmlConverter(mainPart);
            Body body = mainPart.Document.Body;

            string altChunkId = "myId";

            var memoryStream = new MemoryStream(Encoding.UTF8.GetBytes("<html><head></head><body>" + html + "</body></html>"));

            // Create alternative format import part.
            var formatImportPart = mainPart.AddAlternativeFormatImportPart(AlternativeFormatImportPartType.Html, altChunkId);

            // Feed HTML data into format import part (chunk).
            formatImportPart.FeedData(memoryStream);
            var altChunk = new AltChunk();
            altChunk.Id = altChunkId;

            mainPart.Document.Body.Append(altChunk);

            mainPart.Document.Save();
        }

        return generatedDocument.ToArray();
    }
}
