XML parser to read XML tags from a Word file in C#

Posted 2024-08-15 03:17:26

I have some Word template (dot/dotx) files that contain XML tags along with plain text.
At run time, I need to replace the XML tags with their respective mail merge fields.

So I need to parse the document for these XML tags and replace them with merge fields.
I was using Regex to find and replace these XML tags, but it was suggested that I use an XML parser to parse for XML tags (Regex for string enclosed in <*>, C#).

Now that I have presented my case better,
could you please tell me whether an XML parser is the right tool to achieve the above?
If yes, do I need to save the Word document as an XML file and then parse it for XML tags?

Please guide.

Comments (4)

为你鎻心 2024-08-22 03:17:26

You need to use the Word APIs. This is more complicated than you think.

Word 2003 files (.doc, .dot) are stored in a proprietary binary format. Reading this format by working from the specification is nearly impossible, so it's well worth investing in an SDK for this, or connecting directly to Word through COM to handle the processing.

Word 2007 files (.docx, .dotx) are indeed in XML, but a .docx file is actually a zipped hierarchy of folders and files that make up the document in pieces. For this, the OpenXML SDK can handle .docx, and I assume it can also handle the equivalent templates.

An alternative for the 2007 format is to create your template using Word, learn the hierarchy of files, and handle them appropriately. Change the .docx or .dotx extension to .zip, unzip it, and find where your find-and-replace tags are located. You may be able to just replace the tags, rezip the hierarchy, and rename the extension.
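The unzip, edit, rezip route can also be done in code without renaming the file. The following is only a minimal sketch using System.IO.Compression: the class name DocxPackage and its two methods are hypothetical, and it assumes the main document part lives at word/document.xml, which is the usual location but is really determined by the package relationships.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper: treats a .docx/.dotx package as a plain zip archive.
public static class DocxPackage
{
    // Assumption: the main document part sits at its conventional path.
    public const string MainPart = "word/document.xml";

    // Returns the XML of the main document part as a string.
    public static string ReadMainPart(Stream package)
    {
        using (var zip = new ZipArchive(package, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry(MainPart)
                ?? throw new InvalidDataException("No " + MainPart + " in package.");
            using (var reader = new StreamReader(entry.Open(), Encoding.UTF8))
            {
                return reader.ReadToEnd();
            }
        }
    }

    // Replaces the main document part with the given XML.
    // The stream must be readable, writable, and seekable.
    public static void WriteMainPart(Stream package, string xml)
    {
        using (var zip = new ZipArchive(package, ZipArchiveMode.Update, leaveOpen: true))
        {
            zip.GetEntry(MainPart)?.Delete();
            ZipArchiveEntry entry = zip.CreateEntry(MainPart);
            // UTF-8 without a byte order mark.
            using (var writer = new StreamWriter(entry.Open(), new UTF8Encoding(false)))
            {
                writer.Write(xml);
            }
        }
    }
}
```

For a file on disk, the stream could come from File.Open(path, FileMode.Open, FileAccess.ReadWrite). This only works for the zipped 2007 formats; binary .doc/.dot files would still need Word itself or an SDK.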

冷情妓 2024-08-22 03:17:26

Why don't you use the Word APIs to do this? I can't imagine any way to do this safely without using the APIs that were designed for the purpose.

你是年少的欢喜 2024-08-22 03:17:26

Yes, you can use the System.Xml.XmlDocument class to read your XML source. You'll also need to declare all the namespaces required to deal with that XML content.
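As a rough illustration of what declaring the namespaces looks like for Word content, the sketch below lists the text of every w:t element in a WordprocessingML string. The class and method names are hypothetical; the namespace URI is the WordprocessingML main namespace. Note that tags typed as visible text in Word are stored escaped inside w:t elements (for example &lt;name&gt;), and Word may split one piece of text across several runs, so a real document can need more handling than this.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper for reading text runs out of WordprocessingML.
public static class WordXmlText
{
    // Namespace of the main WordprocessingML elements (w:p, w:r, w:t).
    public const string W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of every w:t element, in document order.
    public static List<string> GetTextRuns(string documentXml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(documentXml);

        // The "w" prefix here is local to the XPath queries below;
        // only the namespace URI has to match the document.
        var nsm = new XmlNamespaceManager(doc.NameTable);
        nsm.AddNamespace("w", W);

        var result = new List<string>();
        foreach (XmlNode t in doc.SelectNodes("//w:t", nsm))
        {
            result.Add(t.InnerText); // InnerText is already unescaped
        }
        return result;
    }
}
```

Without the namespace manager, an XPath such as //t would match nothing, because the elements belong to the WordprocessingML namespace rather than to no namespace.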

花间憩 2024-08-22 03:17:26

First of all, I think Regex should be just fine.

But if you really want to use an XML parser, I love XmlDocument/XmlNode in .NET. The two functions SelectSingleNode and SelectNodes are infinitely useful. Unfortunately, I do not have a Word XML example in front of me, so let's assume this XML:

<Document>
  <MergeField name="phone"></MergeField>
  <MergeField name="email"></MergeField>
</Document>

You would then use code as follows:

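// Requires: using System.Xml;
XmlDocument wordDoc = new XmlDocument();
wordDoc.Load(fileName);

// Select every MergeField element anywhere in the document.
XmlNodeList mergeNodes = wordDoc.SelectNodes("//MergeField");

foreach (XmlNode mergeNode in mergeNodes)
{
   string fieldName = mergeNode.Attributes["name"].Value;
   // Do something here based on field name
   // e.g. (GetFieldValue stands for your own lookup):

   mergeNode.InnerText = GetFieldValue(fieldName);
}

// Write the modified XML back to the same file.
wordDoc.Save(fileName);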

The tricky part is that Word XML uses XML namespaces all over the place, so you need to use the XmlNamespaceManager class in .NET to tell the XML document which namespace is which. It would be more like:

// Requires: using System.Xml;
XmlDocument wordDoc = new XmlDocument();
wordDoc.Load(fileName);

// Map a prefix to the namespace URI so XPath can match qualified names.
XmlNamespaceManager nsm = new XmlNamespaceManager(wordDoc.NameTable);
nsm.AddNamespace("o", "http://somenamepaceurl.com");
XmlNodeList mergeNodes = wordDoc.SelectNodes("//o:MergeField", nsm);

foreach (XmlNode mergeNode in mergeNodes)
{
   string fieldName = mergeNode.Attributes["name"].Value;
   // Do something here based on field name
   // e.g. (GetFieldValue stands for your own lookup):

   mergeNode.InnerText = GetFieldValue(fieldName);
}

// Write the modified XML back to the same file.
wordDoc.Save(fileName);
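
To connect this back to the question, here is a hedged sketch of the same namespace-aware approach applied to real WordprocessingML: it wraps each run whose entire text is a placeholder such as <phone> in a w:fldSimple element carrying a MERGEFIELD instruction. The class name, the placeholder pattern, and the assumption that each placeholder sits alone in a single run are all illustrative; documents where Word has split a placeholder across runs would need the runs merged first.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical helper: turns "<fieldName>" placeholder runs into merge fields.
public static class MergeFieldRewriter
{
    public const string W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Assumed placeholder shape: the whole run text is "<fieldName>".
    private static readonly Regex Placeholder = new Regex(@"^<(\w+)>$");

    // Wraps each matching run in a w:fldSimple; returns how many were wrapped.
    public static int Rewrite(XmlDocument doc)
    {
        var nsm = new XmlNamespaceManager(doc.NameTable);
        nsm.AddNamespace("w", W);

        // Copy the matches to a list first, because the loop edits the tree.
        var runs = doc.SelectNodes("//w:r[w:t]", nsm).Cast<XmlNode>().ToList();
        int count = 0;

        foreach (XmlNode run in runs)
        {
            XmlNode text = run.SelectSingleNode("w:t", nsm);
            Match m = Placeholder.Match(text.InnerText);
            if (!m.Success) continue;

            string name = m.Groups[1].Value;

            // <w:fldSimple w:instr=" MERGEFIELD name "> ... </w:fldSimple>
            XmlElement field = doc.CreateElement("w", "fldSimple", W);
            XmlAttribute instr = doc.CreateAttribute("w", "instr", W);
            instr.Value = " MERGEFIELD " + name + " ";
            field.Attributes.Append(instr);

            // Put the field where the run was, then move the run inside it.
            run.ParentNode.ReplaceChild(field, run);
            field.AppendChild(run);

            // Display text shown until the merge is performed.
            text.InnerText = "\u00AB" + name + "\u00BB";
            count++;
        }
        return count;
    }
}
```

The XmlDocument would be loaded from the main document part of the .docx/.dotx package and saved back afterwards; whether Word accepts the result should be checked against a template produced by Word itself.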