Working with MS Word XML

Posted on 2024-08-12 07:05:31


It's always hard for me to explain my problem understandably (especially in English, which isn't my first language), so I'm sorry in advance if this is too intricate or too trivial ;).

What I need to do is 'parse' a Word XML document in a specific way. The document converted to XML has some parts that are put between fixed markers like [ ... ] or /* ... */ or whatever, and I need each of them to stay as one block of text. However, from:

[SOME_TEXT.SOME_OTHER_TEXT]

Word makes something like:

<w:r>
    <w:rPr><not relevant /></w:rPr>
    <w:t>
        [SOME_TEXT.
    </w:t>
</w:r>
<w:r>
    <w:rPr><not relevant /></w:rPr>
    <w:t>
        SOME_OTHER_TEXT
    </w:t>
</w:r>
<w:r>
    <w:rPr><not relevant /></w:rPr>
    <w:t>
        ]
    </w:t>
</w:r>

instead of e.g.:

<w:r>
    <w:rPr><not relevant /></w:rPr>
    <w:t>
        [SOME_TEXT.SOME_OTHER_TEXT]
    </w:t>
</w:r>

I've tried setting Application.Options.StoreRSIDOnSave to false, using common formatting for all the text, switching off the spell checker, etc., but Word still "randomly" splits some strings (especially when they're pasted from somewhere else rather than typed by hand), and I can't tell the people who are going to create those XML docs to do a hundred other things before they can use their file in my app. So I need to take care of preparing the document myself.

I'm wondering what the best and simplest solution would be: read the file through XmlDocument, loop through the nodes and remove the ones that split a marked block, taking care to close the ones that need to be closed and to leave the text between /* ... */ clean; or do the same but read the file as plain text. Or maybe someone has a better idea (like some clever regex ;))? I'll be very grateful for any help.

//edit
I managed to solve the problem. My solution may be a little 'lame', but it works perfectly ;)

Imports System.IO
Imports System.Text.RegularExpressions

Dim MyMarkedString As Boolean = False ' True while a [ ... ] block is open but not yet closed
Dim MyTextOpened As Boolean = False   ' True while inside a <w:t> element
Dim MyFile As String = File.ReadAllText(pFileName)
Dim MyFileCopy As String = String.Empty
' Each match is either a single tag or a chunk of text between two tags
For Each foundPart As Match In Regex.Matches(MyFile, "((<\??/?)(?:[^:\s>]+:)?(\w+).*?(/?\??>))|(?!<)(\[?((?!<).)+\]?)")
    Dim part As String = foundPart.Value
    If part.Equals("<w:t>") OrElse part.Contains("<w:t ") Then
        MyTextOpened = True
        ' Inside a marked block the tag is dropped so the text stays in one <w:t>
        If Not MyMarkedString Then MyFileCopy += part
    ElseIf part.Equals("</w:t>") OrElse part.Contains("</w:t ") Then
        MyTextOpened = False
        If Not MyMarkedString Then MyFileCopy += part
    ElseIf MyTextOpened Then
        ' Text inside <w:t>: check whether a [ ... ] block opens or closes here
        If Not MyMarkedString Then
            If part.Contains("[") AndAlso Not part.Contains("]") Then MyMarkedString = True
        Else
            If part.Contains("]") AndAlso Not part.Contains("[") Then MyMarkedString = False
        End If
        MyFileCopy += part
    ElseIf Not MyMarkedString Then
        ' Everything outside <w:t> is copied, except while a marked block is open
        MyFileCopy += part
    End If
Next
File.WriteAllText(pCopyName, MyFileCopy)
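
For comparison, the XmlDocument-style idea from the question could look roughly like the sketch below, written with LINQ to XML instead of a regex pass. It is only an illustration under several assumptions: the names MergeMarkedRuns, HasOpenBlock and MergeBracketRuns are made up for this example, the caller supplies the namespace bound to the w prefix, runs are direct children of w:p, and a [ ... ] block never spans two paragraphs.

```vbnet
Imports System.Linq
Imports System.Xml.Linq

Module MergeMarkedRuns
    ' True when the text has a "[" that is not yet followed by a "]".
    Private Function HasOpenBlock(text As String) As Boolean
        Return text.LastIndexOf("["c) > text.LastIndexOf("]"c)
    End Function

    ' Merges runs so that each [ ... ] block within a paragraph sits in one <w:t>.
    ' wNs is the namespace bound to the "w" prefix (an input supplied by the caller).
    Public Sub MergeBracketRuns(doc As XDocument, wNs As XNamespace)
        For Each para As XElement In doc.Descendants(wNs + "p").ToList()
            Dim target As XElement = Nothing ' <w:t> that holds the block being merged
            For Each run As XElement In para.Elements(wNs + "r").ToList()
                Dim runText As String = String.Concat(
                    run.Elements(wNs + "t").Select(Function(t As XElement) t.Value))
                If target Is Nothing Then
                    If HasOpenBlock(runText) Then
                        target = run.Elements(wNs + "t").Last()
                        ' Keep leading/trailing spaces of the merged text
                        target.SetAttributeValue(XNamespace.Xml + "space", "preserve")
                    End If
                Else
                    ' Move this run's text into the opening run, then drop the run
                    ' (its own formatting and any non-text children are lost).
                    target.Value &= runText
                    run.Remove()
                    If Not HasOpenBlock(target.Value) Then target = Nothing
                End If
            Next
        Next
    End Sub
End Module
```

A caller would load the file with XDocument.Load(pFileName, LoadOptions.PreserveWhitespace), call MergeBracketRuns with the document and its w namespace, and write the result with doc.Save(pCopyName). Unlike the text-based loop, this keeps the XML well-formed by construction, at the cost of discarding the formatting of the merged runs.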


不知在何时 2024-08-19 07:05:31


May I suggest another way: read the XML as a plain string, remove all XML elements, and check the resulting string.

Imports System.IO
Imports System.Text.RegularExpressions

Dim readFile As String = File.ReadAllText("yourPathToFile.doc")
' Strip tags; this pattern only matches tags without attributes
readFile = Regex.Replace(readFile, "<[a-zA-Z0-9/:]+>", String.Empty)

' Find bracketed blocks made up of letters and digits only
For Each foundPart As Match In Regex.Matches(readFile, "\[[a-zA-Z0-9]+\]")
    ' do something here with the things we found
Next

Some additional things might be needed, e.g. replacing spaces.

Edit: Yes, I know the regex is far from perfect for this...
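
As a rough sketch of those additional steps, the variant below also strips tags that carry attributes, allows characters such as _ and . inside the brackets, and removes the whitespace left behind by the tags. The names BracketBlocks and FindBracketBlocks are hypothetical, and the sketch assumes that whitespace inside a marker is never significant and that XML entities such as &amp;amp; do not need decoding.

```vbnet
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module BracketBlocks
    ' Returns every [ ... ] block found in the text content of an XML string.
    Public Function FindBracketBlocks(xml As String) As List(Of String)
        ' Drop every tag, including ones with attributes such as <w:t xml:space="preserve">
        Dim textOnly As String = Regex.Replace(xml, "<[^>]+>", String.Empty)

        Dim blocks As New List(Of String)
        ' A block is "[" followed by any characters except brackets, then "]"
        For Each m As Match In Regex.Matches(textOnly, "\[[^\[\]]*\]")
            ' Remove the whitespace left where tags and line breaks used to be
            blocks.Add(Regex.Replace(m.Value, "\s+", String.Empty))
        Next
        Return blocks
    End Function
End Module
```

Calling FindBracketBlocks(File.ReadAllText("yourPathToFile.doc")) on the split-run example from the question should yield the single entry [SOME_TEXT.SOME_OTHER_TEXT]. Note that this only extracts the blocks; it does not write a repaired document.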

Edit2: RegEx to remove XML Tags with content
