How to search/replace text in WordprocessingML
In WordprocessingML (the format MS Word documents are saved in), is there any way to search through the text easily?
The main problem I run into is that the WordprocessingML format breaks each paragraph down into "runs". For example:
To store the sentence "Module 1: Some Section Title", WordprocessingML uses the following XML markup:
<w:p w:rsidR="00F9529C" w:rsidRDefault="00F9529C" w:rsidP="00F9529C">
<w:pPr>
<w:pStyle w:val="Heading1_5019"/>
</w:pPr>
<w:bookmarkStart w:id="0" w:name="_Toc247333659"/>
<w:r>
<w:t>M</w:t>
</w:r>
<w:r w:rsidRPr="007D2739">
<w:t xml:space="preserve">odule 1: </w:t>
</w:r>
<w:r>
<w:t>Some Section Title</w:t>
</w:r>
<w:bookmarkEnd w:id="0"/>
</w:p>
As you can see, the sentence is broken into "M", "odule 1: " and "Some Section Title". This arrangement makes it impossible to search for the sentence as a whole. Is there any way to get around this?
To clarify, I am trying to do this in PHP using DomDocument.
Comments (2)
I've written some example code that shows how to search and replace text in an Open XML WordprocessingML document. My approach is: once you have found a paragraph that contains text that needs to be replaced, you break up all runs in the paragraph into runs of a single character. It is then straightforward to find the set of consecutive runs that match your search string. You can then create a new run with the replacement text and delete the single-character runs that match the search string. I've implemented this using XML DOM (using System.Xml.XmlDocument). You can find example code in a blog post, Search and Replace Text in an Open XML WordprocessingML document. In addition, I've recorded a short screencast that shows how the algorithm works: http://www.youtube.com/watch?v=w128hJUu3GM
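The blog post's code is not reproduced here, so the following is only a rough Python sketch of the approach described above, not the author's implementation. The function names (`split_runs`, `replace_first`) are made up for illustration, and the sketch assumes each run holds at most one w:t element, that runs are direct children of the paragraph, and that only the first match needs replacing. The same DOM steps should carry over to PHP's DOMDocument.

```python
import copy
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
ET.register_namespace("w", W)


def q(name):
    """Qualify a local name with the WordprocessingML namespace."""
    return f"{{{W}}}{name}"


def run_text(run):
    """Text of the run's first w:t element, or '' if it has none."""
    return run.findtext(q("t")) or ""


def split_runs(paragraph):
    """Replace each text run with one run per character, keeping its w:rPr."""
    for run in list(paragraph.findall(q("r"))):
        text = run_text(run)
        if not text:
            continue  # runs without text (breaks, drawings) are left alone
        index = list(paragraph).index(run)
        rpr = run.find(q("rPr"))
        paragraph.remove(run)
        for offset, ch in enumerate(text):
            single = ET.Element(q("r"), dict(run.attrib))
            if rpr is not None:
                single.append(copy.deepcopy(rpr))
            t = ET.SubElement(single, q("t"))
            t.text = ch
            t.set(XML_SPACE, "preserve")  # keep single-space runs intact
            paragraph.insert(index + offset, single)


def replace_first(paragraph, search, replacement):
    """Replace the first occurrence of `search`; return True if one was found."""
    if not search:
        return False
    split_runs(paragraph)
    # After splitting, text run i holds exactly character i of the paragraph.
    runs = [r for r in paragraph.findall(q("r")) if run_text(r)]
    start = "".join(run_text(r) for r in runs).find(search)
    if start < 0:
        return False
    matched = runs[start:start + len(search)]
    # The new run borrows the formatting of the first matched character.
    new_run = copy.deepcopy(matched[0])
    new_run.find(q("t")).text = replacement
    paragraph.insert(list(paragraph).index(matched[0]), new_run)
    for run in matched:
        paragraph.remove(run)
    return True
```

Note that this leaves the rest of the paragraph as single-character runs. That is valid markup, but you may want to merge runs with identical formatting afterwards to keep the file small.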
Yep, that's the pain of working directly with WordML versus, say, using the Word object model.
Unfortunately, I've found nothing that eases that (the Open XML SDK, Aspose, etc. all appear to essentially just wrap the WordML XML in a thin veneer).
You CAN do some limited preprocessing on the ML and strip out lots of stuff (like all those rsidRPr attributes, etc.), but it's still going to be tricky to strip out enough of the formatting elements to consistently be able to search the text.
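As a rough illustration of that kind of preprocessing, here is a Python sketch that drops the rsid* attributes and then merges neighbouring runs whose formatting is identical. The helper names (`strip_rsids`, `merge_adjacent_runs`) are hypothetical, and it assumes you don't need the revision identifiers and that a mergeable run contains only an optional w:rPr and a single w:t. Runs that really do differ in formatting stay separate, which is exactly why this only helps some of the time.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
ET.register_namespace("w", W)


def q(name):
    """Qualify a local name with the WordprocessingML namespace."""
    return f"{{{W}}}{name}"


def strip_rsids(root):
    """Delete w:rsidR, w:rsidRPr, w:rsidP, ... attributes everywhere."""
    for element in root.iter():
        for name in [n for n in element.attrib if n.startswith(q("rsid"))]:
            del element.attrib[name]


def is_plain_text_run(run):
    """True if the run holds exactly one w:t and nothing but w:rPr beside it."""
    tags = [child.tag for child in run]
    return tags.count(q("t")) == 1 and set(tags) <= {q("rPr"), q("t")}


def format_key(run):
    """Comparable summary of a run's attributes and run properties."""
    rpr = run.find(q("rPr"))
    props = "" if rpr is None else ET.canonicalize(
        ET.tostring(rpr, encoding="unicode"), strip_text=True)
    return (sorted(run.attrib.items()), props)


def merge_adjacent_runs(paragraph):
    """Merge directly adjacent text runs that share the same formatting."""
    previous = None
    for child in list(paragraph):
        if child.tag != q("r") or not is_plain_text_run(child):
            previous = None  # bookmarks, fields, etc. break the sequence
            continue
        if previous is not None and format_key(previous) == format_key(child):
            target = previous.find(q("t"))
            target.text = (target.text or "") + (child.findtext(q("t")) or "")
            target.set(XML_SPACE, "preserve")
            paragraph.remove(child)
        else:
            previous = child
```

On the paragraph from the question, calling `strip_rsids` followed by `merge_adjacent_runs` would collapse the three runs into one, because none of them carries a w:rPr.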
Alternatively, you could use XPath to extract JUST the w:t elements, then string them all together and search the result, but then you've got the problem of knowing where in the document the text you found actually lives.
If you don't care about that (for instance, if you're just data mining), then that might be the fastest solution.
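A minimal Python sketch of that last idea, under a few assumptions: the function names are hypothetical, the text is joined per paragraph rather than across the whole document (so a match at least tells you which w:p it came from), and tabs and line breaks are ignored. In a .docx package the main body is stored in word/document.xml inside the zip file. In PHP the equivalent would presumably be a DOMXPath query for w:t nodes.

```python
import xml.etree.ElementTree as ET
import zipfile

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def load_document_xml(docx_path):
    """Parse the main document part out of a .docx package."""
    with zipfile.ZipFile(docx_path) as package:
        return ET.fromstring(package.read("word/document.xml"))


def paragraph_texts(root):
    """Yield (paragraph element, text of all its w:t elements joined)."""
    for paragraph in root.iter(f"{{{W}}}p"):
        pieces = (t.text or "" for t in paragraph.iterfind(".//w:t", NS))
        yield paragraph, "".join(pieces)


def find_paragraphs(root, needle):
    """Return every paragraph whose joined text contains `needle`."""
    return [p for p, text in paragraph_texts(root) if needle in text]
```

This finds the heading from the question when searching for "Module 1: Some Section Title", but it only tells you which paragraph matched, not which runs, so it is good for searching and not much help for replacing.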