Programmatically retrieving MergeField values from mail-merged Word documents
I have a large collection of MS Word documents (approximately 40,000), which are the results of mail merges (same main document, different data sources).
One of the merge fields is a text field which could have the text "Yes" or "No".
Is there an easy way to list which of the documents have that merge field set to the value "Yes"? (I'm expecting approximately 10,000 "Yes" documents.)
I'd be interested in any approach, whether using Word itself, Office Automation, hexdumping the binary files and grepping for certain magic, or any ready-made tools (Perl scripts, .NET apps, etc.) which can do this sort of thing.
The files are on a network share accessible from both Linux and Windows boxes (and I can probably steal a Mac for a little while if necessary), so I'm not too worried about which platform the tools run on...
Comments (1)
If they were Word 2007 documents it'd be much easier, as the file format is XML-based (a .docx file is a ZIP package of XML parts). (Even with Word 2003 you can save as an XML document, though it's not the default.) I assume however that these are standard Word 2003 documents using the default (binary) file format.
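For the .docx case, a minimal sketch using only the Python standard library might look like the following. It assumes the value appears as ordinary text right after a fixed label in the main document part; the label "Are you serious:" and the function names are made up for illustration, and text in headers, footers or text boxes is not handled.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements inside a .docx package.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(path):
    """Return the text of word/document.xml, one line per paragraph.

    `path` may be a file name or a binary file-like object.
    """
    with zipfile.ZipFile(path) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is usually split across several <w:t> runs.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def value_after_label(text, label):
    """Return "Yes" or "No" if one of them follows `label`, else None."""
    match = re.search(re.escape(label) + r"\s*(Yes|No)\b", text)
    return match.group(1) if match else None
```

Calling `value_after_label(docx_text(some_path), "Are you serious:")` for each file would then give the value to filter on.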
I believe that there are tools out there which can process the binary file format directly, and which might be able to convert the docs into text files which you could then process - presumably you could search for some text appearing just before the field, e.g. "Are you serious:".
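As a rough sketch of that second step, suppose each document has already been converted to a `.txt` file by some converter (antiword and catdoc are two tools of this kind, though neither is tested here against these files). The directory layout, the label and the function name below are assumptions for illustration.

```python
import re
from pathlib import Path


def yes_documents(text_dir, label, wanted="Yes"):
    """List the .txt files in `text_dir` where `wanted` follows `label`."""
    # Capture the first word after the label, allowing for whitespace.
    pattern = re.compile(re.escape(label) + r"\s*(\w+)")
    hits = []
    for txt in sorted(Path(text_dir).glob("*.txt")):
        content = txt.read_text(encoding="utf-8", errors="replace")
        match = pattern.search(content)
        if match and match.group(1) == wanted:
            hits.append(txt.name)
    return hits
```

Files where the label is not found at all are silently skipped, so it may be worth counting those separately to spot conversion problems.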
However, the simplest way (but the slowest, in terms of execution time) would be to write a VBA program to open each doc, search for the field, and extract the result. It'd be pretty straightforward VBA, and you could do it in Word itself (which would mean that the code could use the existing running instance of Word). I'd say you could get that up and running in a couple of hours - then you could put your feet up for a few more hours while it did its work :-)
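The same loop can be driven from outside Word through Office Automation. Below is a sketch in Python rather than VBA, assuming a Windows machine with Word installed and the third-party pywin32 package; the field name passed in and the function names are hypothetical. It also assumes the merged documents still contain live MERGEFIELD fields. If Word replaced them with plain text when the merge was completed, reading `doc.Content.Text` and searching after a label would be the fallback.

```python
from pathlib import Path

WD_FIELD_MERGE_FIELD = 59  # WdFieldType.wdFieldMergeField in Word's object model


def merge_field_result(doc, field_name):
    """Return the result text of the first MERGEFIELD whose code mentions
    `field_name`, or None if the document has no such field."""
    for field in doc.Fields:
        if field.Type == WD_FIELD_MERGE_FIELD and field_name in field.Code.Text:
            return field.Result.Text.strip()
    return None


def scan_folder(folder, field_name, wanted="Yes"):
    """Open every .doc in `folder` in Word and list those whose field equals `wanted`."""
    import win32com.client  # pywin32; imported lazily so the helper above works without it

    # DispatchEx starts a separate Word instance, so Quit() below does not
    # close a copy of Word the user already has open.
    word = win32com.client.DispatchEx("Word.Application")
    word.Visible = False
    hits = []
    try:
        for path in sorted(Path(folder).glob("*.doc")):
            doc = word.Documents.Open(str(path.resolve()), ReadOnly=True)
            try:
                if merge_field_result(doc, field_name) == wanted:
                    hits.append(path.name)
            finally:
                doc.Close(False)  # False: do not save changes
    finally:
        word.Quit()
    return hits
```

Keeping the field lookup in its own function means it can be checked against stand-in objects before committing to a multi-hour run over all 40,000 files.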