Distinguishing the table of contents in a Word document
Does anyone know how, when programmatically iterating through a Word document, you can tell whether a paragraph forms part of a table of contents (or indeed, anything else that forms part of a field)?
My reason for asking is that I have a VB program that is supposed to extract the first couple of paragraphs of substantive text from a document. It does so by iterating through the Word.Paragraphs collection. I don't want the results to include tables of contents or other fields; I only want what a human being would recognize as a header, title, or normal text paragraph. However, it turns out that if there's a table of contents, then not only the table of contents itself but EVERY line in it appears as a separate item in Word.Paragraphs. I don't want these, but I haven't been able to find any property on the Paragraph object that would let me distinguish and so ignore them. (I'm guessing the solution needs to apply to other field types too, such as a table of figures or table of authorities, which I haven't actually encountered yet but which would presumably cause the same problem.)
Comments (4)
Because of the limitations in the Word object model, I think the best way to achieve this would be to temporarily remove the TOC field, iterate through the Word document to extract the paragraphs, and then re-insert the TOC.
If you are coding in .NET, this should translate pretty closely from VBA. Also, this should work for Word 2003 and earlier as is, but in Word 2007/2010 the TOC, depending on how it was created, sometimes has a Content Control-like region surrounding it that may require you to write additional detection and removal code.
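A minimal VBA sketch of the remove, iterate, and re-insert idea follows. It is an illustration under stated assumptions, not the answerer's original code: it handles only the first TOC field, works on `ActiveDocument`, and uses the clipboard to hold the field, so nothing else may be copied while it runs. The procedure name `ExtractParagraphsWithoutToc` is hypothetical.

```vba
Option Explicit

' Sketch only: handles the first TOC field and relies on the clipboard.
Sub ExtractParagraphsWithoutToc()
    Dim doc As Document
    Dim fld As Field
    Dim para As Paragraph
    Dim tocStart As Long
    Dim tocRemoved As Boolean

    Set doc = ActiveDocument

    ' Find the first TOC field, remember where it starts, and cut it.
    For Each fld In doc.Fields
        If fld.Type = wdFieldTOC Then
            fld.Select
            tocStart = Selection.Start
            fld.Cut
            tocRemoved = True
            Exit For
        End If
    Next fld

    ' With the TOC gone, its entries no longer show up in Paragraphs.
    For Each para In doc.Paragraphs
        Debug.Print para.Range.Text
    Next para

    ' Paste the TOC back at the position it was cut from.
    If tocRemoved Then
        doc.Range(tocStart, tocStart).Paste
    End If
End Sub
```

Replace the `Debug.Print` line with whatever extraction logic you need. If the document should not be left modified, close it without saving instead of pasting the field back.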
Checking whether a paragraph's style name starts with "TOC" is not guaranteed to work, but if the standard Word styles are being used for the TOC (highly likely), and if no one has added their own style prefixed with "TOC", then it is OK. This is a crude approach, but workable.
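As a rough sketch of that style-prefix check, assuming an English-language Word installation where the built-in styles are named "TOC 1" to "TOC 9" and "TOC Heading" (localized versions use different names), something like the following could be used. The function name `IsTocParagraph` is hypothetical.

```vba
Option Explicit

' Returns True when the paragraph's style name starts with "TOC".
' Assumption: English built-in style names; custom styles that
' happen to start with "TOC" are also matched.
Function IsTocParagraph(ByVal para As Paragraph) As Boolean
    Dim sty As Style
    Set sty = para.Style
    IsTocParagraph = (UCase$(Left$(sty.NameLocal, 3)) = "TOC")
End Function

' Example: print every paragraph that is not styled as a TOC entry.
Sub PrintNonTocParagraphs()
    Dim para As Paragraph
    For Each para In ActiveDocument.Paragraphs
        If Not IsTocParagraph(para) Then
            Debug.Print para.Range.Text
        End If
    Next para
End Sub
```

Tables of figures and tables of authorities use other built-in style names, so the prefix test would need extending to cover them.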
What you could do is create a custom style for each section of your document.
Custom styles in Word 2003 (not sure which version of Word you're using)
Then, when iterating through your paragraph collection, you can check each paragraph's .Style property and safely ignore the paragraph if the style equals your TOCStyle.
I believe the same technique would work fine for Tables as well.
The following Function will return a Range object that begins after any Table of Contents or Table of Figures. You can then use the Paragraphs property of the returned Range:
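One way such a function might look is sketched below. This is an illustration rather than the answerer's original code, and it assumes that every table of contents and table of figures sits at the front of the document; a table placed at the end would leave an almost empty range. The function name `RangeAfterTables` is hypothetical.

```vba
Option Explicit

' Returns the part of the document that follows the last
' table of contents or table of figures.
' Assumption: all such tables sit before the body text.
Function RangeAfterTables(ByVal doc As Document) As Range
    Dim toc As TableOfContents
    Dim tof As TableOfFigures
    Dim lastEnd As Long

    lastEnd = doc.Content.Start

    For Each toc In doc.TablesOfContents
        If toc.Range.End > lastEnd Then lastEnd = toc.Range.End
    Next toc

    For Each tof In doc.TablesOfFigures
        If tof.Range.End > lastEnd Then lastEnd = tof.Range.End
    Next tof

    Set RangeAfterTables = doc.Range(lastEnd, doc.Content.End)
End Function

' Example: iterate only the paragraphs after the tables.
Sub PrintBodyParagraphs()
    Dim para As Paragraph
    For Each para In RangeAfterTables(ActiveDocument).Paragraphs
        Debug.Print para.Range.Text
    Next para
End Sub
```

Note that a title or heading placed before the table of contents is skipped by this approach, so it suits documents where the substantive text begins after the tables.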