How do I prepare a Word 2007 document so that C# can extract data from it semantically?

Posted 2024-09-13 17:00:14

I have a friend who is writing a 400-page book in Microsoft Word 2007.

Throughout the book he has 200 stories, each of which consists of numerous paragraphs.

When he is finished writing the book, he wants to copy the text of each story that is embedded in his Word document into a database table such as:

Title, varchar(200)
Description, text
Content, text

We do not want to copy and paste each story into the database by hand; we want a program that automatically pulls the marked-up data from the Word file into the appropriate database fields.

What does he have to do in Microsoft Word to denote each group of paragraphs as "story content", each title as a "story title", and so on? A prerequisite is that this markup must not be visible in the document. I know that Word 2007 files are basically zipped XML files, so I assume this is possible, and I assume that stylesheets are what we need. But how exactly do I need to prepare the Word document so that, as he adds stories, they are properly marked up?
I assume that the new COM Interop features of C# 4.0 are what I need to analyze the Word file and retrieve only the title, description, and content of the embedded stories, but how do I do this technically? Does anyone have examples?

Does anyone have experience with a project like this (reading Microsoft Word as a semantic data file) that they could share?

冷清清 2024-09-20 17:00:14

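What I would do is use styles. Define one style for each type of content, and write a macro that walks through the document paragraph by paragraph and outputs the corresponding text files.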
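The grouping step of that approach can be sketched as follows. This is a minimal illustration in Python rather than a Word macro, and it assumes the paragraphs have already been read out as (style name, text) pairs. The style names `StoryTitle`, `StoryDescription` and `StoryContent` are hypothetical; any names the author defines in Word would work the same way.

```python
# Hypothetical style names: replace them with the styles defined in the Word document.
FIELD_BY_STYLE = {
    "StoryTitle": "title",
    "StoryDescription": "description",
    "StoryContent": "content",
}


def group_by_style(paragraphs):
    """Turn (style_name, text) pairs into one record per story.

    A paragraph in the title style starts a new story. Paragraphs in
    styles that are not listed in FIELD_BY_STYLE are ignored.
    """
    stories = []
    for style, text in paragraphs:
        field = FIELD_BY_STYLE.get(style)
        if field is None:
            continue  # not part of any story
        if field == "title":
            stories.append({"title": text, "description": "", "content": ""})
        elif stories:
            current = stories[-1]
            # Join consecutive paragraphs of the same field with a newline.
            current[field] = current[field] + "\n" + text if current[field] else text
    return stories
```

Each resulting record has the same three fields as the database table in the question, so it could be written out as a text file or inserted as a row.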

dawn曙光 2024-09-20 17:00:14

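Well, this problem can be solved in several ways.

First, I suggest you save the file as *.txt so that you only have plain text to parse.

Then your friend has to be consistent while writing, because what you are going to build (a text parser) depends on that consistency.

Establish some rules, for example:

the title on the first line, followed by 2 line breaks;
all paragraphs separated by 1 line break;
3 line breaks after the last paragraph;

After that, load the file and parse it using the rules above.

{enjoy}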
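A parser for the rules above could look roughly like this. It is a sketch in Python, assuming the text has already been loaded into a string and that "line break" means a single newline character; the function name `parse_stories` is made up for illustration.

```python
def parse_stories(text):
    """Split plain text into stories using the line-break rules.

    Assumed layout: title, 2 line breaks, paragraphs separated by
    1 line break, then 3 line breaks before the next story.
    """
    # Word on Windows writes CRLF line endings, so normalize them first.
    text = text.replace("\r\n", "\n").strip("\n")
    stories = []
    for block in text.split("\n\n\n"):
        block = block.strip("\n")
        if not block:
            continue
        # The first double line break separates the title from the body.
        title, _, body = block.partition("\n\n")
        paragraphs = [p for p in body.split("\n") if p.strip()]
        stories.append({"title": title.strip(), "paragraphs": paragraphs})
    return stories
```

The rules only cover a title and paragraphs, so the description field from the question would need an extra rule of its own.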

世态炎凉 2024-09-20 17:00:14

The following is the XML of a docx document that contains a heading with the word "Title" and two paragraphs with the word "Content". Study a sample file of the novel while your friend is writing it, use a uniform format for all heading and paragraph elements, and you will be able to parse it pretty easily. The content is in word/document.xml inside the zipped docx file.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document
    xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006"
    xmlns:o="urn:schemas-microsoft-com:office:office"
    xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
    xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
    xmlns:v="urn:schemas-microsoft-com:vml"
    xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
    xmlns:w10="urn:schemas-microsoft-com:office:word"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml">
  <w:body>
    <w:p w:rsidR="005C78DC" w:rsidRDefault="00350339" w:rsidP="00350339">
      <w:pPr>
        <w:pStyle w:val="Heading1"/>
      </w:pPr>
      <w:r>
        <w:t>Title</w:t>
      </w:r>
    </w:p>
    <w:p w:rsidR="00350339" w:rsidRDefault="00350339" w:rsidP="00350339">
      <w:r>
        <w:t>Content</w:t>
      </w:r>
    </w:p>
    <w:p w:rsidR="00350339" w:rsidRPr="00350339" w:rsidRDefault="00350339" w:rsidP="00350339">
      <w:r>
        <w:t>Content</w:t>
      </w:r>
    </w:p>
    <w:sectPr w:rsidR="00350339" w:rsidRPr="00350339" w:rsidSect="005C78DC">
      <w:pgSz w:w="12240" w:h="15840"/>
      <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/>
      <w:cols w:space="720"/>
      <w:docGrid w:linePitch="360"/>
    </w:sectPr>
  </w:body>
</w:document>
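Reading that structure programmatically might look like the sketch below. It is written in Python with only the standard library rather than C#, and it assumes, as in the sample, that story titles carry the paragraph style ID `Heading1` and that everything up to the next heading is story content. The function names are made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace bound to the "w:" prefix in the sample document above.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_paragraphs(docx):
    """Return (style_id, text) for each top-level paragraph in a .docx.

    `docx` may be a file path or a file-like object. Paragraphs nested
    in tables are not visited by this sketch.
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    result = []
    for p in root.find(W + "body").findall(W + "p"):
        style = p.find(W + "pPr/" + W + "pStyle")
        style_id = style.get(W + "val") if style is not None else None
        # A paragraph's text may be split over several runs; join them.
        text = "".join(t.text or "" for t in p.iter(W + "t"))
        result.append((style_id, text))
    return result


def split_stories(paragraphs, title_style="Heading1"):
    """Start a new story at each title paragraph; collect the rest as content."""
    stories = []
    for style_id, text in paragraphs:
        if style_id == title_style:
            stories.append({"title": text, "content": []})
        elif stories and text.strip():
            stories[-1]["content"].append(text)
    return stories
```

Calling `split_stories(read_paragraphs("book.docx"))`, where `book.docx` is a placeholder file name, would give one record per heading. Note that `w:val` holds the style ID, which can differ from the style name shown in the Word user interface, so it is worth checking a sample file first.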
