How do I prepare a Word 2007 document so that C# can extract data from it semantically?

Posted 2024-09-13 17:00:14

I have a friend who is writing a 400-page book in Microsoft Word 2007.

Throughout the book he has 200 stories, each of which consists of numerous paragraphs.

When he is finished writing the book, he wants to copy the text of each story that is embedded in his Word document into a database table such as:

Title, varchar(200)
Description, text
Content, text

We do not want to have to copy and paste each story into the database but want to have a program automatically pull the marked up data from the Word file into the appropriate fields in the database.

  1. What does he have to do in Microsoft Word to denote each group of paragraphs as "story content", each title as a "story title", etc.? A prerequisite is that this markup cannot be visible in the document. I know that Word 2007 files are basically zipped XML files, so I assume this is possible, and I assume that stylesheets are what we need, but how exactly do I need to prepare the Word document so that stories are properly marked up as he adds them?

  2. I assume that the new COM Interop features of C# 4.0 are what I need to analyze the Word file and retrieve only the title, description, and content from the embedded stories, but how do I do this technically? Does anyone have examples?

Does anyone have experience with a project like this (reading Microsoft Word as a semantic data file) that they could share?

Comments (4)

冷清清 2024-09-20 17:00:14

What I would do is use styles. Have one style for each type of content, and write a macro that traverses your document paragraph-by-paragraph and spits out the corresponding text file.
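
The traversal itself is simple once each paragraph's style is known. Here is a minimal sketch in Python rather than a Word macro, assuming the paragraphs have already been read out as (style name, text) pairs in document order. The style names `StoryTitle`, `StoryDescription`, and `StoryContent` are hypothetical; use whatever names the author defines in Word.

```python
from dataclasses import dataclass, field

# Hypothetical style names; the answer only says "one style for each type of content".
TITLE_STYLE = "StoryTitle"
DESCRIPTION_STYLE = "StoryDescription"
CONTENT_STYLE = "StoryContent"


@dataclass
class Story:
    title: str
    description: list = field(default_factory=list)
    content: list = field(default_factory=list)


def group_stories(paragraphs):
    """Walk (style, text) pairs in document order and build one Story per title."""
    stories = []
    current = None
    for style, text in paragraphs:
        if style == TITLE_STYLE:
            # A title paragraph always starts a new story.
            current = Story(title=text)
            stories.append(current)
        elif current is None:
            continue  # text before the first title belongs to no story
        elif style == DESCRIPTION_STYLE:
            current.description.append(text)
        elif style == CONTENT_STYLE:
            current.content.append(text)
        # Any other style (chapter headings, captions, ...) is ignored.
    return stories
```

Each resulting `Story` maps directly onto one row of the Title / Description / Content table from the question; paragraphs in any other style are simply skipped, so the rest of the book does not need special markup.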

dawn曙光 2024-09-20 17:00:14

Okay, this can be resolved in numerous ways.

First of all, I would suggest that you save the file to a *.txt, to have some plain text to parse.

Then, your friend will have to be really consistent while writing, because what you will create (a text parser) will need consistency.

Make some rules like:

  1. Title on first line, then 2 linebreaks;
  2. All the paragraphs separated with 1 linebreak;
  3. Then 3 linebreaks after the last paragraph;

After that, load the file, and parse it using the rules above.
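
A parser for those three rules could look roughly like this. It is only a sketch in Python, and it assumes that "N linebreaks" means N consecutive newline characters in the saved .txt file; the function name `parse_stories` is made up for the example.

```python
def parse_stories(text):
    """Split plain text into (title, paragraphs) tuples using the three rules above."""
    text = text.replace("\r\n", "\n")  # normalise Windows line endings
    stories = []
    for chunk in text.split("\n\n\n"):  # rule 3: three linebreaks end a story
        chunk = chunk.strip("\n")
        if not chunk:
            continue  # skip the empty piece after the last story
        # Rule 1: the title is everything before the first pair of linebreaks.
        title, _, body = chunk.partition("\n\n")
        # Rule 2: paragraphs are separated by a single linebreak.
        paragraphs = [p for p in body.split("\n") if p]
        stories.append((title, paragraphs))
    return stories
```

Note that this only recovers a title and a list of paragraphs; a separate Description field would need one more rule in the text format.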

{enjoy}

世态炎凉 2024-09-20 17:00:14

Following is the XML for a docx document that contains a heading with the word "Title" and two paragraphs with the word "Content". Study a sample file of the novel while your friend is writing it, use a uniform format for all heading and paragraph elements, and you will be able to parse it pretty easily. The content is in word/document.xml inside the zipped docx file.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006"
            xmlns:o="urn:schemas-microsoft-com:office:office"
            xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
            xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
            xmlns:v="urn:schemas-microsoft-com:vml"
            xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
            xmlns:w10="urn:schemas-microsoft-com:office:word"
            xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml">
  <w:body>
    <w:p w:rsidR="005C78DC" w:rsidRDefault="00350339" w:rsidP="00350339">
      <w:pPr>
        <w:pStyle w:val="Heading1"/>
      </w:pPr>
      <w:r>
        <w:t>Title</w:t>
      </w:r>
    </w:p>
    <w:p w:rsidR="00350339" w:rsidRDefault="00350339" w:rsidP="00350339">
      <w:r>
        <w:t>Content</w:t>
      </w:r>
    </w:p>
    <w:p w:rsidR="00350339" w:rsidRPr="00350339" w:rsidRDefault="00350339" w:rsidP="00350339">
      <w:r>
        <w:t>Content</w:t>
      </w:r>
    </w:p>
    <w:sectPr w:rsidR="00350339" w:rsidRPr="00350339" w:rsidSect="005C78DC">
      <w:pgSz w:w="12240" w:h="15840"/>
      <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/>
      <w:cols w:space="720"/>
      <w:docGrid w:linePitch="360"/>
    </w:sectPr>
  </w:body>
</w:document>
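
To show what "parse it pretty easily" might mean in practice, here is a rough sketch that lists each paragraph of such a file together with its style id. It uses Python's standard library only to keep the example short (the same walk can be done from C#); the helper names `read_document_xml` and `paragraphs_with_styles` are made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def read_document_xml(docx_path):
    """Return the raw bytes of word/document.xml from a .docx (a zip archive)."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml")


def paragraphs_with_styles(document_xml):
    """Yield (style_id, text) per body paragraph; style_id is None without w:pStyle."""
    root = ET.fromstring(document_xml)
    for p in root.iterfind("w:body/w:p", NS):
        style = p.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W}}}val") if style is not None else None
        # A paragraph's text may be split over several runs, so join all w:t nodes.
        text = "".join(t.text or "" for t in p.iterfind(".//w:t", NS))
        yield style_id, text
```

For the XML above, `paragraphs_with_styles` yields the heading with style id `Heading1` and the two body paragraphs with `None`, since paragraphs in the default style carry no `w:pStyle` element.
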
筱果果 2024-09-20 17:00:14

Use Bookmarks for Start and Stop of Each Story

I strongly suggest this technique.

Mark the start and end of each "story" with Word's Bookmark feature. To see "bookmarks", go to Word Options, Advanced, Show document content, and check Show bookmarks.

Then just go through the document collecting the content between the bookmarks.
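
In a .docx file, a bookmark is stored as a `w:bookmarkStart` element (carrying `w:id` and `w:name`) and a matching `w:bookmarkEnd` with the same `w:id`. Collecting the content between them can be sketched as follows, in Python and working directly on the bytes of word/document.xml. The function name is made up, and skipping names that start with an underscore assumes those are Word's own hidden bookmarks rather than story markers.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def text_by_bookmark(document_xml):
    """Map each bookmark name to the text found between its start and end markers."""
    root = ET.fromstring(document_xml)
    open_names = {}  # bookmark id -> name, for bookmarks currently open
    collected = {}   # bookmark name -> list of text fragments
    for el in root.iter():  # iter() walks elements in document order
        if el.tag == f"{{{W}}}bookmarkStart":
            name = el.get(f"{{{W}}}name")
            if name and not name.startswith("_"):  # assumed: "_" marks hidden bookmarks
                open_names[el.get(f"{{{W}}}id")] = name
                collected.setdefault(name, [])
        elif el.tag == f"{{{W}}}bookmarkEnd":
            open_names.pop(el.get(f"{{{W}}}id"), None)
        elif el.tag == f"{{{W}}}p":
            # A new paragraph inside an open bookmark becomes a line break.
            for name in open_names.values():
                if collected[name]:
                    collected[name].append("\n")
        elif el.tag == f"{{{W}}}t":
            for name in open_names.values():
                collected[name].append(el.text or "")
    return {name: "".join(parts).rstrip("\n") for name, parts in collected.items()}
```

The returned dictionary is keyed by bookmark name, which is what makes it easy to carry the name over into the database.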

Fairly easy, and a technique I have been using since Word 6.x. The only issue is having to come up with 200 bookmark names. Yet this may be an advantage, because the bookmark name could be migrated to a "name" field in the database.

Using Styles to Mark Story Content

Another technique is to define a specific style or styles that make up the story. You then extract the text by style. This is a little harder and can be error-prone if the author is not disciplined.

Using Text Boxes That Contain Story Content

Lastly, if these "stories" can be placed into a "text box", you can simply extract the text boxes' content. The problem with this approach is the limitations of text boxes and the document layout changes they require, which the author may not want to apply.
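
For what it is worth, text box content sits inside `w:txbxContent` elements in word/document.xml, so pulling it out can be sketched like this (Python, standard library only, function name made up). This assumes a Word 2007 file; files saved by later versions may store each text box twice, once with a compatibility fallback, which this sketch would not deduplicate.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def text_box_contents(document_xml):
    """Return one string per w:txbxContent element, paragraphs joined by newlines."""
    root = ET.fromstring(document_xml)
    boxes = []
    for box in root.iter(f"{{{W}}}txbxContent"):
        # Each direct w:p child of the text box is one paragraph of the story.
        paragraphs = [
            "".join(t.text or "" for t in p.iterfind(".//w:t", NS))
            for p in box.iterfind("w:p", NS)
        ]
        boxes.append("\n".join(paragraphs))
    return boxes
```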

Notes

There are other ways, but the bookmark approach is the easiest to use and implement. I will try to respond to any comments/questions you have.
