Trying to convert an MSWord 2007 document to XML format

Posted 2024-10-26 21:45:43 | Length 1048 | Views 6 | Comments 0

I'm hoping I can forgo the history, but trust me on the following:

  1. I have several people who have immediate access to MSWord 2007
  2. We are trying to prep a generic Word document that can be passed from person to person
    over the course of several months and they can "add" new content to it.

Regardless of the answers below - the above will stay the same no matter how horrible an idea it is, or what better idea you may have... I've already been down this road :P.

  1. My 'thoughts' were to set up (within Word) an XML Schema so we could 'flag' the content for the specific content areas (e.g. item number, item description, item stem, item options, item answer, etc.)
  2. I taught myself XML schema in a little under 6 hours, and apparently I'm a horrible teacher: I have the XML Schema file, I have imported it into Word, I am able to flag the areas as per all the online tutorials...
  3. I was HOPING to save out to an "XML" file (from Word) and have it look like:
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

(just pulled that off a random site to demonstrate I wanted to save out from the word document the XML structure with the data filled in)

The hope was that I could then parse it with Python, or send the XML file to a vendor who could then upload the information into a database (and no - we can't just upload to the database - it has to go from the Word document to XML to the vendor).

The problem: Whenever I save the file to XML from MSWord 2007 it gives me all this horrible horrible XML crap all over the place. I've checked to see if I could parse that, hoping to find my XML tags embedded, and I do find them, but they're so garbled by all of Office's own tags/crap that parsing them out would be a huge waste of time.

Finally: How can I have Word automatically fill in the XML tags from a schema I develop (or can I just create a sample XML tree without the schema?) and export the contents ready for uploading/parsing? By "automatically" I understand that someone still has to "select the text" and "assign the XML"; I'm talking more about the 'saving' out to XML.

Thanks for reading my short novel :P (hope I was clear enough!)

-J

Comments (1)

方觉久 2024-11-02 21:45:43

If the data will be as uniform as the example you provided (i.e. just note elements with a fixed number of fields), you might be able to get away with having one big table in the Word document, with columns for to, from, heading, body, etc. Then you could parse it out in Python using one of the methods described in this question and output your custom XML. Since a .docx file is already a ZIP archive of XML parts, that may or may not make your job simpler.
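As a rough sketch of the table idea, the snippet below reads `word/document.xml` straight out of the .docx archive with the standard library and turns each table row into a `<note>` element. The function names `table_rows` and `rows_to_notes_xml`, the wrapping `<notes>` root, and the convention that the first table row holds the column headers are all assumptions made for illustration; header cells must also be valid XML element names for this to work.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def table_rows(docx):
    """Yield every table row in a .docx as a list of cell strings."""
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    for row in root.iter(W + "tr"):
        # A cell's text is spread over one or more <w:t> runs; join them.
        yield ["".join(t.text or "" for t in cell.iter(W + "t")).strip()
               for cell in row.findall(W + "tc")]


def rows_to_notes_xml(rows):
    """Assumed layout: first row is the header, each later row is one note."""
    rows = list(rows)
    header = [name.lower() for name in rows[0]]
    notes = ET.Element("notes")
    for row in rows[1:]:
        note = ET.SubElement(notes, "note")
        for name, value in zip(header, row):
            ET.SubElement(note, name).text = value
    return ET.tostring(notes, encoding="unicode")
```

With a hypothetical file name, usage would look like `print(rows_to_notes_xml(table_rows("items.docx")))`. Nested tables and cells wrapped in content controls are not handled here.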

If the data are going to be more complex, one idea might be using Word styles to map text to the correct tags. You could make a custom style for each tag, which would be quick and easy for the user to click (and perhaps have a different color and/or font). Then when parsing the document you could filter everything based on the paragraph style applied. I'm thinking this route would be painful, though.
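A minimal sketch of the style-based filtering, again using only the standard library. It assumes each tagged piece of text is its own paragraph with an explicit paragraph style, and that a hypothetical mapping from style IDs to tag names is supplied by the caller. Note that `document.xml` stores style IDs, which may differ from the names shown in Word's style gallery. The names `styled_paragraphs`, `items_xml`, and the `<items>`/`<item>` wrapper elements are illustrative, not anything Word defines.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def styled_paragraphs(docx):
    """Yield (style_id, text) for each paragraph with an explicit style."""
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    for para in root.iter(W + "p"):
        style = para.find(W + "pPr/" + W + "pStyle")
        if style is None:
            continue  # paragraph has no explicit style, so skip it
        text = "".join(t.text or "" for t in para.iter(W + "t"))
        yield style.get(W + "val"), text


def items_xml(pairs, style_to_tag, start_tag):
    """Group paragraphs into <item> elements; start_tag opens a new item."""
    root = ET.Element("items")
    item = None
    for style_id, text in pairs:
        tag = style_to_tag.get(style_id)
        if tag is None:
            continue  # styles outside the mapping are ignored
        if tag == start_tag or item is None:
            item = ET.SubElement(root, "item")
        ET.SubElement(item, tag).text = text
    return ET.tostring(root, encoding="unicode")
```

For example, with made-up style IDs: `items_xml(styled_paragraphs("items.docx"), {"ItemNumber": "number", "ItemStem": "stem"}, "number")`.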

Another option might be writing the document in a structured syntax like YAML, which is easy enough to read/write by hand and you could parse just from saving the file as plaintext, e.g.

# plaintext_export.txt
---
Notes:
- From: Somebody
  To: Somebody-else
  Heading: This is a heading
  Message: > 
    Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
    tempor incididunt ut labore et dolore magna aliqua. 

- From: Another guy
  To: Me
  Heading: Huh?
  Message: >
    Some other message content.

Parsing would be as simple as:

>>> import yaml
>>> from pprint import pprint
>>> with open("plaintext_export.txt", 'r') as f:
...     data = yaml.safe_load(f)  # plain yaml.load(f) needs an explicit Loader in PyYAML 6+
...
>>> pprint(data)
{'Notes': [{'From': 'Somebody',
            'Heading': 'This is a heading',
            'Message': 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. \n',
            'To': 'Somebody-else'},
           {'From': 'Another guy',
            'Heading': 'Huh?',
            'Message': 'Some other message content.\n',
            'To': 'Me'}]}
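
To get from the parsed structure to the kind of XML the question asks for, something along these lines might do. It is only a sketch: the function name `notes_to_xml`, the `<notes>` root element, and the choice to lowercase the keys and strip trailing whitespace are assumptions, and the keys must be valid XML element names.

```python
import xml.etree.ElementTree as ET


def notes_to_xml(data):
    """Turn {'Notes': [{...}, ...]} into a <notes><note>...</note></notes> string."""
    root = ET.Element("notes")
    for entry in data["Notes"]:
        note = ET.SubElement(root, "note")
        for key, value in entry.items():
            # Each mapping key becomes a child element; text is escaped by ElementTree.
            ET.SubElement(note, key.lower()).text = str(value).strip()
    return ET.tostring(root, encoding="unicode")
```

Calling `notes_to_xml(data)` on the dictionary loaded above returns a single XML string that can be written to a file and handed to the vendor.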